jemalloc now on the trunk

Our Windows nightlies (beta4pre, this is not in beta 3) now include jemalloc. These builds are leaps and bounds better than the last build I posted.

Tons of amazing work has gone into this. I’d like to thank Jason for making all the crazy changes to jemalloc that we wanted and Ted for his days and days of crazy build work. Thanks to Benjamin for his work getting the CRT building initially; we wouldn’t be here without it. We’ve worked day and night for weeks to make this happen, and it is finally here.

Because building it requires Microsoft Visual Studio 2005 SP1 Professional, --enable-jemalloc is off by default in configure. If you have the right stuff installed, toss it in your mozconfig and magic should happen.

We’re still evaluating the switch on Mac and Linux, but you can use the same configure flag to build on those platforms.

24 thoughts on “jemalloc now on the trunk”

  1. Matt Doran

    How is it “leaps and bounds” better?

    Was there any movement in the memory size, fragmentation or performance since last time?

  2. Brian

    This is great news for us Firefox junkies and daily users. Thanks for all your hard work, guys; we look forward to a leaner, meaner FF3…

  3. pd

    Thanks for all the hard work everyone.

    I can’t wait to try this, so I will try tomorrow 🙂

    Any metrics available yet on performance improvements?

  4. bibo

    great work.

    I’m eager to see some numbers on pre-jemalloc builds and post-jemalloc.
    Could you sum up your last test results?

  5. Ryan Jones

    Could just be me but the build seems a whole bunch faster and runs easier. Playing with things like GMail and Digg actually seem to work faster.

    Nice work, let’s see if the actual statistics back this up!

  6. Wladimir Palant

    What is “the right stuff”? Could you tell what the dependencies on the Professional version are? A quick glance at the code didn’t reveal anything suspicious. I would like to enable that switch in my builds, would be nice to know what I need to install.

  7. Aaron

    Is there a possibility that jemalloc will only be turned on for one major platform (Windows) and not for the others (Mac/Linux)?

  8. pavlov Post author

    Matt: this should be both a bunch faster and a bunch smaller than before.

    Wladimir: “The right stuff” is basically Visual Studio 2005 SP1 Professional (or Enterprise). Express won’t work.

    Alex: Yeah, Express doesn’t have the source, and we can’t redistribute it, so Express is out of luck.

    Aaron: It is possible. We’re evaluating each platform on its own. Windows has very clear, huge wins. The Mac builds are a little bit faster, but the gain is not as significant. We’re still running Linux tests and should have some results there today.

  9. Cory Nelson

    Hey, just thought I’d report a small portability bug I found. Near the top inside a #ifdef MOZ_MEMORY_WINDOWS you have #define SIZE_T_MAX ULONG_MAX, this should be SIZE_MAX.
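To illustrate the portability point: on 64-bit Windows, `unsigned long` stays 32 bits while `size_t` is 64 bits, so `ULONG_MAX` understates the largest `size_t` value there. The sketch below is not jemalloc's code. `mul_overflows` is a hypothetical helper that shows the kind of overflow check where such a constant matters.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Fallback for platforms that do not already provide SIZE_T_MAX.
 * SIZE_MAX (from <stdint.h>) is always the largest size_t value,
 * whereas ULONG_MAX only matches it when unsigned long and size_t
 * happen to have the same width. */
#ifndef SIZE_T_MAX
#define SIZE_T_MAX SIZE_MAX
#endif

/* Hypothetical helper: returns 1 if n * size would overflow size_t.
 * A calloc-style allocator needs a check like this before multiplying. */
static int mul_overflows(size_t n, size_t size) {
    return size != 0 && n > SIZE_T_MAX / size;
}
```

If the limit were 32 bits wide on a platform with a 64-bit `size_t`, a check like this would reject large but perfectly valid requests.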

    I’m also wondering why you force a call to malloc_init on Windows. It would be very simple to make a basic spinlock using _InterlockedCompareExchange and _mm_pause.
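A minimal sketch of the spinlock idea, for readers following along. It uses C11 `<stdatomic.h>` instead of the `_InterlockedCompareExchange` and `_mm_pause` intrinsics named above so that it compiles outside MSVC; the compare-and-swap plays the same role. `example_init` and `ensure_init` are hypothetical names, not functions from jemalloc.

```c
#include <stdatomic.h>

static atomic_int init_lock = 0;  /* 0 = unlocked, 1 = held */
static atomic_int init_done = 0;  /* set to 1 once initialization has run */
static int init_calls = 0;        /* counts how often the init body ran */

/* Hypothetical stand-in for the allocator's one-time setup. */
static void example_init(void) {
    init_calls++;
}

/* Runs example_init() exactly once, even if several threads race here. */
static void ensure_init(void) {
    /* Fast path: already initialized, no locking needed. */
    if (atomic_load_explicit(&init_done, memory_order_acquire))
        return;

    /* Spin until we swap the lock from 0 to 1. */
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
               &init_lock, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        /* A failed CAS stores the observed value in expected; reset it.
         * On x86 a pause instruction would typically go here. */
        expected = 0;
    }

    /* Re-check under the lock: another thread may have finished first. */
    if (!atomic_load_explicit(&init_done, memory_order_relaxed)) {
        example_init();
        atomic_store_explicit(&init_done, 1, memory_order_release);
    }

    atomic_store_explicit(&init_lock, 0, memory_order_release);
}
```

Because the lock is a zero-initialized static, it needs no setup call of its own, which is the property that matters when the allocator must be usable before the CRT has finished initializing.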

  10. pavlov Post author

    Cory: I’ll make the SIZE_MAX change, thanks. As for the forced malloc_init, it has to do with the ordering of CRT init. We’ve got a patch to not do lazy malloc_init calls on Windows and Mac, which at least on Mac is about a 2% perf win. As for spinlocks, it might be interesting to try changing the locks around a bit to see how they affect performance. CRITICAL_SECTIONs give pretty good performance, but doing inlined ones in certain spots would almost certainly be better.

  11. Cory Nelson

    Critical sections will probably be better in most cases: on single-threaded systems they immediately do a kernel wait, and on multi-threaded systems they spin for a short time then enter a kernel wait.

    I only brought up spin locks regarding the ones being initialized as globals — you would only need one (for calling malloc_init) and it would only be called once, so the perf would not be an issue.

    In my testing, multi-threaded performance left much to be desired. Windows’ heaps did worse, but jemalloc still wasn’t very great. I was testing a worst-case scenario, but an operation that took 22 (cpu-time) seconds single-threaded took 75 seconds (cpu-time) using 4 threads on a quad core.
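The arithmetic behind those figures can be sketched as follows. This is one possible reading, not necessarily the metric used in the test: it assumes the 75 CPU-seconds were spread evenly over the 4 threads. Both function names are hypothetical.

```c
/* Fraction of the multi-threaded CPU time that was useful work,
 * taking the single-threaded CPU time as the baseline. */
static double cpu_efficiency(double single_cpu_s, double multi_cpu_s) {
    return single_cpu_s / multi_cpu_s;
}

/* Wall-clock speedup over the single-threaded run, assuming the
 * multi-threaded CPU time is split evenly across all threads. */
static double ideal_speedup(double single_cpu_s, double multi_cpu_s,
                            unsigned threads) {
    double wall_s = multi_cpu_s / (double)threads;
    return single_cpu_s / wall_s;
}
```

With 22 and 75 CPU-seconds on 4 threads, this gives an efficiency of about 0.29 and a speedup of about 1.17x under that even-split assumption.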

  12. pavlov Post author

    Cory: Not sure what options with jemalloc you were trying, but we’ve mostly disabled the bits that should make threaded things fast since we aren’t a very threaded app. We’re forcing narenas to 1 with the ’10n’ option. Getting rid of this should give you an arena per thread. Turning on lazy free and balancing should help too (I suspect they don’t work on Windows at the moment, as they both use TLS stuff that I didn’t change to use TlsGetValue/TlsSetValue). With those I would expect you to see much better wins. If you’ve got a test app, Jason and I would love to play with it. I know he’s spent a lot of time making heavily threaded apps work well on FreeBSD, but we haven’t spent any time on it on Windows. Send me mail or post a link in here.
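A rough model of how an option string like ’10n’ could pin the arena count to 1. It assumes each `n` halves the number of arenas, each `N` doubles it, and a decimal prefix repeats the flag that follows. This is a simplified illustration rather than jemalloc's actual parser: `model_narenas` is a hypothetical name, and starting from the CPU count is an assumption.

```c
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical model: returns the arena count after applying opts,
 * starting from one arena per CPU. Flags other than n/N are ignored. */
static unsigned model_narenas(const char *opts, unsigned ncpus) {
    int shift = 0;  /* net number of doublings (negative = halvings) */

    for (size_t i = 0; opts[i] != '\0'; i++) {
        /* Parse an optional decimal repetition count. */
        unsigned reps = 0;
        int seen_digit = 0;
        while (isdigit((unsigned char)opts[i])) {
            reps = reps * 10 + (unsigned)(opts[i] - '0');
            if (reps > 64)
                reps = 64;  /* clamp: more repetitions change nothing */
            seen_digit = 1;
            i++;
        }
        if (!seen_digit)
            reps = 1;
        if (opts[i] == '\0')
            break;  /* trailing count with no flag after it */

        if (opts[i] == 'n')
            shift -= (int)reps;
        else if (opts[i] == 'N')
            shift += (int)reps;
    }

    unsigned n = ncpus > 0 ? ncpus : 1;
    for (; shift > 0 && n <= UINT_MAX / 2; shift--)
        n *= 2;  /* double, stopping before overflow */
    for (; shift < 0 && n > 1; shift++)
        n /= 2;  /* halve, never dropping below one arena */
    return n;
}
```

Under this model, ten halvings reduce any realistic starting count to 1, which fits the description of ’10n’ forcing a single arena.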

  13. pd

    Stuart could you please confirm the build number just so we can be sure we’re testing the jemalloc-enabled version? I just downloaded build 2008020504, is that the one?

  14. alanjstr

    Gimme gimme gimme gimme! I don’t wanna wait until beta 4 :-(. But the only Firefox 3 I use is the portable version (currently b3). I’m excited for FF3, but I’ll have to get over some of the UI changes.

  15. Mark S

    So let’s say we decide not to use jemalloc on mac (or linux).
    Would we make this decision because the perf & size wins are ‘not enough’? If there is some, wouldn’t it still be worth it? Is there some other detractor about using it that would make it not worth turning on?
    Also, if it were not turned on for mac would an investigation ensue to use something else for bigger wins? (For that matter, if it *is* turned on will we still look for an alternative that gets better wins?)

  16. Ferdinand

    Does this mean that the newest nightlies already use jemalloc? So everybody with 2008020504 should see better performance and less memory usage in the Windows Task Manager? Or is it only used if you build it yourself?

  17. RyanVM

    pd, Ferdinand – If you’ve got mozcrt19.dll in your Minefield directory, you’ve got a jemalloc-enabled build.

    That said, Stuart, is there a bug filed for the DLL not having proper resource information set on it (version info and the like)? Also, I saw a comment on another blog (I think it was bsmedberg’s) about possibly linking the CRT with xul.dll. Is that a worthwhile possibility or a waste of time?

  18. Matt Doran

    Did you get a chance to update measurements of memory, performance and fragmentation?

    I’m really curious to see how much a difference these improvements made!

  19. Cory Nelson

    After playing around with the config string I got it up to about 70% scalability – a good 10% more than Windows’ heaps. The tricky thing was that despite being more scalable it performed poorly enough – contention or not – that overall it was still a big loss in throughput.

    The big perf issue appears to be in arena_purge – increasing the F flag fixed that nicely. Some scalability was lost and memory usage has increased very slightly, but overall performance is now 5-10% better than Windows’ heaps.

    I noticed JE removed lazy free completely from FBSD’s CVS, citing performance reasons. I have to agree – I ported it over and it has been a detriment in most ways I tried to use it.

