jemalloc now on the trunk
Our Windows nightlies (beta4pre, this is not in beta 3) now include jemalloc. These builds are leaps and bounds better than the last build I posted.
Tons of amazing work has gone into this. I’d like to thank Jason for making all the crazy changes to jemalloc that we wanted and Ted for his days and days of crazy build stuff. Thanks to Benjamin for his work getting the CRT building initially; we wouldn’t be here without it. We’ve worked day and night for weeks to make this happen and it is finally here.
Because building it requires Microsoft Visual Studio 2005 SP1 Professional, --enable-jemalloc is off by default in configure. If you have the right stuff installed, toss it in your mozconfig and magic should happen.
We’re still evaluating the switch on Mac and Linux, but you can use the same configure flag to build on those platforms.
February 5, 2008 at 3:04 am
How is it “leaps and bounds” better?
Was there any movement in the memory size, fragmentation or performance since last time?
February 5, 2008 at 3:20 am
This is great news for us Firefox junkies and daily users. Thanks for all your hard work guys, we look forward to a leaner, meaner FF3…
February 5, 2008 at 3:21 am
Thanks for all the hard work everyone.
I can’t wait to try this; I will try it tomorrow.
Any metrics available yet on performance improvements?
February 5, 2008 at 3:29 am
great work.
I’m eager to see some numbers on pre-jemalloc builds and post-jemalloc.
Could you sum up your last test results?
February 5, 2008 at 3:34 am
Could just be me, but the build seems a whole bunch faster and runs easier. Things like GMail and Digg actually seem to work faster.
Nice work, let’s see if the actual statistics back this up!
February 5, 2008 at 4:48 am
What is “the right stuff”? Could you tell what the dependencies on the Professional version are? A quick glance at the code didn’t reveal anything suspicious. I would like to enable that switch in my builds, would be nice to know what I need to install.
February 5, 2008 at 7:24 am
So VC2005 Express Edition’s left out for now?
February 5, 2008 at 10:04 am
Is there a possibility that jemalloc will only be turned on for one major platform (Windows) and not for the others (Mac/Linux)?
February 5, 2008 at 11:04 am
I’ll post a full set of numbers shortly. Still need to tune. I know of several tests which are ~70x faster now.
February 5, 2008 at 12:04 pm
Matt: this should be both a bunch faster and a bunch smaller than before.
Wladimir: “The right stuff” is basically Visual Studio 2005 SP1 Professional (or Enterprise). Express won’t work.
Alex: Yeah, Express doesn’t have the source, and we can’t redistribute it, so Express is out of luck.
Arron: It is possible. We’re evaluating each platform on its own. Windows has very clear, huge wins. The Mac builds are a little bit faster, but the gain is not as significant. We’re still running Linux tests and should have some results there today.
February 5, 2008 at 1:12 pm
Hey, just thought I’d report a small portability bug I found. Near the top inside a #ifdef MOZ_MEMORY_WINDOWS you have #define SIZE_T_MAX ULONG_MAX, this should be SIZE_MAX.
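To illustrate why this matters: on 64-bit Windows, unsigned long is 32 bits wide while size_t is 64 bits, so ULONG_MAX under-reports the largest possible size there. The following is only a sketch of the suggested definition, not the actual patch; the helper function name is made up for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* ULONG_MAX */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* SIZE_MAX */

/* Portable definition: SIZE_MAX is the maximum value of size_t. */
#ifndef SIZE_T_MAX
#define SIZE_T_MAX SIZE_MAX
#endif

/* Holds on every conforming platform, including 64-bit Windows. */
static_assert(SIZE_T_MAX == (size_t)-1, "SIZE_T_MAX must match size_t");

/* Hypothetical helper: returns nonzero if unsigned long is narrower than
 * size_t, i.e. if ULONG_MAX would have been the wrong limit to use. */
static int ulong_is_narrower_than_size_t(void)
{
    return ULONG_MAX < SIZE_MAX;
}
```

On 32-bit Windows and on typical 64-bit Linux or Mac builds the two limits coincide, which is probably why the difference is easy to miss.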
I’m also wondering why you force a call to malloc_init on Windows. It would be very simple to make a basic spinlock using _InterlockedCompareExchange and _mm_pause.
February 5, 2008 at 2:34 pm
Cory: I’ll make the SIZE_MAX change, thanks. As for the forced malloc_init, it has to do with the ordering of CRT init. We’ve got a patch to not do lazy malloc_init calls on Windows and Mac, which at least on Mac is about a 2% perf win. As for spinlocks, it might be interesting to try changing the locks around a bit to see how they affect performance. CRITICAL_SECTIONs give pretty good performance, but doing inlined ones in certain spots would almost certainly be better.
February 5, 2008 at 3:23 pm
Critical sections will probably be better in most cases: on single-processor systems they immediately do a kernel wait, and on multiprocessor systems they spin for a short time, then enter a kernel wait.
I only brought up spin locks regarding the ones being initialized as globals — you would only need one (for calling malloc_init) and it would only be called once, so the perf would not be an issue.
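A rough sketch of that idea, written with portable C11 atomics rather than the Windows intrinsics (an assumption made so it compiles anywhere; a Windows build might use _InterlockedCompareExchange for the lock and _mm_pause inside the wait loop). The names ensure_malloc_init and do_malloc_init are hypothetical stand-ins, not jemalloc functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Both objects are statically initialized, so the lock itself needs no
 * runtime setup before first use. */
static atomic_flag init_lock = ATOMIC_FLAG_INIT;
static atomic_bool initialized;   /* static storage: starts out false */
static int init_calls;            /* how many times the real init ran */

/* Hypothetical stand-in for the allocator's real initialization. */
static void do_malloc_init(void)
{
    assert(init_calls == 0);      /* must never run twice */
    init_calls++;
}

/* Call before touching allocator state; returns once init is complete. */
static void ensure_malloc_init(void)
{
    /* Fast path: already initialized, so no lock is taken. */
    if (atomic_load_explicit(&initialized, memory_order_acquire))
        return;

    /* Spin until this thread owns the lock. */
    while (atomic_flag_test_and_set_explicit(&init_lock, memory_order_acquire))
        ; /* busy-wait; a pause instruction would go here */

    /* Re-check under the lock: another thread may have won the race. */
    if (!atomic_load_explicit(&initialized, memory_order_relaxed)) {
        do_malloc_init();
        atomic_store_explicit(&initialized, true, memory_order_release);
    }
    atomic_flag_clear_explicit(&init_lock, memory_order_release);
}
```

Since the lock is only contended during the very first calls, spinning costs next to nothing after startup.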
In my testing, multi-threaded performance left much to be desired. Windows’ heaps did worse, but jemalloc still wasn’t very great. I was testing a worst-case scenario, but an operation that took 22 (cpu-time) seconds single-threaded took 75 seconds (cpu-time) using 4 threads on a quad core.
February 5, 2008 at 3:39 pm
Cory: Not sure what options with jemalloc you were trying, but we’ve mostly disabled the bits that should make threaded things fast since we aren’t a very threaded app. We’re forcing narenas to 1 with the ‘10n’ option. Getting rid of this should give you an arena per thread. You could also turn on lazy free and balancing, though I suspect they don’t work on Windows at the moment, as they both use TLS stuff that I didn’t change to use TlsGetValue/TlsSetValue. With those I would expect you to see much better wins. If you’ve got a test app, Jason and I would love to play with it. I know he’s spent a lot of time making heavily threaded apps work well on FreeBSD, but we haven’t spent any time on it on Windows. Send me mail or post a link in here.
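For anyone wondering how a string like ‘10n’ ends up meaning one arena, here is a toy sketch of the option handling. It is not the jemalloc source: it assumes that ‘n’ halves the arena count, ‘N’ doubles it, and a decimal prefix repeats the flag that follows. The function name and the default count passed in are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: derive the arena count from an option string.
 * Assumed semantics: 'n' halves the count, 'N' doubles it, and a decimal
 * prefix repeats the next flag. All other flags are ignored here. */
static unsigned narenas_from_opts(const char *opts, unsigned default_narenas)
{
    int lshift = 0;       /* net power-of-two adjustment */
    unsigned reps = 0;    /* pending repetition count */
    int have_reps = 0;

    assert(opts != NULL);
    for (size_t i = 0; opts[i] != '\0'; i++) {
        char c = opts[i];
        if (c >= '0' && c <= '9') {
            reps = reps * 10 + (unsigned)(c - '0');
            if (reps > 1000)
                reps = 1000;              /* clamp to avoid overflow */
            have_reps = 1;
            continue;
        }
        int n = have_reps ? (int)reps : 1;
        if (c == 'n')
            lshift -= n;
        else if (c == 'N')
            lshift += n;
        reps = 0;
        have_reps = 0;
    }

    unsigned narenas = default_narenas;
    if (lshift > 0) {
        /* Double, stopping before the top bit would be shifted out. */
        for (int k = 0; k < lshift && narenas <= (~0u >> 1); k++)
            narenas <<= 1;
    } else {
        /* Halve, but never go below one arena. */
        for (int k = 0; k > lshift && narenas > 1; k--)
            narenas >>= 1;
    }
    return narenas ? narenas : 1;
}
```

Under these assumptions, ten halvings push any realistic starting count down to the floor of a single arena, while an empty option string leaves the default alone.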
February 5, 2008 at 7:06 pm
Ah, I did not realize that. I will have to play with the options a bit more then.
February 5, 2008 at 7:35 pm
Stuart, could you please confirm the build number just so we can be sure we’re testing the jemalloc-enabled version? I just downloaded build 2008020504, is that the one?
February 5, 2008 at 9:15 pm
Gimme gimme gimme gimme! I don’t wanna wait until beta 4. But the only Firefox 3 I use is the portable version (currently b3). I’m excited for FF3, but I’ll have to get over some of the UI changes.
February 5, 2008 at 11:06 pm
So let’s say we decide not to use jemalloc on mac (or linux).
Would we make this decision because the perf & size wins are ‘not enough’? If there are some, wouldn’t it still be worth it? Is there some other drawback to using it that would make it not worth turning on?
Also, if it were not turned on for mac would an investigation ensue to use something else for bigger wins? (For that matter, if it *is* turned on will we still look for an alternative that gets better wins?)
February 6, 2008 at 5:19 am
Does this mean that the newest nightlies already use jemalloc? So everybody with 2008020504 should see better performance and less memory usage in the Windows Task Manager? Or is it only used if you build it yourself?
February 6, 2008 at 10:10 am
pd, Ferdinand – If you’ve got mozcrt19.dll in your Minefield directory, you’ve got a jemalloc-enabled build.
That said, Stuart, is there a bug filed for the DLL not having proper resource information set on it (version info and the like)? Also, I saw a comment on another blog (I think it was bsmedberg’s) about possibly linking the CRT with xul.dll. Is that a worthwhile possibility or a waste of time?
February 6, 2008 at 2:52 pm
mozcrt19.dll? Got it.
February 12, 2008 at 3:26 am
Did you get a chance to update measurements of memory, performance and fragmentation?
I’m really curious to see how much a difference these improvements made!
February 13, 2008 at 8:44 pm
Could somebody post a recommended testing procedure so we can compare the performance difference?
A bookmark file, a program to use, etc.
Thanks
February 20, 2008 at 3:14 pm
After playing around with the config string I got it up to about 70% scalability – a good 10% more than Windows’ heaps. The tricky thing was that despite being more scalable it performed poorly enough – contention or not – that overall it was still a big loss in throughput.
The big perf issue appears to be in arena_purge – increasing the F flag fixed that nicely. Some scalability was lost and memory usage has increased very slightly, but overall performance is now 5-10% better than Windows’ heaps.
I noticed JE removed lazy free completely from FBSD’s CVS, citing performance reasons. I have to agree – I ported it over and it has been a detriment in most ways I tried to use it.