Allocation Data

Lots of people have asked where most of the allocations in Mozilla come from. I’ve gzipped some DTrace output that shows the number of calls at each allocation size, grouped by call stacks 5 frames deep. Note: this log only shows allocations of 2048 bytes or less. The data is pretty raw, but if people want to take a look and have ideas for improving some of the code paths in question, that would be great.

An example:

  libSystem.B.dylib`malloc+0x37
  XUL`nsStringBuffer::Alloc(unsigned long)+0x15
  XUL`nsACString_internal::MutatePrep(unsigned int, char**,
                                      unsigned int*)+0xce
  XUL`nsACString_internal::ReplacePrep(unsigned int,
                                       unsigned int,
                                       unsigned int)+0x46
  XUL`nsACString_internal::Assign(char const*, unsigned int)+0xc8

       value  ------------- Distribution ------------- count
           4 |                                         0
           8 |@@@@@@@@@@@@@                            9724
          16 |@@@@@@@@@@@@@@@@                         11940
          32 |@@@@@@                                   4359
          64 |@                                        887
         128 |@@                                       1604
         256 |@                                        497
         512 |@                                        676
        1024 |                                         47
        2048 |                                         0

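This shows that there are 9724 allocations of 8 to 15 bytes, 11940 of 16 to 31 bytes, and so on. Each row counts sizes from its value up to, but not including, the next row's value.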
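DTrace's quantize() aggregation uses power-of-two rows, each labeled with its lower bound. As an aid to reading the histogram above, here is a minimal C++ sketch of that size-to-row mapping. quantizeBucket is a hypothetical name used for illustration, not DTrace code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns the histogram row a size lands in, i.e. the
// largest power of two that is <= size. A size of 0 stays in row 0.
// Illustrative only; this is not DTrace's implementation.
std::uint64_t quantizeBucket(std::uint64_t size) {
    if (size == 0) return 0;
    std::uint64_t bucket = 1;
    // Double while the next power of two still fits at or below size.
    while (bucket <= size / 2) {
        bucket *= 2;
    }
    return bucket;
}
```

So a 12-byte request is counted in the "8" row and a 20-byte request in the "16" row.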

Things to look for include:

  • Call sites with lots of allocations
  • Allocations that could live on the stack to avoid memory churn
  • Objects with well-understood lifetimes that we could allocate from pools
  • etc…
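For the pooling idea, here is a minimal sketch of a bump-pointer arena. It assumes many small objects that share one owner and can all be freed together. The Arena class is hypothetical and is not an existing Mozilla API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical bump-pointer arena: carves many small allocations out of a few
// large chunks and frees everything at once when the arena is destroyed.
// Only suitable for objects that do not outlive the arena.
class Arena {
public:
    explicit Arena(std::size_t chunkSize = 4096) : chunkSize_(chunkSize) {}
    ~Arena() {
        for (void* chunk : chunks_) std::free(chunk);
    }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Returns memory aligned for any fundamental type. There is no per-object
    // free; memory is reclaimed only when the arena dies.
    void* allocate(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        if (size == 0) size = 1;
        size = (size + align - 1) & ~(align - 1);  // round up to alignment
        if (size > remaining_) {
            std::size_t chunk = size > chunkSize_ ? size : chunkSize_;
            chunks_.reserve(chunks_.size() + 1);  // may throw before malloc
            void* mem = std::malloc(chunk);
            if (!mem) throw std::bad_alloc();
            chunks_.push_back(mem);  // cannot throw: capacity reserved above
            cursor_ = static_cast<char*>(mem);
            remaining_ = chunk;
        }
        void* result = cursor_;
        cursor_ += size;
        remaining_ -= size;
        return result;
    }

    std::size_t chunkCount() const { return chunks_.size(); }

private:
    std::size_t chunkSize_;
    std::vector<void*> chunks_;
    char* cursor_ = nullptr;
    std::size_t remaining_ = 0;
};
```

The trade-off is that individual objects cannot be freed early, so this only pays off when lifetimes really are well understood.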

Edit: I’ve also posted another log with stacks 8 frames deep, plus a log that only includes allocations made after startup (also 8 frames deep).

7 thoughts on “Allocation Data”

  1. Benjamin Smedberg

    Pav, is it possible to get the information for large allocs also? At the moz2 meetup today we were pondering whether pldhash/jsdhashtable allocations (which can be very large blocks of memory) could be causing unnecessary fragmentation, and whether a plhash (chaining) system might be better for certain allocation types.

  2. Boris

    COW is not the problem per se, really. The main callers of MutatePrep I see in the logs there, including startup, are:

    * CDATA section parsing (1500)
    * AppendASCIItoUTF16 (~1000)
    * nsCookieService::GetCookieInternal (~1500)
    * nsHttpHeaderArray::Flatten (~1200)
    * AppendUTF16toUTF8 (~2200)
    * AppendUTF8toUTF16 (~800)
    * nsStandardURL::BuildNormalizedSpec (~3000)
    * nsHttpHeaderArray::SetHeader (~8600)
    * nsStandardURL::BuildNormalizedSpec (~3500)
    * nsCacheService::CreateRequest (~2500)
    * nsStandardURL::SetRef (~6000)
    * nsStandardURL::Resolve (~700)
    * nsStandardURL::GetPath (~400)

    I sort of have to stop, and unfortunately I’ve mostly looked at pretty big allocations there (a lot of the URI code allocates in the 256+ byte range). For strings, it might be nice to do another log focusing on the 8-32 byte range.

    The SetRef calls are from LoadBindingDocumentInfo. I wonder why we even hit that code 6000 times… I guess we have a lot of bindings. 😦

    Could we only allocate once in BuildNormalizedSpec somehow?

    nsCacheService::CreateRequest should at the very least preallocate the right string size. Another option would be to use a key with two strings in it, not just a concatenation of the two strings as a single string key (and hope that if we use two strings they can both share their buffers with caller).
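Here is a minimal sketch of both options Boris mentions. It assumes a key built from a client ID and a per-request key joined by ':'. The names CacheKey, CacheKeyHash and buildFlatKey, and the separator, are hypothetical and do not reflect nsCacheService's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Option 1 (hypothetical): keep the two parts separate instead of building a
// new concatenated string. With a refcounted, shareable string class each
// member could share the caller's buffer; std::string copies, so this only
// shows the shape of the key, not the buffer sharing itself.
struct CacheKey {
    std::string clientID;
    std::string key;
    bool operator==(const CacheKey& other) const {
        return clientID == other.clientID && key == other.key;
    }
};

struct CacheKeyHash {
    std::size_t operator()(const CacheKey& k) const {
        std::size_t h1 = std::hash<std::string>{}(k.clientID);
        std::size_t h2 = std::hash<std::string>{}(k.key);
        // Simple hash combine using the common golden-ratio mixing constant.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

// Option 2 (hypothetical): still a single flat key, but reserve the exact
// length first so the appends never trigger a reallocation.
std::string buildFlatKey(const std::string& clientID, const std::string& key) {
    std::string flat;
    flat.reserve(clientID.size() + 1 + key.size());
    flat.append(clientID);
    flat.push_back(':');
    flat.append(key);
    return flat;
}
```

Keeping the parts separate also removes any ambiguity a plain concatenation without a separator would have ("ab" + "c" versus "a" + "bc").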

    We should really make cookies output nsACString, not a char** and see how much of a difference that makes. I expect some. I also wonder whether it’s worth precomputing the length we’ll need and allocating it all at once instead of going through and doing one append per cookie. Depends on the typical cookie count, I guess…
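The precompute-then-allocate approach might look like the following minimal sketch. Cookie and joinCookies are hypothetical stand-ins, not nsCookieService code, and the "; " separator follows the usual Cookie header format:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical cookie record.
struct Cookie {
    std::string name;
    std::string value;
};

// Joins name=value pairs with "; " using a single allocation: sum the lengths
// first, reserve once, then append without further reallocation.
std::string joinCookies(const std::vector<Cookie>& cookies) {
    static const std::string kSeparator = "; ";
    std::size_t total = 0;
    for (const Cookie& c : cookies) {
        total += c.name.size() + 1 + c.value.size();  // "name=value"
    }
    if (!cookies.empty()) {
        total += kSeparator.size() * (cookies.size() - 1);
    }

    std::string header;
    header.reserve(total);  // one allocation up front
    for (std::size_t i = 0; i < cookies.size(); ++i) {
        if (i > 0) header += kSeparator;
        header += cookies[i].name;
        header += '=';
        header += cookies[i].value;
    }
    assert(header.size() == total);  // the precomputed length was exact
    return header;
}
```

Whether the extra pass over the cookies is worth it depends, as noted above, on the typical cookie count.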

    Maybe we should do 9-frame stacks to see the callers of the string conversion functions?

  3. Brendan Eich

    Benjamin: Using a single large allocation for a double hashtable instead of lots of little ones for a hashtable with chaining is the right answer if the entry size is small enough. See the big comment near the top of jsdhash.h. If there is a user of {js,pl}dhash configuring too large an entry size, you get a warning in DEBUG builds.

    But you didn’t mention entry size, so I’m wondering why you think a single large allocation would fragment worse than a bunch of small ones? The table growth uses malloc and free, not realloc. Ignoring overlarge entry size, a double hashtable is strictly less fragmenting than an open table with chaining.

    /be
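To make the single-allocation point concrete, here is a minimal sketch of an open-addressed set using double hashing. It assumes nonzero uint32_t keys and no removal, and it illustrates the general technique rather than jsdhash itself. DoubleHashSet is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical open-addressed set with double hashing. All entries live in
// one contiguous array, so the table costs one allocation per size rather
// than one per entry as with chaining. Growth allocates a new array and frees
// the old one (no realloc). Keys must be nonzero; 0 marks an empty slot.
class DoubleHashSet {
public:
    DoubleHashSet() : slots_(8, 0) {}

    void insert(std::uint32_t key) {
        assert(key != 0);  // 0 is reserved as the empty marker
        if ((count_ + 1) * 4 > slots_.size() * 3) grow();  // keep load <= 75%
        if (placeKey(slots_, key)) ++count_;
    }

    bool contains(std::uint32_t key) const {
        std::size_t mask = slots_.size() - 1;
        std::size_t i = hash1(key) & mask;
        std::size_t step = hash2(key);
        while (slots_[i] != 0) {
            if (slots_[i] == key) return true;
            i = (i + step) & mask;
        }
        return false;
    }

    std::size_t size() const { return count_; }
    std::size_t capacity() const { return slots_.size(); }

private:
    // Primary hash picks the starting slot.
    static std::size_t hash1(std::uint32_t k) { return k * 2654435761u; }
    // Secondary hash picks the probe step. An odd step is coprime with a
    // power-of-two capacity, so probing visits every slot before repeating.
    static std::size_t hash2(std::uint32_t k) { return ((k >> 16) ^ k) | 1u; }

    // Returns true if the key was newly added.
    static bool placeKey(std::vector<std::uint32_t>& slots, std::uint32_t key) {
        std::size_t mask = slots.size() - 1;
        std::size_t i = hash1(key) & mask;
        std::size_t step = hash2(key);
        while (slots[i] != 0) {
            if (slots[i] == key) return false;
            i = (i + step) & mask;
        }
        slots[i] = key;
        return true;
    }

    void grow() {
        std::vector<std::uint32_t> bigger(slots_.size() * 2, 0);  // new block
        for (std::uint32_t k : slots_) {
            if (k != 0) placeKey(bigger, k);
        }
        slots_.swap(bigger);  // old block is freed when `bigger` goes away
    }

    std::vector<std::uint32_t> slots_;
    std::size_t count_ = 0;
};
```

A chaining table holding 100 keys would typically make at least 100 separate node allocations. This one makes only a handful of array allocations, one per growth step.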

  4. Pingback: Leaks? Memory? We never forgot about you. « pavlov.net

  5. Steve

    Heavy use of the stack has its own problems.

    If a given function uses more than 4096 bytes of stack space, Microsoft’s compiler will add code to ensure that subsequent pages of memory are actually present. This makes the code larger and slower.

    I wouldn’t be surprised to see other compilers generating similar code.

    https://bugzilla.mozilla.org/show_bug.cgi?id=359453
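One way to get the benefit of stack allocation without large frames is a fixed inline buffer that spills to the heap for big requests, similar in spirit to Mozilla's nsAutoString. Here is a minimal sketch. InlineBuffer and its below-4096-byte cap are assumptions, with the cap chosen to match the page size mentioned above:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical byte buffer with fixed inline storage. Requests up to
// InlineBytes live inside the object (on the stack when the object is a
// local); larger ones fall back to the heap. Capping InlineBytes below one
// 4096-byte page keeps this object from pushing a frame past that threshold.
template <std::size_t InlineBytes>
class InlineBuffer {
    static_assert(InlineBytes > 0, "inline storage must be non-empty");
    static_assert(InlineBytes < 4096, "keep inline storage below one page");

public:
    explicit InlineBuffer(std::size_t size) : size_(size) {
        if (size > InlineBytes) heap_.reset(new char[size]);
    }

    char* data() { return heap_ ? heap_.get() : inline_; }
    std::size_t size() const { return size_; }
    bool onHeap() const { return heap_ != nullptr; }

private:
    char inline_[InlineBytes];
    std::unique_ptr<char[]> heap_;
    std::size_t size_;
};
```

Small, common sizes then cost no malloc at all, while rare large requests still work without a deep stack frame.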

