Recently, I ran into a strange memory problem in our product system: the RSS (resident set size) of the Java process kept increasing over time. Java heap utilization was below 50%, so it looked like a native memory leak, but it turned out to be something else.
Leaking Direct Buffer?
Direct buffers are one of the potential causes of a native memory leak, so I first checked them with the tool from Alan Bateman's blog. It reported the direct buffers as follows:
           direct                        mapped
 Count   Capacity     Memory     Count   Capacity   Memory
   419  123242031  123242031         0          0        0
   419  123242031  123242031         0          0        0
   421  123299674  123299674         0          0        0
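The same counters can also be read from inside the JVM. The following is a minimal sketch, assuming Java 7 or later, that prints the count, total capacity and memory used of each buffer pool through the standard BufferPoolMXBean. The class name BufferPoolStats and the helper describe are made up for this illustration, and unlike the tool above it only looks at the JVM it runs in.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical helper class, not part of the tool mentioned in the post.
public class BufferPoolStats {

    /** Formats one line per buffer pool: name, count, capacity, memory used. */
    static String describe(BufferPoolMXBean pool) {
        return String.format("%-8s count=%d capacity=%d memory=%d",
                pool.getName(), pool.getCount(),
                pool.getTotalCapacity(), pool.getMemoryUsed());
    }

    public static void main(String[] args) {
        // Hold one direct buffer so the "direct" pool is not empty in this demo.
        ByteBuffer keepAlive = ByteBuffer.allocateDirect(1024);

        // The platform exposes one bean per pool, normally "direct" and "mapped".
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.println(describe(pool));
        }

        // Reference the buffer after the loop so it stays reachable until here.
        System.out.println("demo buffer capacity: " + keepAlive.capacity());
    }
}
```

Sampling these values periodically and comparing them with the RSS shows whether direct buffers grow together with the process size.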
There is no strong evidence that the problem is caused by direct buffers.
Per-thread malloc?
While checking the memory usage of the Java process with pmap, I found some strange 64MB memory blocks, similar to those described in Lex Chou's blog (Chinese). So I tried setting the MALLOC_ARENA_MAX environment variable. Unfortunately, that did not resolve the problem.
Native Heap Fragmentation?
With further investigation, I found that this problem could be caused by memory fragmentation, as described in this bug report. The malloc() implementation works fine for general applications, but it cannot, and does not need to, suit every kind of application. Using gdb, I found the real evidence:
gdb --pid <pid>
(gdb) call malloc_stats()
And got the following output:
Arena 0:
system bytes = 2338504704
in use bytes = 69503376
Arena 1:
system bytes = 48705536
in use bytes = 19162544
Arena 2:
system bytes = 806912
in use bytes = 341776
Arena 3:
system bytes = 17965056
in use bytes = 17505488
Total (incl. mmap):
system bytes = 2444173312
in use bytes = 144704288
max mmap regions = 59
max mmap bytes = 154546176
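To put these numbers in proportion, the short sketch below computes how much of the memory obtained from the system malloc reports as in use, for each arena and in total. The figures are copied from the output above. The class name ArenaUtilization and the method inUsePercent are made up for this illustration.

```java
// Hypothetical helper class for illustrating the arithmetic only.
public class ArenaUtilization {

    /** Percentage of the bytes obtained from the OS that malloc reports as in use. */
    static double inUsePercent(long systemBytes, long inUseBytes) {
        return 100.0 * inUseBytes / systemBytes;
    }

    public static void main(String[] args) {
        String[] names = {"Arena 0", "Arena 1", "Arena 2", "Arena 3", "Total"};
        // {system bytes, in use bytes} as printed by malloc_stats().
        long[][] rows = {
            {2338504704L, 69503376L},
            {48705536L, 19162544L},
            {806912L, 341776L},
            {17965056L, 17505488L},
            {2444173312L, 144704288L},
        };
        for (int i = 0; i < rows.length; i++) {
            System.out.printf("%-8s %5.1f%% in use%n",
                    names[i], inUsePercent(rows[i][0], rows[i][1]));
        }
    }
}
```

Arena 0, the main arena, comes out at roughly 3% in use, and the process as a whole at roughly 6%.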
So about 2.4GB of memory had been allocated from the system, but only about 144MB of it was in use. This is a strong indicator of a problem, so I set MALLOC_MMAP_THRESHOLD_ to 131072 and monitored the result. The RSS did seem to come down after a long run, but it still climbed too high (9G).
Conclusion
After monitoring the application for a long time, I found that the actual problem is complicated and has more than one cause. First, native heap fragmentation is the major contributor. Second, this application creates lots of transient objects, and some direct byte buffers are kept alive a little longer than the rest. Because young GCs are frequent, those byte buffers survive long enough to be promoted to the old generation. After that, the old generation is rarely collected since it is never full, so those byte buffers are not garbage collected, and the native memory behind a direct buffer is released only once the buffer object itself is collected.
To resolve this problem, a small -XX:CMSInitiatingOccupancyFraction is used together with the -XX:+UseCMSInitiatingOccupancyOnly option, so that old generation collections start earlier. The total RSS looks quite stable now.
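One way to check how often each generation is actually collected is to read the collection counters the JVM exposes. The following is a minimal sketch using the standard GarbageCollectorMXBean. The class name GcCounts and the helper describe are made up for this illustration, and the collector names printed depend on the JVM and the GC in use (with CMS the old generation collector is typically reported as ConcurrentMarkSweep).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class, not taken from the application in this post.
public class GcCounts {

    /** One line per collector: name, number of collections, accumulated time in ms. */
    static String describe(GarbageCollectorMXBean gc) {
        return String.format("%s: collections=%d timeMs=%d",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
    }

    public static void main(String[] args) {
        // One bean per collector, usually one for the young and one for the old generation.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(describe(gc));
        }
    }
}
```

If the old generation counter barely moves while the young generation counter keeps climbing, promoted direct byte buffers are likely to stay around for a long time.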