Wednesday, December 31, 2014

Java RSS increased by memory fragmentation

Recently I found a strange memory-related problem in our production system: the RSS (resident set size) of the Java process kept increasing over time. Java heap utilization was below 50%, so it looked like a native memory leak, but it turned out to be something else.

Leaking Direct Buffer?

Direct buffers are one potential cause of native memory leaks, so I first checked them with the tool from Alan Bateman's blog. It reported the direct buffers as follows:
          direct                        mapped
 Count   Capacity     Memory   Count   Capacity     Memory
   419  123242031  123242031       0          0          0
   419  123242031  123242031       0          0          0
   421  123299674  123299674       0          0          0
The count and capacity stay almost flat at about 123 MB, so there is no strong evidence that direct buffers are the cause.

Per-thread malloc?

While checking the memory usage of the Java process with pmap, I found some strange 64 MB memory blocks, similar to those described in Lex Chou's blog (in Chinese). So I tried setting the MALLOC_ARENA_MAX environment variable. Unfortunately, that did not resolve the problem.
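
As a rough sketch of this check, the snippet below counts mappings close to 64 MB in `pmap -x` output and then caps the number of malloc arenas. The function name count_64mb_blocks, the 60000 to 65536 KB size window, and the value 4 are assumptions for illustration; I am not claiming these are the exact values used here.

```shell
# Count mappings of roughly 64 MB in `pmap -x` output read from stdin.
# The size window is a heuristic (assumption): an arena's 64 MB reservation
# often shows up as one large mapping plus a small read-write one.
count_64mb_blocks() {
  # pmap -x columns: Address Kbytes RSS Dirty Mode Mapping
  awk '$2 ~ /^[0-9]+$/ && $2 >= 60000 && $2 <= 65536 { n++ } END { print n + 0 }'
}

# Usage, with <pid> being the Java process:
#   pmap -x <pid> | count_64mb_blocks

# Cap the number of malloc arenas before starting the JVM.
# 4 is an assumed example value, not taken from the post.
export MALLOC_ARENA_MAX=4
```

The variable has to be set in the environment that launches the JVM; it has no effect on a process that is already running.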

Native Heap Fragmentation?

After further investigation, I found that this problem could be caused by memory fragmentation, as described in this bug report. The malloc() implementation works well for general-purpose applications, but it cannot, and does not need to, suit every kind of application.
Using gdb, I found the real evidence:

gdb --pid <pid>
(gdb) call malloc_stats()
and got the following output:

Arena 0:
system bytes     = 2338504704
in use bytes     =   69503376
Arena 1:
system bytes     =   48705536
in use bytes     =   19162544
Arena 2:
system bytes     =     806912
in use bytes     =     341776
Arena 3:
system bytes     =   17965056
in use bytes     =   17505488
Total (incl. mmap):
system bytes     = 2444173312
in use bytes     =  144704288
max mmap regions =         59
max mmap bytes   =  154546176
So about 2.4 GB of memory has been allocated from the system, but only about 144 MB of it is in use. This is a strong indicator of fragmentation, so I set MALLOC_MMAP_THRESHOLD_ to 131072 and monitored the result. The RSS did seem to come back down after a long run, but it still rose too high (9 GB).
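
To make the ratio explicit, here is a small sketch that turns malloc_stats() output into an in-use percentage per arena. The function name malloc_utilization and the file name malloc_stats.txt are hypothetical; the threshold value is the one mentioned above.

```shell
# Print in-use bytes as a percentage of system bytes for each section
# (every arena and the total) of malloc_stats() output read from stdin.
malloc_utilization() {
  awk '
    /^(Arena|Total)/ { name = $0; sub(/:$/, "", name) }   # remember section name
    /^system bytes/  { sys = $4 }                         # bytes obtained from the OS
    /^in use bytes/  { printf "%s: %.1f%% in use\n", name, 100 * $5 / sys }
  '
}

# Usage, assuming the output above was saved to malloc_stats.txt:
#   malloc_utilization < malloc_stats.txt

# Threshold from the text. Setting it explicitly also turns off glibc's
# dynamic adjustment of the mmap threshold.
export MALLOC_MMAP_THRESHOLD_=131072
```

For the numbers above, Arena 0 comes out at roughly 3% in use and the total at roughly 6%, which is why fragmentation in the main arena looks like the main suspect.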

Conclusion

After monitoring the application for a long time, I found that the actual problem is complicated and has multiple causes. First, heap fragmentation is the major contributor. Second, this application creates lots of transient objects, and some direct byte buffers are kept alive a little longer. Because young GCs are frequent, those byte buffers get promoted to the old generation. After that, GC rarely runs in the old generation since it is not full, so those byte buffers are not garbage collected and the native memory behind them is not released.
To resolve this problem, I set a small CMSInitiatingOccupancyFraction together with the UseCMSInitiatingOccupancyOnly option. The total RSS now looks quite stable.
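
A minimal sketch of what such a setting could look like on the command line. The value 30 and the name app.jar are assumptions for illustration only; the right fraction depends on how full the old generation normally gets.

```shell
# Assumed value: "small" is all that is stated; tune this for your own application.
CMS_FRACTION=30

# Start CMS cycles once the old generation is CMS_FRACTION percent full,
# and use only that threshold instead of the JVM's own heuristics.
JAVA_OPTS="-XX:+UseConcMarkSweepGC \
-XX:CMSInitiatingOccupancyFraction=${CMS_FRACTION} \
-XX:+UseCMSInitiatingOccupancyOnly"

echo "$JAVA_OPTS"
# java $JAVA_OPTS -jar app.jar   # app.jar is a hypothetical application
```

These flags apply to JVMs that still ship the CMS collector.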