Introduction
It is commonly understood that a single snapshot of the Java heap is not enough to find a memory leak. The usual approach is to look for a monotonic increase in the number of objects of some class, either by online profiling/monitoring or by comparing a series of snapshots taken over time. However, such live monitoring is not always possible. It is especially difficult in production systems, both because of the performance cost of running a profiler and because some leaks show up only rarely, under specific conditions.
In this blog I will try to give some guidelines on how to find the unwanted memory accumulation without having to sleep beside the servers in the office - with the help of the recently released SAP Memory Analyzer tool and a couple of tricks.
Preparation
First, make sure that you will get sufficient data for troubleshooting even if the problem occurs when you are not at the system. For this purpose, configure the JVM to produce a heap dump when an OutOfMemoryError occurs (see description here).
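On HotSpot-based JVMs this is commonly done with the start-up flag -XX:+HeapDumpOnOutOfMemoryError (optionally together with -XX:HeapDumpPath=<directory>). As a rough sketch, assuming a HotSpot-based JVM, the same option can also be checked and switched on in an already running VM through the HotSpotDiagnosticMXBean. The class name HeapDumpOnOomCheck is made up for this illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; assumes a HotSpot-based JVM.
class HeapDumpOnOomCheck {

    private static final String OPTION = "HeapDumpOnOutOfMemoryError";

    /** Switches the option on if it is off and returns its resulting value. */
    static boolean ensureEnabled() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (!Boolean.parseBoolean(bean.getVMOption(OPTION).getValue())) {
            // The flag is "manageable" in HotSpot, so it may be changed at runtime.
            bean.setVMOption(OPTION, "true");
        }
        return Boolean.parseBoolean(bean.getVMOption(OPTION).getValue());
    }

    public static void main(String[] args) {
        System.out.println(OPTION + " enabled: " + ensureEnabled());
    }
}
```

Setting the flag on the command line is still the safer choice, because it survives restarts.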
The second step of the preparation is to make the memory leak more visible and easier to detect. To achieve this, use the following trick: configure the maximum size of the Java heap to be much higher than the heap used when the application is running correctly (e.g. set it to twice what is usually left after a full GC). Even if you don't know how much memory the application really needs, increasing the heap is not a bad idea (it may turn out that there is no leak and the application simply needs more heap). I don't want to go into a discussion of whether running Java applications with very big heaps is a good approach in general - simply use the tip for the duration of the troubleshooting.
What do you gain by this change? If the VM throws an OutOfMemoryError with this configuration, it will produce a heap dump in which the objects related to the leak account for about half of the total heap size, so the leak should be relatively easy to detect later.
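The arithmetic behind the tip can be sketched as follows. This is only an illustration: the class and method names are made up, and System.gc() is merely a request, so the measured value approximates "what is left after a full GC". GC logs from a healthy run are a more reliable source for that number.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration only.
class HeapSizingHint {

    /** Suggested -Xmx value in megabytes: twice the given live-set size, rounded up. */
    static long suggestedMaxHeapMb(long usedAfterFullGcBytes) {
        long mb = 1024L * 1024L;
        return (2 * usedAfterFullGcBytes + mb - 1) / mb;
    }

    public static void main(String[] args) {
        System.gc(); // only a request; the VM is free to ignore it
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("Heap used after GC request: " + heap.getUsed() + " bytes");
        System.out.println("Suggested setting: -Xmx" + suggestedMaxHeapMb(heap.getUsed()) + "m");
    }
}
```

For example, an application that normally keeps about 300 MB alive after a full GC would be started with -Xmx600m for the troubleshooting period.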
Analysis - Case 1
Now imagine that, with these settings in place, you come to the office in the morning and find that the error has reoccurred and there is a nice big heap dump in the file system. What is next? Well, believe it or not, what follows is the easier part.
First, open the heap dump with the SAP Memory Analyzer tool. You may have to wait a bit for the initial parsing if the heap dump is big, but subsequent reopening will be instant (see some performance metrics here).
Then let's find out who has eaten up the memory. Go to the Dominator tree view.
There you will find the object graph transformed into a tree - a special kind of tree that shows the dependencies between the objects, and not simply the references between them. I won't go into details about the theory behind this tree, but I'll simply list some of its key properties:
- At the top of this tree (i.e. what you see immediately after opening it) one can find the biggest objects in the heap
- All descendants of an object in the dominator tree are retained by it (meaning that they will be garbage collected if the object is garbage collected). The biggest objects are the ones which retain the most heap
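To make the second property concrete, here is a minimal sketch that computes retained sizes on a toy object graph. It uses a simple set-based dominator computation chosen for readability. It is not meant to show how the Memory Analyzer computes its tree, and it assumes every object is reachable from the virtual GC root (object 0). All names and sizes are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy illustration; assumes every object is reachable from object 0 (a virtual GC root).
class RetainedSizeSketch {

    /** refs[i] lists the objects referenced by object i. dom[n] = objects on every path from root to n. */
    static BitSet[] dominators(int[][] refs) {
        int n = refs.length;
        List<List<Integer>> preds = new ArrayList<>();
        for (int i = 0; i < n; i++) preds.add(new ArrayList<>());
        for (int from = 0; from < n; from++)
            for (int to : refs[from]) preds.get(to).add(from);

        BitSet[] dom = new BitSet[n];
        for (int i = 0; i < n; i++) {
            dom[i] = new BitSet(n);
            if (i == 0) dom[i].set(0); else dom[i].set(0, n); // start from "all objects"
        }
        boolean changed = true;
        while (changed) { // iterate until the sets stop shrinking
            changed = false;
            for (int i = 1; i < n; i++) {
                BitSet next = new BitSet(n);
                next.set(0, n);
                for (int p : preds.get(i)) next.and(dom[p]); // common dominators of all referrers
                next.set(i);                                 // every object dominates itself
                if (!next.equals(dom[i])) { dom[i] = next; changed = true; }
            }
        }
        return dom;
    }

    /** Retained size of x = sum of the shallow sizes of all objects that x dominates (including x). */
    static long[] retainedSizes(int[][] refs, long[] shallow) {
        BitSet[] dom = dominators(refs);
        long[] retained = new long[refs.length];
        for (int obj = 0; obj < refs.length; obj++)
            for (int d = dom[obj].nextSetBit(0); d >= 0; d = dom[obj].nextSetBit(d + 1))
                retained[d] += shallow[obj];
        return retained;
    }

    public static void main(String[] args) {
        // 0 = root, 1 = a cache, 2 and 3 = two entries, 4 = payload shared by both entries, 5 = unrelated
        int[][] refs = {{1, 5}, {2, 3}, {4}, {4}, {}, {}};
        long[] shallow = {0, 100, 10, 10, 1000, 50};
        System.out.println(Arrays.toString(retainedSizes(refs, shallow)));
        // prints [1170, 1120, 10, 10, 1000, 50]
    }
}
```

Note that the shared payload (object 4) is referenced by both entries, so neither entry retains it. It is retained by the cache (object 1), which is why the cache shows up as the big object.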
In most cases where there is a leak, one will immediately notice it by looking at the size of the biggest object. To get closer to the real accumulation point, expand the tree under the biggest object until there is a significant drop in retained size between a parent and its children (usually the parent will be some kind of collection or array). Well, it's that easy. You found it! If you are interested, you can also analyze the content by further exploring the dominator tree.
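The manual expansion described above can be sketched in code. This is a hypothetical illustration, not a Memory Analyzer API: the Node record, the sample tree and the 0.5 ratio used as the definition of a "significant drop" are all assumptions (records need Java 16 or later).

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not part of any Memory Analyzer API.
class AccumulationPointSketch {

    /** One dominator tree node: a label, its retained size, and the objects it dominates directly. */
    record Node(String label, long retained, List<Node> children) {}

    /**
     * Follows the largest child for as long as it keeps at least `ratio` of its parent's
     * retained size, and returns the node below which the retained size drops sharply.
     */
    static Node accumulationPoint(Node start, double ratio) {
        Node current = start;
        while (true) {
            Node biggest = current.children().stream()
                    .max(Comparator.comparingLong(Node::retained))
                    .orElse(null);
            if (biggest == null || biggest.retained() < ratio * current.retained()) {
                return current;
            }
            current = biggest;
        }
    }

    /** Invented sample: a manager holding a map whose table holds three equally sized sessions. */
    static Node sample() {
        List<Node> sessions = List.of(
                new Node("Session #1", 1_000, List.of()),
                new Node("Session #2", 1_000, List.of()),
                new Node("Session #3", 1_000, List.of()));
        Node table = new Node("java.util.HashMap$Node[]", 3_048, sessions);
        Node map = new Node("java.util.HashMap", 3_096, List.of(table));
        return new Node("com.example.SessionManager", 3_120, List.of(map));
    }

    public static void main(String[] args) {
        System.out.println(accumulationPoint(sample(), 0.5).label());
        // prints java.util.HashMap$Node[]
    }
}
```

The walk stops at the array because each of its children retains only about a third of what the array itself retains.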
The next thing to do is to see the real reference chain from the GC roots. Simply call "Paths from the GC roots" from the context menu on the accumulation point object. In the "Paths from the GC Roots" view one can see the references together with the names of the fields that hold them.
Analysis - Case 2
It would be nice if every problem were so easily found. Sometimes, however, the first look at the dominator tree is not enough. But one more click should make the second look sufficient. One click, but where? On the Group by class button in the toolbar. Here is some explanation. Previously we configured a heap big enough for the leak to grow. And we also have a dominator tree covering the full object graph, which also includes the leak. So why don't we see it? In the example we just looked at, all the small leaking objects were dominated by one single object whose retained size was huge. But sometimes the leaking objects themselves are at the top of the dominator tree. Even though they are many in number, each of them is small in size and is therefore not displayed among the biggest objects.
However, if we manage to find the whole group of leaking objects and see their aggregated size, then the leak will be noticed as easily as in the previous example. This is exactly what grouping the objects by their class achieves. So, did you find the memory eater now? I hope the answer is "Yes". If not, please let me have the heap dump you are looking at and I'll try to extend and complete the description.
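The effect of grouping by class can be sketched with a few lines of code. This is a hypothetical illustration with invented class names and sizes, not the tool's implementation (records need Java 16 or later): one medium-sized cache sits next to a thousand small objects of the same class, all at the top level of the dominator tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration; class names and sizes are invented.
class GroupByClassSketch {

    /** A top-level entry of the dominator tree: the object's class name and its retained size. */
    record TopLevelObject(String className, long retained) {}

    /** Sums the retained sizes of all top-level objects per class. */
    static Map<String, Long> retainedByClass(List<TopLevelObject> topLevel) {
        return topLevel.stream().collect(Collectors.groupingBy(
                TopLevelObject::className,
                Collectors.summingLong(TopLevelObject::retained)));
    }

    /** Returns the class whose objects retain the most memory in total. */
    static String biggestClass(List<TopLevelObject> topLevel) {
        return retainedByClass(topLevel).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    /** One 50,000-byte cache and 1,000 leaked objects of 200 bytes each. */
    static List<TopLevelObject> sample() {
        List<TopLevelObject> objects = new ArrayList<>();
        objects.add(new TopLevelObject("com.example.Cache", 50_000));
        for (int i = 0; i < 1_000; i++) {
            objects.add(new TopLevelObject("com.example.PendingRequest", 200));
        }
        return objects;
    }

    public static void main(String[] args) {
        System.out.println(biggestClass(sample()));
        // prints com.example.PendingRequest
    }
}
```

Sorted object by object, the cache is the biggest entry and each PendingRequest is far down the list. Grouped by class, the requests add up to 200,000 bytes and move to the top.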