Troubleshooting OutOfMemory Exceptions and Memory Leaks in Production

Many root causes ago I was working with a customer who suspected they had a memory leak in production. Their JVM console event logs were showing the famous OutOfMemory exception, which was being thrown every three to four days and causing production outages. To stop these exceptions, the operations team would restart all JVMs at midnight every night to prevent system-wide impact to customers during business hours. And if Ops forgot to restart the JVMs (which they did on several occasions), production went bang.

It’s worth pointing out at this stage that “OutOfMemory” exceptions in log files don’t automatically mean your application has a memory leak. They simply mean your application is using, or needs, more memory than you’ve allocated to it at run-time. A leak is just one of several potential causes of memory growing over time until it is exhausted.

A common root cause is when applications are deployed in production with default JVM/CLR memory settings, or more to the point, incorrect memory settings. All applications are different; some are small, some are big, some have few libraries, some have hundreds of libraries, some libraries are a few MB, and others are tens of MB. When an app is loaded into memory at run-time, few customers really get a sense of whether their JVM/CLR has enough memory to cope (they just assume it does), even before users start hitting the JVM/CLR with requests and sessions.

For example, the default memory pool size for PermGen space (where classes are stored) is 64MB on Sun JVMs. This might sound reasonable, but I’ve seen plenty of customer applications that have tens of libraries with many dependencies that will exhaust 64MB easily and cause OutOfMemory to occur. A simple -XX:MaxPermSize=128m configuration change on the JVM, to compensate for the size of the application libraries, will prevent the PermGen space blowing up with a java.lang.OutOfMemoryError: PermGen space exception. I have known several customer applications that resolved OutOfMemory exceptions simply by increasing the MaxPermSize setting on the JVM.

Using a solution like AppDynamics you can easily monitor the different memory pool sizes over time to understand just how close your application is to exhausting memory, so you can better fine-tune your JVM/CLR memory settings.

Understand the true utilization of your memory pools.
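If you just want a quick look at the pools without any tooling, the JVM exposes them through the standard java.lang.management API. The following is a minimal sketch, not how AppDynamics collects its data; the class name MemoryPoolReport and the helper percentUsed are made up for illustration. Pool names depend on the JVM and version: older HotSpot JVMs report a PermGen pool, while Java 8 and later report Metaspace instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper class: prints how full each JVM memory pool is.
final class MemoryPoolReport {

    /** Percentage of the pool's maximum in use, or -1 when no maximum is defined. */
    static double percentUsed(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0; // MemoryUsage.getMax() is -1 when the pool has no defined limit
        }
        return 100.0 * usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            double pct = percentUsed(usage.getUsed(), usage.getMax());
            String pctText = pct < 0 ? "n/a" : String.format("%.1f%%", pct);
            System.out.printf("%-28s used=%,d bytes, %s of max%n",
                    pool.getName(), usage.getUsed(), pctText);
        }
    }
}
```

Running this once tells you little; the value comes from sampling it repeatedly under realistic load and watching which pool creeps towards its maximum.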

Another common reason for OutOfMemory exceptions is when the application queries large amounts of data from relational databases and tries to hold and process it in JVM/CLR memory. This might sound obvious, but you cannot really change the laws of physics. If your JVM/CLR heap is set to 256MB and is serving 1000 users, each requesting around 250KB of data per request, that is roughly 250MB of data in flight and you’re running dangerously close to exhausting the JVM/CLR memory. I’ve also seen this happen with wildcard search transactions on web applications, when tables of data are literally dumped into JVM memory via JDBC ResultSets, causing the heap to blow with the java.lang.OutOfMemoryError: Java heap space exception. Data access is an expensive operation, so keep it to a minimum; let the database do the hard work for you (querying) and only bring back the data you need to the JVM.
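To make the numbers concrete, here is a small sketch that works through the 1000 users at 250KB example and then shows one way to cap what a wildcard search brings back over JDBC. The class BoundedQuery, the customers table and its name column are all hypothetical, and the fetch size of 100 is an arbitrary choice; setMaxRows and setFetchSize are standard java.sql.Statement methods, though how strictly the fetch size hint is honored depends on the driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: heap budget arithmetic plus a row-capped JDBC query.
final class BoundedQuery {

    /** Bytes held at once if every user has one response in memory. */
    static long bytesInFlight(int concurrentUsers, long bytesPerRequest) {
        return concurrentUsers * bytesPerRequest;
    }

    /** Fraction of the heap that the in-flight data would occupy. */
    static double heapFraction(long bytesInFlight, long heapBytes) {
        return (double) bytesInFlight / heapBytes;
    }

    /** Wildcard search that never returns more than maxRows rows to the JVM. */
    static List<String> findNames(Connection conn, String prefix, int maxRows)
            throws SQLException {
        // Table and column names are assumptions made for this example.
        String sql = "SELECT name FROM customers WHERE name LIKE ? ORDER BY name";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, prefix + "%");
            ps.setMaxRows(maxRows); // hard cap on rows in the ResultSet
            ps.setFetchSize(100);   // hint: fetch rows from the database in batches
            try (ResultSet rs = ps.executeQuery()) {
                List<String> names = new ArrayList<>();
                while (rs.next()) {
                    names.add(rs.getString(1));
                }
                return names;
            }
        }
    }

    public static void main(String[] args) {
        long inFlight = bytesInFlight(1000, 250L * 1024); // 1000 users x 250KB
        long heap = 256L * 1024 * 1024;                   // 256MB heap
        System.out.printf("%,d bytes in flight, %.0f%% of the heap%n",
                inFlight, 100 * heapFraction(inFlight, heap));
    }
}
```

With those inputs the request data alone would take up roughly 95% of the heap, before counting anything else the application keeps in memory.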

With AppDynamics you can also do some cool stuff like track heap usage over time, along with the object count and physical size (MB) of the objects residing in memory. This gives you great visibility into how much data is being held in the JVM/CLR at any one time and how much of your memory is being consumed by different types of objects and data structures. You can also correlate this information with garbage collection cycles to understand how often memory is reclaimed by the JVM/CLR.

Track your JVM/CLR heap usage over time

Understand object count and physical size in memory
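The same idea can be approximated by hand with the JVM's standard management beans: sample heap usage at intervals and record the garbage collection counter alongside it. This is only a rough sketch with made-up names (HeapSampler, sample, totalGcCount) and an arbitrary one second interval; it does not give you per-object counts or sizes.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical sampler: logs heap usage next to the cumulative GC count.
final class HeapSampler {

    /** Total collections so far across all collectors that report a count. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** One line of output: current heap usage, heap maximum and GC count. */
    static String sample() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return String.format("heapUsed=%d heapMax=%d gcCount=%d",
                heap.getUsed(), heap.getMax(), totalGcCount());
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) {
            System.out.println(sample());
            Thread.sleep(1000); // sampling interval chosen arbitrarily
        }
    }
}
```

If heap usage after each collection keeps rising while the GC count climbs, memory is being retained rather than reclaimed, which is the pattern worth investigating further.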

If you do indeed have a memory leak, then things are a little more complicated to resolve. The first approach is to try to reproduce the memory leak in a dev or test environment using a tool like a profiler, which will step through code execution and show the state of memory, heap, and object allocation as you hit the application with requests. The problem with this approach is that you can never fully replicate the state or load of production, and if you do, the overhead of the profiler will drastically change the behavior and state of the application, potentially crashing it.

Another manual approach is to take heap dumps to try to piece together what is being allocated in JVM/CLR memory at a point in time. Because each dump is only a snapshot, you have to take several dumps over time to learn which data structures and objects are growing with an upward trend. The big problem with this approach is that performing heap dumps significantly impacts the performance of the JVM/CLR for several minutes. Obviously, this is a big “no no” in production! To quote the IBM website: “Although heap dumps are generated only in response to a detected memory leak, you must understand that generating heap dumps can have a severe performance impact on WebSphere Application Server for several minutes.”

So considering the overhead restrictions of dumps and profilers, there is no guarantee you’ll solve the memory leak in production, which leaves you in a tricky position. You can either recycle your JVMs/CLRs every night to prevent OutOfMemory exceptions, or you can use a production APM solution like AppDynamics.

Big Dumps Means Big Overhead
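Comparing snapshots by hand comes down to a simple question: which types have more instances in every dump than in the one before? The sketch below assumes you have already reduced each dump to a map of class name to instance count (how you extract that is up to your tooling); the class SnapshotTrend, the method growingTypes and the sample data are invented for illustration, and the code needs Java 9 or later for List.of and Map.of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper: finds types whose instance count grows in every snapshot.
final class SnapshotTrend {

    /** Types present in all snapshots whose count strictly increases each time. */
    static List<String> growingTypes(List<Map<String, Long>> snapshots) {
        List<String> growing = new ArrayList<>();
        if (snapshots.size() < 2) {
            return growing; // a trend needs at least two snapshots
        }
        // TreeSet gives a stable, alphabetical result order.
        for (String type : new TreeSet<>(snapshots.get(0).keySet())) {
            boolean alwaysUp = true;
            for (int i = 1; i < snapshots.size() && alwaysUp; i++) {
                Long prev = snapshots.get(i - 1).get(type);
                Long curr = snapshots.get(i).get(type);
                alwaysUp = prev != null && curr != null && curr > prev;
            }
            if (alwaysUp) {
                growing.add(type);
            }
        }
        return growing;
    }

    public static void main(String[] args) {
        // Made-up counts from three dumps taken some time apart.
        List<Map<String, Long>> snapshots = List.of(
                Map.of("com.example.SessionCache$Entry", 1_000L, "java.lang.String", 50_000L),
                Map.of("com.example.SessionCache$Entry", 4_000L, "java.lang.String", 48_000L),
                Map.of("com.example.SessionCache$Entry", 9_000L, "java.lang.String", 51_000L));
        System.out.println(growingTypes(snapshots)); // prints [com.example.SessionCache$Entry]
    }
}
```

A type that grows in every snapshot is a suspect, not a verdict: a cache that is still warming up looks the same for a while, so the more snapshots you can afford, the more you can trust the trend.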

AppDynamics recently introduced new product capabilities to automatically track and flag data structures in the JVM/CLR that are potentially leaking. They were designed for high-throughput production environments to allow operations, app support and developers to find the root cause of memory leaks fast. Our engineers took what they learned over the past decade and put this intelligence into our product so that it can automatically spot, collect and flag leaking data structures and notify the user which one is leaking. For every leaking data structure it can show the contents (object count, trend and physical size), the business transaction, and the call stack responsible for the leak: pretty much everything you need to perform root cause analysis and solve the leak. It removes the time, effort and pain associated with manually troubleshooting leaks in production. It does the hard work so the user doesn’t have to. Let’s face it: spending hours trawling through heap dumps and profilers isn’t fun!

AppDynamics can spot and flag leaking data structures:

For each leaking data structure you can inspect its contents to see object count and size:

To get to root cause you can then identify the business transaction and call stack responsible for accessing and invoking the leaking data structure:

Next time you have a memory leak in production, try using AppDynamics Pro. It’s fast, easy and does all the hard work for you. If you want to start today you can download AppDynamics Lite and see for yourself how easy managing application performance is!

App Man.

Copyright © 2014 AppDynamics. All rights reserved.