Troubleshooting OutOfMemory Exceptions and Memory Leaks in Production

Many root causes ago I was working with a customer who suspected they had a memory leak in production. Their JVM console event logs were showing the famous OutOfMemory exception, thrown every three to four days and causing production outages. To work around this, the operations team restarted all JVMs at midnight every night so that customers would not see system-wide impact during business hours. And if Ops forgot to restart the JVMs (which they did on several occasions), production went bang.

It’s worth pointing out at this stage that “OutOfMemory” exceptions in log files don’t automatically mean your application has a memory leak. They simply mean your application is using, or needs, more memory than you’ve allocated to it at run-time. A leak is just one of several potential causes of memory growing over time until it is exhausted.
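
To make that distinction concrete, here is a minimal sketch of how you might tell "memory keeps growing" apart from "memory is simply high". It is not the customer's code: the class name HeapTrend, the method growsSteadily, and the sample values are all hypothetical, and the rule it applies (every sample higher than the last) is a deliberately crude heuristic. It also assumes the samples were taken at comparable points, ideally just after a full garbage collection, so that short-lived garbage does not distort the trend.

```java
public class HeapTrend {

    /** Heap currently in use by this JVM, in bytes. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Hypothetical heuristic: returns true only if each sample is strictly
     * higher than the one before it. Steady growth is consistent with a leak
     * but does not prove one; a flat or sawtooth pattern that still ends in
     * OutOfMemory points more towards an undersized heap.
     */
    static boolean growsSteadily(long[] samples) {
        if (samples.length < 2) {
            return false; // not enough data to call it a trend
        }
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("used=" + usedHeapBytes() + " max=" + maxBytes);

        // Made-up post-GC readings in MB, for illustration only.
        long[] climbing = {410, 455, 520, 610};
        long[] sawtooth = {410, 780, 430, 790};
        System.out.println("climbing: " + growsSteadily(climbing));
        System.out.println("sawtooth: " + growsSteadily(sawtooth));
    }
}
```

In practice you would feed this kind of check with readings collected over hours or days (from GC logs or a monitoring tool) rather than a handful of hard-coded numbers, and treat a positive result as a reason to investigate, not as a diagnosis.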