A few weeks back I met with a customer who had issues, and the expression on their faces said it all. The meeting started with an apology that several people couldn’t make it. Why? Because they were investigating a production outage. You might think I’ve just made that up, but I can assure you this was real, and it’s something I’ve witnessed many a time. It can be especially annoying when you’ve travelled many miles to chat with a customer, expecting a productive meeting, and then the alarm bells ring. However, an outage in this scenario just validates the reason you’re there in the first place.
The meeting kicked off and it was obvious the customer had been experiencing performance issues for several weeks. They openly admitted they already had monitoring toolsets deployed in dev, test and production, and that we’d need to help them justify why our solution should be deployed in production. That sounds like a weird statement to make, but actually it’s not. Someone was clearly responsible for the existing monitoring solutions and was adamant they would solve the issues, yet those issues weren’t being solved. I guess you can call it politics. The bottom line is that the customer didn’t have many options: either find another solution or continue investigating production outages with the current toolset. There is an important lesson here: if the monitoring tools you own aren’t delivering the results you expect, you really need to share this with the vendor who promised you those results. And if they don’t want to listen to your thoughts, give me a call and I’ll help you out.
The meeting progressed and I asked a simple question: “You have an APM solution already deployed, so what exactly is the problem?” The response I received was, “Our production slowdowns and outages come and go, they aren’t predictable, and we can’t run our existing APM solution in production continuously due to overhead.” To combat this overhead, the existing monitoring solution was only being enabled after issues surfaced in production, in the hope that there would still be time to find the root cause. It wasn’t easy for the app support team either, who had to enable and disable this monitoring. They had previously set thresholds and configuration to tell the monitoring solution what to look for without killing the application altogether with monitoring overhead. It turned out they had been doing this on and off for the last three weeks and had wasted a lot of time trying to react and piece together what happened. The tone of the customer’s voice became one of frustration as they spoke about their issues. They just wanted their pain to go away and were prepared to do whatever it took.
The good news was that a few hours later the customer was up and running with a different monitoring solution in production, and the results, as I expected, were rather different. A few hours? How so? The customer on this occasion was keen to solve their issues. However, they didn’t have spare hardware on which to install our monitoring infrastructure, so they chose our multi-tenant SaaS deployment option, which allows AppDynamics to provide application monitoring infrastructure as a service, similar to how Salesforce.com provides its CRM software. All the customer had to do was register for a SaaS login and install our lightweight agents on their application tiers, and they were up and running, monitoring their application in production with AppDynamics. Sure enough, after a few minutes monitoring data began to flow, their application topology was mapped, business transactions were discovered and the customer was seeing performance data they’d never seen before. The surprise for the customer on this occasion was that they’d configured nothing; all they had done was install a few of our standard agents. They genuinely couldn’t believe monitoring could be this fast and easy.
Two days later, sure enough, the customer had an application slowdown just after lunch. However, this time they were monitoring every business transaction running through their application. They noticed that a specific business transaction was continuously stalling shortly before one application JVM eventually hung. When the customer looked at the diagnostic data for this transaction, the call graph analysis showed that all the response time was spent waiting for a remote third-party web service request to complete. These waiting web service requests were exhausting the JVM thread pool over time: each stalled request held on to a thread, so fewer and fewer threads were left to serve new user requests, which eventually timed out and left the JVM totally unresponsive. A key point here was that the customer’s application wasn’t at fault. It was simply connecting to a slow service provider and waiting for it to respond. Visually seeing this bottleneck and having evidence of what was causing it meant the customer was able to have a frank conversation with their web service provider to resolve the issue. Besides bringing the customer huge relief, it restored their confidence in APM toolsets. 24/7 production monitoring became a mandate overnight. The customer just wasn’t prepared to compromise, given the effort and time they’d wasted previously with other tools.
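To make the thread pool problem a little more concrete, here is a minimal Java sketch of one common defence: putting an upper bound on how long a request will wait for a remote call. This is not the customer’s code or their eventual fix; the class name `BoundedRemoteCall`, the helper `callWithTimeout`, the 200 ms limit and the sleeping `slowService` stand-in are all hypothetical and exist only for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedRemoteCall {

    // Hypothetical helper: runs remoteCall on the pool and waits at most
    // timeoutMillis for its result, returning fallback otherwise.
    static String callWithTimeout(ExecutorService pool, Callable<String> remoteCall,
                                  long timeoutMillis, String fallback) {
        Future<String> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker so a stalled call does not hold its thread.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status and give up on the call.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The remote call itself failed.
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Assumed stand-in for a slow third-party web service.
            Callable<String> slowService = () -> {
                Thread.sleep(5_000);
                return "late response";
            };
            // Assumed stand-in for a healthy service.
            Callable<String> fastService = () -> "ok";

            System.out.println(callWithTimeout(pool, slowService, 200, "fallback"));
            System.out.println(callWithTimeout(pool, fastService, 200, "fallback"));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Without a bound like this, every request that reaches the slow provider keeps its thread until the provider answers, which is the pattern the call graph exposed. Note that cancelling only helps if the underlying call responds to interruption, so in practice you would typically also set connect and read timeouts on the HTTP client itself.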
Application Performance Management solutions have moved on. They don’t have to be complex, intrusive and expensive anymore. The next time you feel pain in production, you should evaluate an APM solution like AppDynamics. You can request a free 30-day trial here and be up and running in just a few hours, like the customer mentioned above.
This customer is just one example of why application performance management is still unsolved. If you’re not monitoring production today and you’ve already bought tools, ask yourself and your vendor why that is the case.