In my last post, I wrote about how using Business Transactions as a management unit is critical for managing modern-day applications efficiently. Sticking to this train of thought, I will focus on how this applies to various aspects of application management. The first area I want to cover is monitoring and troubleshooting.
In a highly distributed system, there are hundreds of CPUs, hundreds of JVMs/CLRs, and millions of lines of code running. If you wanted to attach service levels to every component of that system, you would spend most of your time either eyeballing dashboards or maintaining the configuration associated with alerting.
As I mentioned in the last post, if the business grows, it will require more capacity and more infrastructure, which means newer pieces (and ones that are moving around rapidly). Add the configuration tied to individual lines of code and you can pick either “daunting” or “impossible” to describe the task at hand. Long testing and staging cycles have become a thing of the past.
At the same time, the DevOps tribe is adapting to and embracing the new application landscape rapidly. Their biggest need is the need for speed, which translates into efficiency driven by intelligence in every aspect of application management in production. This is where using Business Transactions can make monitoring more efficient and easier to accomplish.
But wait a minute – am I really suggesting that you watch Business Transaction service levels (which are your business’ bottom line) instead of getting overwhelmed with alerts based on 1000s of key metrics? Doesn’t that mean you are ignoring things that can be going wrong by not giving them enough attention?
Au contraire! In fact, you can actually focus on more with this approach versus traditional monitoring techniques. Let me explain.
Traditional monitoring has always been about averages. You look at some service or some method and its average over time, then set up alerts associated with it. Doing so is great for catching gradual performance degradation or systemic outages. But it won’t help you catch outliers where there is no pattern associated with errors or slow requests.
Let’s look at a couple of examples to understand this better. For brevity I will talk only about slow requests (but the same argument can also be applied to errors).
1) A frequent cause of slow requests, and of a poor user experience, is a transaction whose performance depends on user input. For example, in a shopping cart application, adding one particular item to the cart might slow the application down for that user.
2) Here’s another one. Sometimes one particular node in a big cluster of, say, 50 nodes has trouble servicing requests even though its CPU and memory usage look fine.
In both of these cases, averages would hide the problem. Here’s an example.
Over a period of time, an online checkout application experienced 5,000 Total Checkouts. Of those checkouts, 4,800 were Normal Checkouts and 200 were Slow Checkouts.
In cases like these, the response times of the 4,800 normal checkouts and the 200 slow ones will very likely average out to a number that looks perfectly normal. But the bottom line is that 200 users had a bad checkout, and that needs to be fixed. By focusing on business transactions only and reducing the points of monitoring, you are actually able to identify and address more performance concerns than you might have otherwise.
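To put rough numbers on this, here is a minimal Python sketch. The checkout counts come from the example above; the response times (200 ms for a normal checkout, 3,000 ms for a slow one) and both thresholds are values I made up purely for illustration.

```python
from statistics import mean

# Counts taken from the checkout example above.
NORMAL_CHECKOUTS = 4800
SLOW_CHECKOUTS = 200

# Assumed values, for illustration only (not from any real system).
NORMAL_MS = 200          # response time of a normal checkout
SLOW_MS = 3000           # response time of a slow checkout
AVERAGE_ALERT_MS = 500   # alert that fires on the average response time
SLOW_REQUEST_MS = 1000   # per-request threshold for "slow"

response_times = [NORMAL_MS] * NORMAL_CHECKOUTS + [SLOW_MS] * SLOW_CHECKOUTS

# Traditional approach: alert on the average.
average_ms = mean(response_times)
average_alert_fires = average_ms > AVERAGE_ALERT_MS

# Per-request approach: check every request against the threshold.
slow_count = sum(1 for t in response_times if t > SLOW_REQUEST_MS)

print(f"average response time: {average_ms} ms")
print(f"average-based alert fires: {average_alert_fires}")
print(f"slow checkouts found per request: {slow_count}")
```

With these assumed numbers the average comes out to 312 ms, well under the 500 ms alert, while the per-request check still counts all 200 slow checkouts.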
So what exactly are we watching here?
a) Response time for a transaction – this is averaged over all requests for this transaction. We don’t need to watch the key methods being executed in the request since the overall response time for the request is the single indicator. A method-level drill down is needed only when the response time is slow.
b) Number of slow requests over time – If we set up thresholds over the transaction and watch every request, we are able to identify exactly how many requests were outliers. Of course, for this to be useful, the system needs to be able to collect diagnostic information as slow requests happen and not afterwards. This also ensures that when the request is being serviced by a problematic node, it is represented appropriately.
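The per-request idea in (b) can be sketched in a few lines of Python. Everything here is hypothetical: the Request record, the slow_requests_by_node helper, the node names, the sample response times, and the 1,000 ms threshold are stand-ins I chose for illustration, not part of any real product API.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One serviced request (hypothetical record, for illustration)."""
    transaction: str
    node: str
    response_ms: float


def slow_requests_by_node(requests, threshold_ms):
    """Count requests slower than threshold_ms, grouped by the node that served them."""
    counts = Counter()
    for request in requests:
        if request.response_ms > threshold_ms:
            counts[request.node] += 1
    return counts


# Made-up sample: three nodes serve 100 checkout requests each,
# and 20 of the requests on node-3 are slow.
requests = []
for node in ("node-1", "node-2", "node-3"):
    for i in range(100):
        is_slow = node == "node-3" and i < 20
        requests.append(Request("checkout", node, 2500.0 if is_slow else 200.0))

overall_average_ms = mean(r.response_ms for r in requests)
slow_by_node = slow_requests_by_node(requests, threshold_ms=1000.0)

print(f"overall average: {overall_average_ms:.1f} ms")
print(f"slow requests by node: {dict(slow_by_node)}")
```

In this made-up sample the overall average stays in the mid-300s of milliseconds, yet the per-node count points straight at node-3. Grouping by node is just one choice; the same counter could be keyed by transaction name or by user input.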
We’re not the only ones who believe that monitoring Business Transactions is critical to better performance in production. The customers and prospects we speak to are serious about having a system in place that can watch all requests instead of just watching averages in their system. It’s becoming a real-world demand with real-world benefits.
The last thing I want to mention before signing off is that AppDynamics does all of the above. We follow all requests to monitor SLAs (not just averages, but actual numbers), and we do so at an extremely low overhead, enabling us to jump in and get diagnostic data as bad requests happen!
That wasn’t too much of a plug, right? Until next time…