The word “analytics” is an interesting and often abused term in the world of application monitoring. For the sake of correctness, I’m going to reference Wikipedia in how I define analytics:
Analytics is the discovery and communication of meaningful patterns in data.
Simply put, analytics should make IT’s life easier. Analytics should point out the bleeding obvious from all the monitoring data available, and guide IT so they can effectively manage the performance and availability of their application(s). Think of analytics as “doing the hard work” or “making sense” of the data being collected, so IT doesn’t have to spend hours figuring out for themselves what is being impacted and why.
The discovery bit is about how effectively a monitoring solution can self-learn the environment it’s deployed in, so it’s able to baseline what is normal and abnormal for that environment. This is really important, as every application and business transaction is different. A key reason why many monitoring solutions fail today is that they rely on users to manually define what is normal and abnormal using static or simplistic global thresholds. Think of the classic “alert me if server CPU > 90%” and “alert me if response times are > 2 seconds” rules, both of which normally result in a full inbox (which everyone loves) or an alert storm for IT to manage.
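To make the contrast concrete, here’s a rough sketch (my own illustration, not AppDynamics’ actual algorithm) of a per-transaction self-learned baseline using Welford’s online mean/standard-deviation algorithm; the warm-up count and the 2-standard-deviation factor are illustrative choices:

```python
from collections import defaultdict
from math import sqrt

class Baseline:
    """Running mean and standard deviation for one transaction type
    (Welford's online algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    @property
    def stddev(self):
        return sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

# One self-learned baseline per transaction type -- no global threshold
baselines = defaultdict(Baseline)

def is_abnormal(txn_type, response_ms, min_samples=30, k=2.0):
    """Flag a response time only relative to its own transaction's
    history; min_samples and k are invented tuning values."""
    b = baselines[txn_type]
    abnormal = b.n >= min_samples and response_ms > b.mean + k * b.stddev
    b.update(response_ms)
    return abnormal
```

The point of the sketch: a 7-second checkout and a 200-millisecond search each get their own notion of “normal,” so neither a static CPU rule nor a single global response-time cutoff is needed.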
The communication bit of analytics is just as important as the discovery bit. How well can IT interpret and understand what the monitoring solution is telling them? Is the data shown actionable, or does it require manual analysis, knowledge or expertise to arrive at a conclusion? Does the user have to look for problems on their own, or does the monitoring solution present problems by itself? A monitoring solution should provide answers rather than questions.
One thing we did at AppDynamics was make analytics central to our product architecture. We’re about delivering maximum visibility through minimal effort, which means our product has to do the hard work for our users. Our customers today are solving issues in minutes versus days thanks to the way we collect, analyze and present monitoring data. If your applications are agile, complex, distributed and virtual then you probably don’t want to spend time telling a monitoring solution what is normal, abnormal, relevant or interesting. Let’s take a look at a few ways AppDynamics Pro is leveraging analytics:
Seeing The Big Picture
Seeing the bigger picture of application performance allows IT to quickly prioritize whether a problem is impacting an entire application or just a few users or transactions. For example, in the screenshot to the right we can see that in the last day the application processed 19.2 million business transactions (user requests), of which 0.1% experienced an error. 0.4% of transactions were classified as slow (> 2 SD), 0.3% were classified as very slow (> 3 SD) and 94 transactions stalled. The interesting thing here is that AppDynamics used analytics to automatically discover, learn and baseline what normal performance is for the application. No static, global or user-defined thresholds were used – the performance baselines are dynamic and relative to each type of business transaction and user request. So if a credit card payment transaction normally takes 7 seconds, it shouldn’t be classified as slow relative to other transactions that may only take 1 or 2 seconds.
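The slow (> 2 SD) and very slow (> 3 SD) buckets above could be applied roughly as follows, given a transaction type’s learned mean and standard deviation. This is a hedged sketch of mine, not AppDynamics’ code, and the stall cutoff is an invented placeholder:

```python
def classify(response_ms, baseline_mean, baseline_stddev, stall_ms=45_000):
    """Bucket a single request relative to its own baseline, mirroring
    the slow (> 2 SD) / very slow (> 3 SD) rules described above.
    stall_ms is a made-up cutoff, not a documented AppDynamics value."""
    if response_ms >= stall_ms:
        return "stalled"
    if response_ms > baseline_mean + 3 * baseline_stddev:
        return "very slow"
    if response_ms > baseline_mean + 2 * baseline_stddev:
        return "slow"
    return "normal"

# e.g. a transaction averaging 123 ms with a 40 ms standard deviation
print(classify(123, 123, 40))   # normal
print(classify(250, 123, 40))   # very slow
```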
The big picture here is that application performance generally looks OK, with 99.3% of business transactions having a normal end user experience with an average response time of 123 milliseconds. However, if you look at the data shown, 0.7% of user requests were either slow or very slow, which is almost 140,000 transactions. This is not good! The application in this example is an e-commerce website, so it’s important we understand exactly what business transactions were impacted out of those 140,000 that were classified as slow or very slow. For example, a slow search transaction isn’t the same as a slow checkout or order transaction – different transactions, different business impact.
Understanding the Real Business Impact
The below screenshot shows business transaction health for the e-commerce application, sorted by number of very slow requests. AppDynamics uses analytics in this view to automatically classify and present which business transactions are erroneous, slow, very slow or stalling relative to their individual (self-learned) performance baselines. At a quick glance, you can see that two business transactions, “Order Calculate” and “OrderItemDisplayView”, are breaching their performance baselines.
This information helps IT determine the true business impact of a performance issue so they can prioritize where and what to troubleshoot. You can also see that the “Order Calculate” transaction had 15,717 errors. Clicking on this number would reveal the stack traces of those errors, allowing the APM user to easily find the root cause. In addition, we can see that the average response time of the “Order Calculate” transaction was 576 milliseconds and the maximum response time was just over 64 seconds, along with 10,393 very slow requests. If AppDynamics didn’t show how many requests were erroneous, slow or very slow, then the user could spend hours figuring out the true business impact of such an incident. Let’s take a look at those very slow requests by clicking on the 10,393 link in the user interface.
Seeing Individual Slow User Business Transactions
As you can probably imagine, using average response times to troubleshoot business impact is like putting a blindfold over your eyes. If your end users are experiencing slow transactions, then you need to see those individual transactions to troubleshoot them effectively. That’s why AppDynamics uses real-time analytics to detect when business transactions breach their performance baseline, collecting a complete blueprint of how those transactions executed across and inside the application infrastructure. This enables IT to identify root cause rapidly.
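One lightweight way to maintain the kind of moving baseline this breach detection relies on is an exponentially weighted moving average and variance. The sketch below is purely illustrative; the alpha decay factor and the k multiplier are my own assumptions, not AppDynamics internals:

```python
class MovingBaseline:
    """Exponentially weighted moving average and variance, so the
    baseline adapts as the application's 'normal' drifts over time.
    alpha controls how fast old behavior is forgotten (a guess here)."""
    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.mean = None
        self.var = 0.0

    def update(self, x):
        if self.mean is None:          # first sample seeds the baseline
            self.mean = x
            return
        d = x - self.mean
        self.mean += self.alpha * d
        self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)

    def is_breach(self, x, k=2.0):
        """True when x sits more than k standard deviations above the mean."""
        return self.mean is not None and x > self.mean + k * self.var ** 0.5
```

The appeal of the exponential form is that it needs only two numbers of state per transaction type, yet slowly forgets last month’s behavior, which matters for agile applications whose “normal” keeps changing.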
In the screenshot above you can see all “OrderCalculate” transactions have been sorted in descending order by response time, making it really easy for the user to drill into any of the slow user requests. You can also see, looking at the summary column, that AppDynamics continuously monitors the response time of business transactions using moving averages and standard deviations to identify real business impact. Given the results our customers are seeing, we’d say this is a pretty proven way to troubleshoot business impact and application performance. Let’s drill into one of those slow transactions…
Visualizing the Flow of a Slow Transaction
Sometimes a picture says a thousand words, and that’s exactly what visualizing the flow of a business transaction can do for IT. IT shouldn’t have to look through pages of metrics, or GBs of log files, to correlate and guess why a transaction may be slow. AppDynamics does all that for you! Look at the screenshot below, which shows the flow of an “OrderCalculate” transaction that took 63 seconds to execute across 3 different application tiers. You can see the majority of time is spent calling the DB2 database and an external 3rd-party HTTP web service. Let’s drill down to see what is causing that high amount of latency.
Automating Root Cause Analysis
Finding the root cause of a slow transaction isn’t trivial, because a single transaction can invoke several thousand lines of code – kind of like finding a needle in a haystack. Call graphs of transaction code execution are useful, but it’s much faster and easier if the user can shortcut to hotspots. AppDynamics uses analytics to do just that, automatically presenting code hotspots to the user so they can pinpoint the root cause in seconds. You can see in the screenshot below that almost 30 seconds (18.8 + 6.4 + 4.1 + 0.6) was spent in a web service call “calculateTaxes” (which was called 4 times), with another 13 seconds spent in a single JDBC database call (the user can click to view the SQL query). Root cause analysis with analytics can be a powerful asset for any IT team.
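The hotspot arithmetic above is, at its core, aggregation: sum the time per call signature and surface the biggest totals first. Here’s a hedged sketch of that idea; the call list is invented, though the “calculateTaxes” and JDBC timings echo the numbers in this section:

```python
from collections import Counter

# Flattened (call_name, elapsed_seconds) pairs from a hypothetical
# call graph; timings mirror the calculateTaxes/JDBC example above
calls = [
    ("calculateTaxes", 18.8), ("calculateTaxes", 6.4),
    ("calculateTaxes", 4.1), ("calculateTaxes", 0.6),
    ("JDBC query", 13.0), ("renderResponse", 0.3),
]

totals, counts = Counter(), Counter()
for name, secs in calls:
    totals[name] += secs
    counts[name] += 1

# Hotspots first: biggest total time at the top
for name, secs in totals.most_common():
    print(f"{name}: {secs:.1f}s across {counts[name]} call(s)")
```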
Verifying Server Resource or Capacity
It’s true that application performance can be impacted by server capacity or resource constraints. When a transaction or user request is slow, it’s always a good idea to check what impact OS and JVM resources are having. For example, was the server maxed out on CPU? Was Garbage Collection (GC) running? If so, how long did GC run for? Was the database connection pool maxed out? All these questions require a user to manually look at different OS and JVM metrics to understand whether resource spikes or exhaustion were occurring during the slowdown. This is pretty much what most sysadmins do today to triage and troubleshoot servers that underpin a slow running application. Wouldn’t it be great if a monitoring solution could answer these questions in a single view, showing IT which OS and JVM resources were deviating from their baselines during the slowdown? With analytics, it can.
AppDynamics introduced a new set of analytics in version 3.4.2 called “Node Problems” to do just this. The above screenshot shows this view, in which node metrics (e.g. OS, JVM and JMX metrics) are analyzed to determine whether any breached their baseline and contributed to the slow performance of the “OrderCalculate” transaction. Here, % CPU idle, % memory used and MB memory used deviated only slightly from their baselines (denoted by the blue dotted lines in the charts), so server capacity on this occasion was not a contributing factor to the slow application performance. Hardware metrics that did not deviate from their baseline are not shown, thus reducing the amount of data and noise the user has to look at in this view.
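The noise-reduction idea (show only the metrics that stray from their baselines, hide everything else) can be sketched roughly like this; the metric names, values and tolerance are all invented for illustration:

```python
def deviating_metrics(samples, baselines, k=2.0):
    """Keep only metrics whose latest value sits more than k standard
    deviations from its learned mean; everything else is hidden to cut
    noise, in the spirit of the Node Problems view described above."""
    return {
        metric: value
        for metric, value in samples.items()
        if abs(value - baselines[metric][0]) > k * baselines[metric][1]
    }

# Invented sample values and (mean, stddev) baselines
samples = {"cpu_idle_pct": 12.0, "mem_used_pct": 71.0, "gc_time_ms": 40.0}
learned = {
    "cpu_idle_pct": (60.0, 10.0),
    "mem_used_pct": (68.0, 5.0),
    "gc_time_ms": (35.0, 8.0),
}
print(deviating_metrics(samples, learned))  # only cpu_idle_pct is flagged
```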
Analytics Makes IT More Agile
If a monitoring solution is able to discover abnormal patterns and communicate them effectively to a user, then it significantly reduces the amount of time IT has to spend managing application performance, thus making IT more agile and productive. Without analytics, IT can become a slave to data overload, big data, alert storms and silos of information that must be manually stitched together and analyzed by teams of people. In today’s world, “manually” isn’t cool or clever. If you want to be agile then you need to automate the way you manage application performance, or you’ll end up with the monitoring solution managing you.
If your current monitoring solution requires you to manually tell it what to monitor, then maybe you should be evaluating a next generation monitoring solution like AppDynamics.