Anomaly Detection and Root Cause Analysis with AppDynamics Cognition Engine



Summary
By unifying AIOps and application intelligence, AppDynamics Cognition Engine empowers enterprises to gain critical insights from trillions of metrics. Let's see it in action.

When we talk about AppDynamics’ powerful Cognition Engine, we’re talking about the automatic detection of anomalies in business transactions. Cognition Engine enables you to pinpoint deviating application performance, isolate resource problems, and respond with automated actions. By leveraging a variety of specialized machine learning (ML) models, AppDynamics can detect anomalies in real time and at scale. This enables it to surface problems minutes, or even hours, before a system that relies on traditional static thresholds would flag them.
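To make the contrast with static thresholding concrete, here’s a minimal, hypothetical sketch (not the actual Cognition Engine models, whose internals aren’t described here): a fixed threshold only fires once a hard limit is crossed, while a learned baseline flags any value that falls outside the expected range inferred from recent behavior.

```python
import statistics

def static_threshold_alert(value_ms, threshold_ms=1000):
    """Traditional thresholding: alert only once a fixed limit is crossed."""
    return value_ms > threshold_ms

def baseline_alert(history_ms, value_ms, k=3.0):
    """Learned-baseline detection: alert when a value deviates from the
    expected range inferred from recent behavior (mean +/- k * stdev)."""
    mean = statistics.mean(history_ms)
    stdev = statistics.pstdev(history_ms) or 1.0  # avoid divide-by-zero range
    return abs(value_ms - mean) > k * stdev

# Response times that normally sit around 600 ms.
history = [580, 610, 595, 620, 605, 590, 615, 600]

# A degradation to 900 ms is still under the 1000 ms static threshold,
# but it is far outside the learned expected range.
print(static_threshold_alert(900))   # False: the static rule stays silent
print(baseline_alert(history, 900))  # True: the baseline flags it early
```

The point of the sketch is the gap between the two answers: the baseline model raises an alert while the static rule is still quiet, which is exactly the head start the engine gives you.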

Anomaly Detection and Root Cause Analysis: A Step-By-Step Guide

Let’s take a close look at anomaly detection and root cause analysis enabled by the AppDynamics Cognition Engine. In this blog, we’ll learn how to:

  • Quickly detect an issue within an application
  • Drill down to determine the issue’s root cause
  • Enable anomaly detection and root cause analysis


Below is a sample application, ERetail-Pass, which we’ll use for this demo. If you’re unfamiliar with AppDynamics, the diagram in the center of the screen is the Application Flow Map, which is created automatically when you install an AppDynamics agent.


The AppDynamics agent automatically discovers the various components and how they communicate with each other. Our demo app has a number of different Java services, as well as automatically discovered queues and Redis caches. The Application Flow Map shows a number of KPIs, including load, response time, and errors.

The screen below shows a performance issue that has just occurred: response time increased dramatically around 10:30 a.m., coinciding with a drop in load.


At this point, some KPIs show we have a problem, but we’re not sure why. The right side of the screen shows that a number of health rule violations have been triggered:


We also see something new—an anomaly event. Let’s take a closer look at this by clicking Anomalies Started.


Above we see that AppDynamics has detected a major anomaly, which started in the Critical state on the Checkout business transaction. From here we can drill deeper to view more details on this particular anomaly and why it occurred.

Below is our new Anomaly Detection and Root Cause Analysis screen. In the upper left corner, we see the anomaly is now resolved. The timeline shows the anomaly started at around 10:39 a.m., transitioned to a warning at 11:09 a.m., and was resolved at 11:13 a.m. Because the timeline is interactive, you can click on different states, such as the critical state in red or the warning state in yellow, for more details. The anomaly started as critical, so let’s explore that segment of the timeline.


The left side of the screen (above, under /r/Checkout) shows high-level information on why the anomaly occurred. We see it occurred on a checkout transaction and that two metrics were deviating: 95th Percentile Response Time and Average Response Time.

Further down the left side, we’ve identified the top two suspected causes for why the 95th Percentile and Average Response Time are deviating. (In a moment, we’ll drill into these to show how we determine root cause.)


In AppDynamics, actions can be associated with a policy. Here we see two actions were taken: HTTP notifications to a Slack channel, alerting people to an issue with this particular business transaction.


AppDynamics has a variety of other actions that you can execute, such as email, SMS messages, or integrations with third parties such as ServiceNow. We also provide Transaction Snapshots for each business transaction. If, for instance, you’d like to dive deeper into the root cause of a particular issue, you can drill down into individual Transaction Snapshots.
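As a rough illustration of what an HTTP notification action like the one above might deliver, here’s a short Python sketch that posts an anomaly summary to a Slack incoming webhook. The payload-building helper and field names are our own assumptions, the webhook URL is a placeholder you’d generate in Slack, and this is not AppDynamics’ internal implementation.

```python
import json
import urllib.request

def build_slack_message(anomaly):
    """Format an anomaly summary as a Slack incoming-webhook payload."""
    return {
        "text": (
            f":rotating_light: Anomaly on business transaction "
            f"{anomaly['transaction']} is {anomaly['state']} "
            f"(deviating metrics: {', '.join(anomaly['metrics'])})"
        )
    }

def notify_slack(webhook_url, anomaly):
    """POST the summary to a Slack channel; Slack returns HTTP 200 on success."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_slack_message(anomaly)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

anomaly = {
    "transaction": "/r/Checkout",
    "state": "CRITICAL",
    "metrics": ["95th Percentile Response Time", "Average Response Time"],
}
# notify_slack("https://hooks.slack.com/services/<your-webhook-path>", anomaly)
```

Slack’s incoming webhooks accept a simple JSON body with a `text` field, which is why a one-function formatter is enough here; richer integrations (ServiceNow, PagerDuty, etc.) would follow the same pattern with their own payload shapes.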


Now let’s take a look at the suspected causes. The Application Flow Map shows a couple of items in red:


By hovering over the red item on the right, we see its root cause is related to the payment service tier. 


On the left side of the screen under Top Suspected Causes, we see there’s an issue with process CPU usage. This could mean the payment service has a CPU problem, which is affecting the business transaction.


The next suspected cause is the order service, which makes calls to the payment service and has been impacted as well.


Since we’ve identified the payment service issue as the top root cause, let’s explore it further to see what we can uncover.


The Top Deviating Metrics timeline shows the metrics associated with this business transaction: the 95th percentile and average response times. The gray band shows the expected range for each metric. Both metrics are far above it: the expected range is roughly 500 to 700 milliseconds (ms), but the actual values sit around 1,600 ms.

Around 11:05 a.m., both metrics return to the normal range; that’s where the anomaly ends. Bottom line: it’s clear we have a problem, with both metrics significantly elevated during the anomaly window.


We then scroll down to Suspected Cause Metrics. Here we see two metrics, Process CPU Used and Process CPU Burnt, on a particular node. These metrics correlate closely with the business transaction metrics. We now know that high process CPU usage is causing this particular tier to slow down, which in turn is slowing the overall business transaction.
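The kind of correlation we just eyeballed can be approximated with a plain Pearson coefficient. Below is a hypothetical Python sketch with made-up per-minute samples (not real AppDynamics data or the engine’s actual algorithm): when Process CPU Used spikes, average response time spikes with it, yielding a coefficient near 1.0.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sy = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-minute samples spanning the anomaly window.
cpu_used = [20, 25, 85, 92, 95, 90, 30, 22]              # Process CPU Used (%)
resp_ms = [600, 640, 1450, 1580, 1620, 1550, 700, 610]   # avg response time (ms)

print(f"correlation: {pearson(cpu_used, resp_ms):.2f}")  # close to 1.0
```

A coefficient this close to 1.0 is what makes CPU usage a convincing suspected cause: the two series rise and fall together throughout the window, rather than merely overlapping once.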

Recap

By drilling into this business transaction, we detected an issue with the 95th percentile and average response times. We then drilled into the top suspected cause and saw that the payment service had a CPU issue. Specifically, CPU usage was very high, impacting the payment service that the order service relies on, which in turn was slowing the business transaction.

How to Enable Anomaly Detection in AppDynamics

Now that we’ve identified an application issue and its root cause, let’s take a quick look at how to enable AppDynamics’ anomaly detection and root cause analysis. 


Enabling anomaly detection and root cause analysis is straightforward. In the Alert & Respond screen (above), you’ll notice a new section called Anomaly Detection. Clicking into Anomaly Detection reveals a toggle that can be turned on or off. This toggle enables Anomaly Detection for every business transaction in your application.

 

 

It’s that easy! No other configuration is required. Keep in mind that Anomaly Detection is enabled on a per-application basis. Once enabled, AppDynamics needs 24-48 hours to build your machine learning models, after which it continually monitors your business transactions and alerts you when it finds issues.

Anomaly events are very similar to health rule violations in AppDynamics, and you’ll see them appear in the Events tab. You can associate anomaly events with policies and create actions, as we saw above in the Slack integration example.

We hope this step-by-step walkthrough gives you a clear view of the powerful capabilities of anomaly detection and root cause analysis in AppDynamics.