Machine Learning Anomaly Detection Methods for APM
Building artificial intelligence and machine learning into your application performance monitoring (APM) strategy is the best way to drive proactive remediation. Let’s compare this with more manual methods.
When an entity in your application behaves abnormally, you’ll want to know where the anomaly occurred so you can identify the root cause and take action quickly. Whether this anomaly is good for business (say, excess traffic due to a Black Friday discount that drove up sales) or bad (a performance glitch that’s impacting revenue), being able to trace the root cause is critical for resolving potential application performance problems before they impact users.
Anomaly detection kicks off this process — and machine learning methods are increasingly being used to automate it.
Why? Well, in the past, another viable option was to manually define what constitutes “normal” application behavior. This was no easy task using legacy monitoring tools, which required operations or SRE professionals to guesstimate thresholds for all metrics measured by applications, frameworks, databases, operating systems, virtual machines, hypervisors, and underlying storage. But it was doable when most businesses had only a handful of metrics and smaller datasets to track.
Now, the modern business processes vastly higher volumes of data. Rapid change in applications is the new norm, and continually adjusting thresholds to keep up with these ever-changing needs is simply unrealistic.
So the principles of anomaly detection remain the same: Create a baseline for normal behavior, taking into account traffic and load variables like past performance or seasonality, then evaluate new data points against that model, separating the important alerts from the false positives. However, with the sheer volume of data and the thousands (if not millions) of metrics to manage today, manual anomaly detection cannot scale. Issues in the data go undetected or take too long to resolve, which drives up MTTR, puts SLA compliance and customer trust at risk, and reduces IT’s bandwidth to innovate.
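To make the baseline-then-evaluate principle concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular APM product works: it assumes a simple per-hour mean and standard deviation as the "seasonal" baseline, and the function names, the sample data, and the `k=3.0` cutoff are all hypothetical.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Build a per-hour baseline from (hour_of_day, value) pairs.

    Returns {hour: (mean, stdev)} so that seasonality by hour is respected.
    """
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    # At least two samples are needed to estimate a standard deviation.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def is_anomalous(baseline, hour, value, k=3.0):
    """True if value is more than k standard deviations from that hour's mean."""
    if hour not in baseline:
        return False  # no baseline learned yet, so do not alert
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma

# Hypothetical response times in ms: quiet at 03:00, busy at 12:00.
history = [(3, 100), (3, 110), (3, 90), (3, 105),
           (12, 300), (12, 320), (12, 280), (12, 310)]
baseline = build_baseline(history)

print(is_anomalous(baseline, 12, 315))  # within the normal midday range
print(is_anomalous(baseline, 3, 315))   # the same value at 03:00 stands out
```

The point of the sketch is that the same reading can be normal or abnormal depending on context, which is exactly what a single static threshold cannot capture.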
What’s frustrating is that answers lie right there in the data, waiting to be found. IT teams just need more support connecting those dots, and quickly.
This is where machine learning comes in, as part of a full-stack observability model, to provide powerful insights that would be difficult to obtain via manual means alone. Let’s dig into how it works.
In short, machine learning automates anomaly detection to drive proactive remediation.
Automated anomaly detection uses machine learning anomaly detection algorithms to automatically determine whether a business transaction in your application is performing normally, so that you don’t have to manually configure application health rules.
Then, automated root cause analysis (RCA) comes after anomaly detection to investigate further. It uses machine learning to determine the root cause of the performance problems revealed by anomaly detection.
Together, automated anomaly detection and RCA help reduce three key incident metrics:
- The mean time to detect (MTTD) when an anomaly occurs in a business transaction
- The mean time to investigate (MTTI) the cause of the incident
- The mean time to resolution (MTTR), or the time it takes to resolve the incident
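For readers who want to track these three metrics themselves, the arithmetic can be sketched as follows. Definitions vary between teams, so this example makes explicit assumptions: MTTD is measured from incident start to detection, MTTI from detection to diagnosis, and MTTR from incident start to resolution. The incident records and field names are hypothetical.

```python
from datetime import datetime
from statistics import mean

def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60

def incident_metrics(incidents):
    """Average detection, investigation, and resolution times in minutes."""
    return {
        "MTTD": mean([minutes_between(i["started"], i["detected"]) for i in incidents]),
        "MTTI": mean([minutes_between(i["detected"], i["diagnosed"]) for i in incidents]),
        "MTTR": mean([minutes_between(i["started"], i["resolved"]) for i in incidents]),
    }

# Two hypothetical incidents on the same day.
day = (2024, 1, 15)
incidents = [
    {"started": datetime(*day, 10, 0), "detected": datetime(*day, 10, 10),
     "diagnosed": datetime(*day, 10, 40), "resolved": datetime(*day, 11, 0)},
    {"started": datetime(*day, 14, 0), "detected": datetime(*day, 14, 4),
     "diagnosed": datetime(*day, 14, 14), "resolved": datetime(*day, 14, 30)},
]

metrics = incident_metrics(incidents)
print(metrics)  # {'MTTD': 7.0, 'MTTI': 20.0, 'MTTR': 45.0}
```

Earlier detection shortens the first interval directly, and a suggested root cause shortens the second, which is why both feed into a lower MTTR.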
The idea is to triage issues to the right teams to take actions at exactly the right time — whether that’s negating the suspected cause, or confirming it before drilling down into the exact location of the problem and analyzing logs, snapshots, traces, infrastructure, and so on to discover all impacted components.
Ultimately, by leveraging a variety of specialized machine learning (ML) models, an application performance monitoring solution can detect anomalies in real time and at scale. This enables us to detect problems minutes, or even hours, earlier than systems that rely on traditional thresholding.
Automated anomaly detection is best applied as part of a full-stack observability model. This way, you can not only pinpoint deviating application performance and isolate the problem in near-real-time, but you can also prioritize and automate action — no matter where the problem is located across your entire heterogeneous IT environment.
You can also think of these ML-powered layers of observability as the core elements of artificial intelligence for IT operations, or AIOps. AIOps refers to the use of artificial intelligence (AI) and machine learning to ingest and analyze large volumes of data from every corner of the IT environment, reducing its complexity by bringing data silos together with the means to filter them, detect patterns, and cluster meaningful information for more efficient actioning.
So how can you apply anomaly detection through an AIOps lens?
While both automated and manual anomaly detection can alert you to performance problems in your application, the two anomaly detection methods differ in a few ways.
For one, automated anomaly detection applies specially designed algorithms so you don’t have to configure anything (except if you want to limit anomaly alerts). Instead, it can automatically detect any abnormal reading for errors per minute (EPM) and average response time (ART) metrics, then combine the data it learned from these metric readings using heuristics, which are designed to reduce alert noise.
For another, ML employs multiple techniques to ensure the accuracy of the metric data it collects. It normalizes the metric data: for example, when evaluating EPM, a spike may not indicate a real problem if it is matched by a corresponding increase in calls per minute (CPM). And, rather than applying traditional seasonal baselines, automated anomaly detection can correlate the variance of EPM and ART to CPM to obtain reliable results.
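One way to picture this normalization is to compare errors per call rather than raw EPM. The sketch below is an interpretation, not the product's algorithm: it assumes that an EPM spike matched by a proportional rise in CPM reflects load rather than a fault, and the function names and the `tolerance` factor are hypothetical.

```python
def error_rate(epm, cpm):
    """Errors per call; 0.0 when there is no traffic."""
    return epm / cpm if cpm > 0 else 0.0

def epm_spike_is_real(baseline_epm, baseline_cpm, epm, cpm, tolerance=2.0):
    """Flag an EPM reading only when the error rate, not just the count, has grown.

    The reading is flagged when errors per call exceed `tolerance` times
    the baseline errors per call.
    """
    base_rate = error_rate(baseline_epm, baseline_cpm)
    rate = error_rate(epm, cpm)
    if base_rate == 0:
        return rate > 0
    return rate > tolerance * base_rate

# Baseline: 10 errors per minute at 1,000 calls per minute (1% error rate).
# Black Friday: EPM is 5x higher, but so is CPM, so the rate is unchanged.
print(epm_spike_is_real(10, 1000, epm=50, cpm=5000))  # False

# Real fault: EPM is 5x higher while traffic is flat (5% error rate).
print(epm_spike_is_real(10, 1000, epm=50, cpm=1000))  # True
```

Both readings show the same raw EPM, yet only the second one represents a change in how the application is behaving.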
Here’s a snapshot of how this automated approach differs from manually configuring thresholds:
|Automated anomaly detection|Manual anomaly detection|
|---|---|
|Removes the stress of threshold-setting and updating as your tech stack evolves, by letting machine learning baseline your environment and alert you to metrics that deviate from “normal” (identifying a wider scope of problems than humans could).|Manually created to apply logical conditions that one or more metrics must satisfy. For example, you could monitor the ART to check if this metric deviates from the configured baseline.|
|Minimizes alert storms by using these self-learning dynamic baselines to filter out noise related to key events you frequently encounter (take the Black Friday example).|Requires you to set alerting policies and metrics and continually tune them to reduce alert noise, or else risk having to debug false positives for which you lack context.|
|No configuration required, except to limit anomaly alerting.|Manually created as required across time periods, trends, and schedules.|
|Associates anomalies with business transactions.|Applies to any entity, such as service endpoints as well as business transactions.|
You’ll notice that the full-stack observability model also creates a closed-loop feedback system. This means that once problems are identified (and based on historical data from past issues) ML suggests the best approach to accelerate remediation. But that’s another topic for another day.
The key to automating anomaly detection is finding the right combination of supervised and unsupervised (without human interaction) machine learning. On the one hand, you want the vast majority of data classifications to be done by machines. On the other, there’s a danger in receiving too many irrelevant alerts, and there could be anywhere from zero to three suspected causes, so we still need humans to effectively troubleshoot anomalies.
Good APM solutions address this with intelligent alerting, using ML to filter and correlate only meaningful data into incidents. This prevents alert storms caused by domino effects: for example, a failure in System A triggers an alert and impacts System B, which also triggers an alert, and so on. Intelligent alerting reduces alert fatigue and helps with prioritization based on user and business impact.
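The domino effect is easier to see with a toy example. The sketch below is a plain rule-based illustration rather than an ML model or any vendor's implementation: it assumes you already have a dependency map (hypothetical here) and folds each alert into the incident of the alerting system it ultimately depends on.

```python
def correlate(alerts, depends_on):
    """Group alerting systems into incidents keyed by a suspected origin.

    alerts: set of system names that are currently alerting.
    depends_on: {system: [systems it relies on]} (hypothetical dependency map).
    Simplification: only the first alerting dependency of each system is followed.
    """
    def root_of(system, seen=()):
        # Walk toward the dependency that is also alerting, if there is one.
        for dep in depends_on.get(system, []):
            if dep in alerts and dep not in seen:
                return root_of(dep, seen + (system,))
        return system  # nothing it depends on is alerting: treat as the origin

    incidents = {}
    for system in sorted(alerts):
        incidents.setdefault(root_of(system), []).append(system)
    return incidents

# B relies on A, and C relies on B; D is unrelated.
depends_on = {"B": ["A"], "C": ["B"]}
alerts = {"A", "B", "C", "D"}

print(correlate(alerts, depends_on))
# {'A': ['A', 'B', 'C'], 'D': ['D']}
```

Four alerts collapse into two incidents, and the team that owns System A is the obvious first call for the larger one.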
In case you’re not already convinced, let’s recap the four major areas where ML has the potential to help IT professionals and protect the business:
- Breaking down operational silos
When you’re able to see where the anomaly is coming from, you can get the right teams involved and singing from the same hymn sheet, removing the need for war room scenarios that incur unnecessary costs and inefficiencies.
- Reducing MTTR
ML can reduce MTTR from hours to minutes or even seconds by pinpointing exactly where and when to initiate a performance fix. This pays dividends in IT productivity and lowers the business costs associated with performance problems.
- Proactive monitoring
An AppDynamics survey showed that the majority of enterprises are alerted to an anomaly by users or non-IT teams before IT detects the problem. Rather than reacting to anomalies — by which time users may already be impacted — IT can build a more proactive approach to performance monitoring. By taking in the totality of application environment data continually and automatically, ML can connect the dots between performance insights and business outcomes.
- Better decision-making
ML can surface insights that help IT professionals better (and more quickly) understand the context behind application and business health — and context is everything when you consider the many different application dependencies and performance variables involved. Importantly, you can also better prioritize alerts based on business impact.
And when IT is empowered to leverage these advantages, it can better drive the AIOps mindset required to compete in today’s app-driven, digital-first landscape: valuing proactive over reactive, answers over investigation, and a relentless focus on the customer experience.
Today, the use of AI and machine learning continues to gain momentum in anomaly detection, root cause analysis, and other use cases in APM. And at AppDynamics, we believe AIOps can put IT teams back in control of the tsunami of data generated these days, which far outweighs what humans can handle. It’s not about replacing jobs through automation — it’s about providing teams with the right information when and where it’s needed to help them make smart, informed decisions in real-time.
By unifying AIOps and application intelligence, AppDynamics Cognition Engine empowers enterprises to gain critical insights from trillions of metrics.