Six months ago I did something really stupid. I foolishly jumped on the social media bandwagon, thinking I could become the first superhero to claim online greatness. Sadly, the only meteoric rise has been the disk space quota of my email inbox, all thanks to the billion notifications I now get daily from LinkedIn, Facebook and Twitter. For all I know I could have been poked by He-Man, tweeted at by Krusty the Clown or propositioned by Batman to join forces on LinkedIn. The amount of crap I get these days from trigger-happy social media apps means I simply ignore and delete 99.9% of messages without ever reading them.
Yet back when I was a meek, mild-mannered operations guy, I used to experience the exact same thing with my IT monitoring software. My NOC dashboard changed colors so often we occasionally had ambulances onsite to treat people for epileptic seizures. We received more alerts than a paranoid immigration officer.
What did we do with all these alerts? In the end we ignored them, and it felt damn good. In fact, for 90% of the time our infrastructure and applications ran perfectly; for the other 10%, we felt a little silly when the business kicked our ass for production outages. Even if we had paid attention to the alerts, though, there is no way we had the time or patience to manually correlate alerts and events quickly enough to prevent or resolve outages. Our ticketing system and event console were a sinking ship, one that went down repeatedly with OutOfDiskSpace errors thanks to our frequent alert storms.
Bottom line: our monitoring solutions were useless. They had no intelligence, and they collected data 24/7 without anyone paying attention to them. We tried to tell our IT Ops director this, but he frequently played golf with several of the IT vendors, so selective hearing occasionally got the better of him.
So what exactly causes alert storming?
Simply put, nothing in IT or the business is equal. You can’t define a single alert threshold that fits every application or business transaction. Why? Because change is constant and every business transaction is unique.
Imagine if we set the following static threshold:
“Alert me if any business transaction in my application takes longer than 2 seconds”
Now imagine a simple retail website with two business transactions:
- “Add to Cart”
- “Confirm Payment”
It’s entirely realistic for “Add to Cart” to take a few milliseconds, yet it’s also normal for “Confirm Payment” to take several seconds, because credit card approval is typically handled by a remote third-party provider. With the threshold above, the monitoring solution will fire an alert for every “Confirm Payment” that takes longer than 2 seconds. The reality is that it’s entirely normal for “Confirm Payment” to take longer than 2 seconds, but the person on operations or app support doesn’t know that when an alert lands in their inbox.
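To make the false-positive problem concrete, here is a minimal sketch of static-threshold alerting. The transaction names and 2-second threshold come from the example above; the response times are invented for illustration.

```python
# One global threshold for every transaction -- the setup described above.
STATIC_THRESHOLD_SECONDS = 2.0

# (transaction name, response time in seconds) -- hypothetical samples
samples = [
    ("Add to Cart", 0.05),
    ("Add to Cart", 0.07),
    ("Confirm Payment", 3.1),  # normal: remote card approval is slow
    ("Confirm Payment", 3.4),  # normal
    ("Confirm Payment", 2.8),  # normal
]

def static_alerts(samples, threshold):
    """Fire an alert for every sample over the single global threshold."""
    return [(name, t) for name, t in samples if t > threshold]

for name, t in static_alerts(samples, STATIC_THRESHOLD_SECONDS):
    print(f"ALERT: {name} took {t:.1f}s (> {STATIC_THRESHOLD_SECONDS}s)")
```

Every single “Confirm Payment” fires an alert here, even though all three are perfectly healthy: the threshold has no idea what “normal” means for that transaction.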
It’s the same deal with Infrastructure; you can’t set a threshold like “alert me if any server CPU goes above 95% utilization.”
Why? Because some servers can happily run at 95% or even 99% CPU utilization with no detrimental impact on the application or the business.
The key metric for IT is business activity (business transaction throughput): when it drops, that’s when business impact is felt the most, regardless of whether smoke or steam is pouring out of a server. If business activity is constant and servers are spiking at 95% CPU utilization, why should anyone care?
It’s true that many monitoring solutions let users override static thresholds with exceptions for specific business transactions or servers that need custom values. The problem is that each of these thresholds must then be manually maintained as the application and its infrastructure evolve over time. An application can have hundreds of business transactions and infrastructure components, so maintaining threshold configuration becomes an endless task. This is why most inboxes and event consoles are full of irrelevant alerts.
So what is the solution to alert storming?
Monitoring solutions need to become smarter so they can work harder for their end users. They need to regain the trust and credibility of operations teams so that alerts save time rather than waste it. Analytics, for example, can help a monitoring solution learn and build a dynamic baseline of normal and abnormal behavior for an application and its business transactions over time, ensuring alerts fire only when real issues occur. If a business transaction normally takes 8 seconds to complete, then so be it; the monitoring solution can learn this and fire an alert only when response time deviates from that 8-second baseline. Conversely, if a business transaction normally takes 4ms and jumps to 8ms (a 100% increase), the monitoring solution can detect the issue and send an alert specific to that business transaction.
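The idea above can be sketched in a few lines. This is a simplified illustration, not AppDynamics’ actual algorithm: it learns each transaction’s baseline from its own history and alerts only on a large deviation (three standard deviations here, an arbitrary choice for the sketch).

```python
# Sketch of per-transaction dynamic baselining. The 3-sigma rule and the
# small relative floor on the deviation band are illustrative assumptions.
from collections import defaultdict
from statistics import mean, stdev

class Baseline:
    def __init__(self, min_samples=5, sigmas=3.0):
        self.history = defaultdict(list)  # transaction name -> samples
        self.min_samples = min_samples    # learn before alerting
        self.sigmas = sigmas              # how far is "abnormal"

    def observe(self, txn, response_time):
        """Record a sample; return True if it deviates from the baseline."""
        hist = self.history[txn]
        alert = False
        if len(hist) >= self.min_samples:
            mu, sd = mean(hist), stdev(hist)
            # Floor the band at 1% of the mean so a flat history
            # (zero variance) still tolerates tiny fluctuations.
            alert = response_time > mu + self.sigmas * max(sd, 0.01 * mu)
        hist.append(response_time)
        return alert

b = Baseline()
for t in [8.0, 8.1, 7.9, 8.2, 8.0]:
    b.observe("Confirm Payment", t)        # learning phase: no alerts
print(b.observe("Confirm Payment", 8.2))   # False: within normal range
print(b.observe("Confirm Payment", 12.0))  # True: significant deviation
```

Note that the same logic flags the 4ms-to-8ms case: doubling a tiny baseline blows straight through the deviation band, with no per-transaction configuration required.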
The monitoring solution can also learn daily or seasonal patterns. For example, an application may experience higher transaction volumes on weekends than on weekdays, so a monitoring solution must adjust its view of normal behavior as response times rise with transaction volume during peak hours, days, or months.
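One simple way to model such patterns, sketched below under assumed choices (bucketing by day-of-week and hour, alerting at 1.5x the bucket’s mean), is to keep a separate baseline per time bucket, so a busy Saturday is compared with other Saturdays rather than with a quiet Monday.

```python
# Sketch: seasonal baselining via time buckets. The (weekday, hour) key
# and the 1.5x tolerance are illustrative assumptions, not a real product's
# algorithm.
from collections import defaultdict
from datetime import datetime
from statistics import mean

class SeasonalBaseline:
    def __init__(self, tolerance=1.5):
        self.buckets = defaultdict(list)  # (weekday, hour) -> samples
        self.tolerance = tolerance        # alert if > tolerance x bucket mean

    def observe(self, ts: datetime, response_time):
        """Record a sample; return True if abnormal for this time bucket."""
        hist = self.buckets[(ts.weekday(), ts.hour)]
        alert = bool(hist) and response_time > self.tolerance * mean(hist)
        hist.append(response_time)
        return alert

sb = SeasonalBaseline()
# Saturdays at noon are slow (hypothetical peak), Mondays are fast.
sb.observe(datetime(2024, 1, 6, 12), 6.0)   # Saturday
sb.observe(datetime(2024, 1, 8, 12), 2.0)   # Monday
print(sb.observe(datetime(2024, 1, 13, 12), 6.1))  # Saturday: False, normal
print(sb.observe(datetime(2024, 1, 15, 12), 6.1))  # Monday: True, abnormal
```

The same 6.1-second response is normal on a Saturday and an anomaly on a Monday, which is exactly the context a single static threshold throws away.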
Is your Business Static or Dynamic?
If the monitoring solutions you rely on today use static alerting thresholds, ask your monitoring vendor a simple question: why? Your business isn’t static, and your business transactions and services aren’t all the same. This is why many of today’s monitoring solutions are fundamentally broken (and stupid).
When we built AppDynamics, the first thing we did was build in analytics so we could dynamically baseline the normal behavior of customer applications and their individual business transactions. If you seriously want to reduce MTTR in your organization, you need to be able to identify real issues and act on them fast. Without basic analytics in your monitoring solution, you’ll struggle to consistently find the needles in the giant haystacks of alerts that clutter up your IT operations. The harder monitoring solutions work, the more agile operations can become: teams spend less time configuring and more time managing the performance of applications and the business.