AIOps, Engineering

Slowdown is the New Outage (SITNO)



Summary
With 'Orange Is The New Black' (OITNB) wrapping its final season, let's reclaim the title formula 'x is the new y' with SITNO. This post explores tracing, monitoring, observability and business awareness. By understanding the differences among these four approaches, you'll be ready to drive agile applications, gain funding to lower technical debt, and focus on customer retention.

Common sources of application outages have been addressed by adopting Agile, DevOps and CI/CD processes. The resulting increase in system uptime lets site reliability engineers (SREs) shift their focus to tuning performance, and for good reason. While outage-driven news headlines can make stock prices plummet in the short term, performance-driven reputation loss is a slow burn that drives customers away over the long term.

Whether customers arrive via web browsers, smartphones or Internet of Things devices, slowdowns drive them to abandon shopping carts and consider competitors. Slowdowns erode an enterprise's reputation, and that loss can even shadow an engineer's career. If you were an SRE weighing a job offer, how much weight would you give to the company's reputation for poor or unpredictable customer experiences?

Just as high blood pressure is a silent killer of people, slowdown is the silent killer of reputations.

Slowdowns vs Outages

Consider the significant differences between outages and slowdowns:


Slowdowns are commonly the result of resource constraints: either you don't have enough of a resource, or you're using it poorly and causing contention. Too many network transactions over narrow bandwidth, or system memory filled with unnecessarily locked pages, can each produce a slowdown. In a prior life managing hospital data centers, I saw invalid HL7 messages generate recurring error records in message queues, choking inter-hospital communications. Nurses had to run results between laboratories and wards because the needless error messages slowed genuine laboratory results from getting through. We know outages lose customers, but when there are no outages, what will drive customer loss?

Slowdown is the new outage. #slowdownisthenewoutage #SITNO

Insight vs Observability

DevOps methodologies came with a minimum requirement for monitoring application performance in production.


In turn, SRE comes with the requirement for observability—the capacity to reach into the code and answer an unpredictable question.

While observability supports diagnosis, insight is needed for resolution. SRE implementations create a team of engineers delivering a platform of products and processes that developers use to ensure the highest availability. SRE also moves the focus from reaction to proaction, creating a requirement to spot the initial predictors of slowdown, and with it the need to observe what code is doing while running in production. Observable metrics need context to become actionable insight.

AIOps delivers the ML-driven automatic baselines and contextual correlation that allow SRE teams to engage preemptively (which in turn improves business outcomes, as Gartner's AIOps paper reports). Once a predictor anomaly is triggered, the SRE team can respond by updating a SQL query, coding a new function call, or scaling up resources to prevent the slowdown from escalating into a threat to the business. Post-response, the SRE team can pass the details back to the application owners for longer-term resolutions.
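To make the idea of an automatic baseline concrete, here is a deliberately simplified sketch: a rolling window of recent latencies defines "normal," and a sample far outside it (a z-score test here, standing in for the richer models an AIOps product would train) fires the predictor anomaly. The class name and thresholds are illustrative assumptions, not any vendor's implementation.

```python
from collections import deque
from statistics import mean, stdev

class BaselineDetector:
    """Flag latency samples that deviate sharply from a rolling baseline."""

    def __init__(self, window=50, threshold=3.0):
        self.samples = deque(maxlen=window)  # recent "normal" latencies
        self.threshold = threshold           # z-score that counts as an anomaly

    def observe(self, latency_ms):
        """Return True if this sample looks anomalous against the baseline."""
        if len(self.samples) >= 10:          # need a minimal baseline first
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and (latency_ms - mu) / sigma > self.threshold:
                return True                  # anomaly: excluded from the baseline
        self.samples.append(latency_ms)
        return False

detector = BaselineDetector()
for t in [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]:
    detector.observe(t)                      # build the baseline (~100 ms)
print(detector.observe(250))                 # → True: a 250 ms spike stands out
```

Note that anomalous samples are not appended to the window, so a developing slowdown cannot quietly drag the baseline upward and mask itself.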

While DTrace or manual breakpoints may be great for single applications on single machines, they will “often fall short while debugging distributed systems as a whole,” notes Cindy Sridharan in Distributed Systems Observability. When trying to diagnose a complete customer experience relying on multiple business transactions in distributed multi-cloud production applications, observability falls short of insight. The good news is that if you have implemented monitoring as part of your DevOps rollout, the APM used to react to outages can be expanded to observe and diagnose slowdowns.

Finding Insight on Top of Observability

Neither monitoring nor observability is an end unto itself. For slowdown detection, we must see the broader picture of the total user experience. We must be able to take a step back from our usual I-shaped technical silos and apply T-shaped skills to seek insight into the causes of slowdowns. 

Supporting observability can bloat applications with additional code that creates metrics for APM to capture. Observability only requires that the individual metrics be present in the code; it does not correlate them into the overall customer experience.

Delivering insight requires several key functions: 

  • Baselines identifying normal performance
  • Segmented metrics of customer business transactions to identify weak points
  • Levers to isolate code portions within the production environment
  • Common trusted metric sources that span technology silos
  • Overhead minimization when performance is normal
  • Noise filtering using ML-trained anomaly detection


Creating observability within each application individually incurs technical debt, while an SRE-supporting APM solution can deliver observability across multiple applications. Moving to a DevOps or SRE model is problematic when you lack an understanding of how to observe and gain insight from metrics. Read more on how APM applies to DevOps.

Remember, it is the metric you don’t watch that bites you.