AIOps, Engineering

Slowdown is the New Outage (SITNO)



Summary
With 'Orange Is The New Black' (OITNB) wrapping its final season, let's reclaim the title formula 'x is the new y' with SITNO. This post explores tracing, monitoring, observability and business awareness. By understanding the differences among these four approaches, you'll be ready to drive agile applications, gain funding to lower technical debt, and focus on customer retention.

Common sources of application outages have been addressed by adopting Agile, DevOps and CI/CD processes. The resulting increase in system uptime lets site reliability engineers (SREs) shift their focus to tuning performance, and for good reason. While outage-driven news headlines can make stock prices plummet in the short term, performance-driven reputation loss is a slow burn that drives customers away over the long term.

Whether customers arrive via web browsers, smartphones or Internet of Things devices, slowdowns drive them to abandon shopping carts and consider competitors. Slowdowns erode an enterprise's reputation, and that loss can even shadow an engineer's career. If you were an SRE weighing a job offer, how much weight would you give to the company's reputation for poor or unpredictable customer experiences?

Just as high blood pressure is a silent killer of people, slowdown is the silent killer of reputations.

Slowdowns vs Outages

Consider the significant differences between outages and slowdowns:


Slowdowns are commonly the result of resource constraints: either you don't have enough of a resource, or you're using it poorly and causing contention. Too many network transactions over narrow bandwidth, or system memory filled with unnecessarily locked pages, can each produce a slowdown. In a prior life managing hospital data centers, I saw invalid HL7 messages generate recurring error records in message queues, choking inter-hospital communications. Nurses had to run results between laboratories and wards because the needless error messages slowed genuine laboratory results from getting through. We know outages lose customers, but when there are no outages, what will drive customer loss?

Slowdown is the new outage. #slowdownisthenewoutage #SITNO

Insight vs Observability

DevOps methodologies came with a minimum requirement for monitoring application performance in production.


In turn, SRE comes with the requirement for observability—the capacity to reach into the code and answer an unpredictable question.

While observability supports diagnosis, insight is needed for resolution. SRE implementations create a team of engineers delivering a platform of products and processes that developers use to ensure the highest availability. SRE also moves the focus from reaction to proaction, creating a requirement to spot the initial predictors of slowdown, and with it the need to observe what code is doing while running in production. Observable metrics need context to become actionable insight.

AIOps delivers the ML-driven automatic baselines and contextual correlation that allow SRE teams to engage preemptively (which in turn improves business outcomes, as Gartner's AIOps paper reports). Once a predictor anomaly is triggered, the SRE team can respond by updating a SQL query, coding a new function call, or scaling up resources to prevent the slowdown from escalating into a threat to the business. Post-response, the SRE team can pass the details back to the application owners for longer-term resolutions.
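To make the idea of an automatic baseline concrete, here is a deliberately simplified sketch: a rolling window of recent latencies defines "normal," and a sample far outside it (a z-score test here, standing in for the richer models an AIOps product would train) fires the predictor anomaly. The class name and thresholds are illustrative assumptions, not any vendor's implementation.

```python
from collections import deque
from statistics import mean, stdev

class BaselineDetector:
    """Flag latency samples that deviate sharply from a rolling baseline."""

    def __init__(self, window=50, threshold=3.0):
        self.samples = deque(maxlen=window)  # recent "normal" latencies
        self.threshold = threshold           # z-score that counts as an anomaly

    def observe(self, latency_ms):
        """Return True if this sample looks anomalous against the baseline."""
        if len(self.samples) >= 10:          # need a minimal baseline first
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and (latency_ms - mu) / sigma > self.threshold:
                return True                  # anomaly: excluded from the baseline
        self.samples.append(latency_ms)
        return False

detector = BaselineDetector()
for t in [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]:
    detector.observe(t)                      # build the baseline (~100 ms)
print(detector.observe(250))                 # → True: a 250 ms spike stands out
```

Note that anomalous samples are not appended to the window, so a developing slowdown cannot quietly drag the baseline upward and mask itself.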

While DTrace or manual breakpoints may be great for single applications on single machines, they will “often fall short while debugging distributed systems as a whole,” notes Cindy Sridharan in Distributed Systems Observability. When trying to diagnose a complete customer experience relying on multiple business transactions in distributed multi-cloud production applications, observability falls short of insight. The good news is that if you have implemented monitoring as part of your DevOps rollout, the APM used to react to outages can be expanded to observe and diagnose slowdowns.

Finding Insight on Top of Observability

Neither monitoring nor observability is an end unto itself. For slowdown detection, we must see the broader picture of the total user experience. We must be able to take a step back from our usual I-shaped technical silos and apply T-shaped skills to seek insight into the causes of slowdowns. 

Supporting observability can bloat applications with additional code that creates metrics for APM to capture. Observability only requires that the individual metrics be present in the code; it does not correlate them into the overall customer experience.

Delivering insight requires several key functions: 

  • Baselines identifying normal performance
  • Segmented metrics of customer business transactions to identify weak points
  • Levers to isolate code portions within the production environment
  • Common trusted metric sources that span technology silos
  • Overhead minimization when performance is normal
  • Noise filtering using ML-trained anomaly detection


Creating observability within each application individually incurs technical debt, while an SRE-supporting APM solution can deliver observability across multiple applications. Moving to a DevOps or SRE model is problematic when you lack an understanding of how to observe and gain insight from metrics. Read more on how APM applies to DevOps.

Remember, it is the metric you don’t watch that bites you.