Mean Time to Repair: What It Means to You

June 25, 2018
 

The newest member of our evangelism team shares his own experiences with system failures and explains why it's important to build systems that recover quickly while still leaving room for innovation.


We’ve all been there: flying home, late at night, after a few delays. Our flight arrives at the airport and we’re anxious to get out of the tin can. Looking outside, we see no one connecting the jet bridge to the aircraft. Seconds seem like minutes as the jet bridge just sits there. “This is not a random event; they should have been expecting the flight,” we tell ourselves over and over again. Finally, a collective sigh of relief as the jet bridge lights up and inches ever closer to our freedom.

Even though the jet bridge was not broken per se, the process of attaching it seemed broken to the end user, a.k.a. “the passenger.” The latency of this highly anticipated action was anxiety-inducing.

As technologists, we deal with increasingly complex systems and platforms. The rise of disciplines like site reliability engineering and chaos engineering has brought rigor to metrics such as mean time to repair (MTTR) and mean time between failures (MTBF).
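To make those metrics concrete, here is a minimal sketch, in Python, of how MTTR and MTBF might be computed from a simple incident log. The timestamps and the 30-day observation window are invented purely for illustration; real tooling would pull this from monitoring or incident-management data.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage start, outage end) pairs for one service.
incidents = [
    (datetime(2018, 6, 1, 2, 15), datetime(2018, 6, 1, 2, 47)),
    (datetime(2018, 6, 9, 14, 3), datetime(2018, 6, 9, 15, 10)),
    (datetime(2018, 6, 20, 23, 55), datetime(2018, 6, 21, 0, 20)),
]

observation_window = timedelta(days=30)  # assumed reporting period

# MTTR: average time spent restoring service, per incident.
total_repair_time = sum((end - start for start, end in incidents), timedelta())
mttr = total_repair_time / len(incidents)

# MTBF: average operating time between failures (uptime / number of failures).
uptime = observation_window - total_repair_time
mtbf = uptime / len(incidents)

print(f"MTTR: {mttr}")
print(f"MTBF: {mtbf}")
```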

For failure to occur, a system doesn’t have to be in a nonresponsive or crashed state. Going back to my jet bridge example, even high latency can be perceived as “failure” by your customers. This is why we have service level agreements (SLAs), which establish acceptable levels of service and the consequences of noncompliance. Violate an SLA, for example, and your business could find itself facing a sudden drop in customer sentiment as well as a hefty monetary fine.
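To make “latency as failure” concrete, here is a tiny sketch of a nearest-rank p99 check against an SLA latency target. The 200 ms threshold and the sample latencies are hypothetical; a production system would evaluate this continuously against real telemetry.

```python
import math

SLA_P99_MS = 200  # hypothetical SLA target for 99th-percentile latency

def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latency samples."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

# Invented request latencies (ms) from a short measurement window.
samples = [87, 92, 110, 95, 430, 101, 99, 105, 98, 96]

observed = p99(samples)
if observed > SLA_P99_MS:
    print(f"SLA breach: p99 latency {observed} ms exceeds {SLA_P99_MS} ms target")
else:
    print(f"Within SLA: p99 latency {observed} ms")
```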

Site reliability engineers (SREs) push for elastic, self-healing infrastructure that can anticipate and recover from SLA violations. Such infrastructure, however, is not trivial to implement or instrument.

Mobile Launch Meltdown

I remember back when I was a consulting engineer with a major mobile carrier as a client. This was about a decade ago, when ordering a popular smartphone on its annual release date was an exercise in futility. I would wait up into the wee hours of the morning to be one of the first to preorder the device. After doing so on one occasion, I headed into the office.

By midday, after preordering had been open for some time, a cascading failure was occurring at my company, one of the many vendors crucial to the preorder process. My manager called me into her office to listen in on a bridge call with the carrier. Stakeholders from the carrier were rightfully upset: “We will make more in an hour today than your entire company makes in a year,” they repeated multiple times.

The pressure was on to rectify the issues and allow the business to continue. As in the novel The Phoenix Project, representatives from different technology verticals joined forces in a war room to fix things fast.

The failure was complex: multiple transaction and network boundaries, and incoming orders arriving at massive speed and scale. The surge itself, however, was not random; the device manufacturer had set the launch date well in advance.

The Importance of Planning Ahead

The ability to tell when a violation state is going to occur—and to take corrective action ahead of time—is crucial. The more insight and time you have, the easier it is to get ahead of a violation, and the less pressure you’ll feel to push out a deployment or provision additional infrastructure.
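One simple way to buy that time is to project a trend toward the violation threshold. The sketch below fits a straight line to recent p99 latency samples and estimates when a (hypothetical) 200 ms SLA target would be crossed; the data points are invented, and a real early-warning system would lean on richer signals and proper anomaly detection.

```python
THRESHOLD_MS = 200  # hypothetical SLA target for p99 latency

# Invented (minute, p99 latency in ms) samples from the last half hour.
samples = [(0, 120), (5, 131), (10, 140), (15, 152), (20, 161), (25, 173)]

# Ordinary least-squares fit of latency against time.
n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in samples)
    / sum((x - mean_x) ** 2 for x, _ in samples)
)
intercept = mean_y - slope * mean_x

if slope > 0:
    crossing_minute = (THRESHOLD_MS - intercept) / slope
    minutes_from_now = crossing_minute - samples[-1][0]
    print(f"Projected SLA breach in roughly {minutes_from_now:.0f} minutes")
else:
    print("Latency trend is flat or improving; no breach projected")
```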

With the rise of cloud-native systems, platforms and applications are increasingly distributed across multiple infrastructure providers. Design patterns such as Martin Fowler’s Strangler Pattern have become cemented as legacy applications evolve to handle the next generation of workloads. Managing a hybrid infrastructure becomes a challenge, a delicate balance between the granular control of an on-prem environment and the convenience and scalability of a public cloud provider.
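To illustrate the core idea of the Strangler Pattern, here is a minimal routing facade that sends already-migrated routes to a new service and leaves everything else on the legacy monolith. The backends and route prefixes are hypothetical; in practice this logic usually lives in an API gateway, load balancer, or reverse proxy rather than in application code.

```python
# Hypothetical backends for a strangler-style migration.
LEGACY_BACKEND = "https://legacy.internal.example.com"
MODERN_BACKEND = "https://orders.cloud.example.com"

# Route prefixes that have already been carved out of the monolith.
MIGRATED_PREFIXES = ("/orders", "/inventory")

def route(path: str) -> str:
    """Return the backend URL that should handle this request path."""
    if path.startswith(MIGRATED_PREFIXES):
        return MODERN_BACKEND + path  # migrated functionality: new service
    return LEGACY_BACKEND + path      # everything else stays on the monolith

print(route("/orders/42"))   # handled by the modern backend
print(route("/billing/17"))  # still handled by the legacy monolith
```

As more functionality migrates, the set of carved-out prefixes grows until the legacy branch can be retired entirely.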

Usually there is no silver bullet for fixing problems at scale; if an issue were truly glaring, the old adage holds that it would have been addressed already. In performance testing, death by a thousand paper cuts plays out across complex distributed systems, so fixing and addressing issues is an iterative process. During a production-impacting event, haste can make waste. Fortunately, with all of the investment in infrastructure-as-code and CI/CD, those fixes can be deployed faster and more systematically than ever.

We might not all experience an incident as major as a mobile phone preorder meltdown, but as technologists we strive to make our systems as robust as possible. We also invest in technologies that enable us to change our systems faster, an essential capability today when so many of us are under the gun to fix what’s broken rather than add new features that delight the customer.

I am very excited to join AppDynamics! I’ve been building and implementing large, distributed, web-scale systems for many years, and I’m looking forward to my new role as an evangelist in the cloud and DevOps spaces. With the ever-increasing complexity of architectures, and the focus on performance and availability to enhance the end-user experience, it’s crucial to have the right data to make insightful changes and investments. And with the synergies and velocity of the DevOps movement, it’s equally important to make educated changes.

Ravi Lachhman
Ravi Lachhman is an evangelist at AppDynamics, focusing on the cloud and DevOps spaces. Prior to AppDynamics, Ravi spent time at Mesosphere, Red Hat, and IBM, helping enterprises and the federal sector design the next generation of distributed platforms. When not helping to further the technology community, Ravi enjoys traveling the world, especially with his stomach.
