Mean time to repair (MTTR) measures the average time from when an issue is initially detected to the moment the component or system's functionality is fully restored. MTTR is a useful metric to assess the maintainability of an application or infrastructure, the lifecycle costs of equipment, and the efficiency of an organization's DevOps team.
Components or systems that can be repaired quickly will have a low MTTR, and associated outages are likely to have less of an impact on business outcomes. A high MTTR can result in significant unplanned downtime and may have a negative impact on the overall user experience.
Measuring diagnostic time, repair time, testing, and other activities that relate to identifying and mitigating performance issues can provide essential clues about your team's incident management capabilities and may highlight potential areas of improvement that can help optimize your application, infrastructure, and workflow.
MTTR is a key performance indicator and a critical component of developing an agile and dynamic DevOps strategy.
In the past, MTTR mostly referred to hardware, and IT teams used a combination of redundancy and replacing devices prior to the predicted end of their lifecycle to proactively avoid system failures.
The adoption of cloud computing has shifted much of the responsibility for maintenance and performance to Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) providers, with the acceptable MTTR often negotiated as part of service level agreements (SLAs). As a result, DevOps teams can often focus solely on debugging their own applications or on-premises equipment.
A basic MTTR is calculated by dividing the total amount of time a service was unavailable in a given period by the number of incidents within that period.
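As a worked illustration of this calculation, the figures below are made up for the example (three incidents and six hours of total downtime in one month) and do not come from real data:

```latex
\[
\text{MTTR}
  = \frac{\text{total downtime in period}}{\text{number of incidents in period}}
  = \frac{6\ \text{hours}}{3\ \text{incidents}}
  = 2\ \text{hours per incident}
\]
```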
Tracking the total time between when a support ticket is created and when it is closed or resolved is an effective method for obtaining an average MTTR metric.
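The ticket-based approach can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `mean_time_to_repair`, the (opened, closed) tuple layout, and the sample timestamps are all assumptions made for the example, and a real ticketing system would supply its own fields.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(tickets):
    """Average open-to-close duration over resolved tickets.

    `tickets` is a list of (opened, closed) datetime pairs; a ticket that
    is still open has closed=None and is left out of the average.
    """
    durations = [closed - opened for opened, closed in tickets if closed is not None]
    if not durations:
        raise ValueError("no resolved tickets in the period")
    # Start the sum from a zero timedelta so the durations add up correctly.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data: three resolved tickets and one still open.
tickets = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 10, 30)),    # 1 h 30 min
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 45)),  # 45 min
    (datetime(2024, 1, 20, 22, 15), datetime(2024, 1, 21, 0, 0)),   # 1 h 45 min
    (datetime(2024, 1, 28, 8, 0), None),                            # still open
]

mttr = mean_time_to_repair(tickets)
print(mttr)  # 1:20:00
```

Excluding open tickets is one possible design choice; a team might instead count them up to the current time so that long-running incidents are not hidden from the average.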
Establishing a baseline for identifying and resolving performance issues and working continuously to improve upon that number results in reduced costs, improved reliability, and increased customer satisfaction.
Failure metrics are valuable KPIs that allow organizations to track the reliability of their systems. "Failure" doesn't necessarily indicate a complete outage, but can also represent general functionality issues or degradation. Other important failure metrics to be aware of include:
Mean time to recovery
MTTR can also stand for mean time to recovery, resolve, or resolution. In this sense it is the average length of time between when a problem arises and when it is solved.
Mean time to failure
MTTF represents the average operating lifespan of a system or component that is not repairable. Because such an item is replaced rather than repaired, there is no repair time to calculate.
Mean time between failures
MTBF denotes the average operational time between failures and is used to forecast the availability of repairable systems and components. MTBF is calculated by averaging the operational time between one failure being resolved and the next failure occurring.
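One way to compute MTBF from an outage log is sketched below. It assumes each outage is recorded as a (failed_at, restored_at) pair and treats the operational time between failures as the gap from one restoration to the next failure; the function name and the sample outages are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(outages):
    """Average uptime between the end of one outage and the start of the next.

    `outages` is a list of (failed_at, restored_at) datetime pairs.
    """
    ordered = sorted(outages)
    # Pair each outage with the one that follows it and measure the uptime gap.
    gaps = [
        next_failed - prev_restored
        for (_, prev_restored), (next_failed, _) in zip(ordered, ordered[1:])
    ]
    if not gaps:
        raise ValueError("at least two outages are needed to measure a gap")
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical sample data: three outages in one month.
outages = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 2, 0)),
    (datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 3, 0)),
    (datetime(2024, 1, 11, 3, 0), datetime(2024, 1, 11, 5, 0)),
]

mtbf = mean_time_between_failures(outages)
print(mtbf)  # 5 days, 0:00:00
```

Here the uptime gaps are four days and six days, which average to five days.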
Collecting data-based evidence of when failures may occur and what the potential impact may be is crucial to effectively managing, monitoring, and mitigating performance issues.
AppDynamics helps your organization establish a center of monitoring excellence for efficient and effective performance monitoring and application management. Harness the power of artificial intelligence and machine learning with the Cognition Engine to guide root cause analysis and improve productivity while reducing MTTR, SLA breaches, and system downtime.