You Cannot Improve What You Do Not Measure

Lee Eason, Director of DevOps, Ipreo

Lee Eason, Executive Director, Technology, IHS Markit

Let’s get one thing straight: Your software has performance problems. Whether that is at the top of your mind or not is probably driven by how often your customers report those problems. Regardless of whether you’re thinking about how to solve them or not, building a culture that enables and embraces Application Performance Monitoring (APM) as a normal part of your Software Development Life Cycle (SDLC) will pay dividends.

Application Performance Monitoring

Let’s say you’ve got a SaaS application that has a web application, a services layer, a database, and a file-store. How does your concurrent user count change throughout the day? How many queries are being executed per second? Is there a correlation between a particularly slow loading page and a specific part of the application stack? How long does it take to get a file off the file-store? Do you have enough job-processing servers to handle your queue during the day? These are all questions that APM can help you answer.

So many tool options

This article is about why you should be embracing this type of monitoring and less about which specific tools to use. Still, there are a few tools that are worth considering. I’d strongly encourage you to evaluate Datadog, Microsoft’s Application Insights, or bigger tool suites like New Relic or AppDynamics. You could also roll your own, as Etsy did with StatsD and described in its well-known post “Measure Anything, Measure Everything.”

There is no “improve” without “prove”

APM takes the guesswork out of resolving performance problems that your customers are experiencing, even if they don’t report them. It’s the first way your team will leverage APM, and it’s how you’ll recoup your investment. Simply put, if your teams don’t have access to an APM tool, when they are called on to fix a performance problem, they’re doing it blindfolded. They will spend more time troubleshooting, reproducing issues, and finding the root cause. Without metrics looking across the application, they’ll rightly worry that their change may have hurt performance elsewhere in the application.

With APM, your developers will be able to “peel back” the layers of code that drive those bad customer experiences. They will know what’s wrong rather than having to guess. Without an APM tool, they’re probably dumping metrics to the log and trying to piece together the cause-and-effect chains by hand. That can take hours, even days of work; worse still, it’s a manual, error-prone process. Without APM, when a system is performing poorly, engineers make changes that may not even be related to the issue. They can end up fixing problems that don’t exist. Do this enough times, and you’ve created additional technical debt (known problems in your source code) that they will have to clean up later. Your development team is plenty busy – there’s no need to make more work for them, especially work that doesn’t move your team, or your company, forward.

APM metrics also help the team have confidence in the changes they make. Instead of hoping their change fixed the problem, APM gives them proof they can share with the customer support team and other stakeholders so that everyone involved knows two critical things: the fix worked, and it didn’t break anything else. However, as valuable as all this is, these are the obvious things APM does for you. The real value is in the ripple effects.
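To make the "the fix worked, and it didn’t break anything else" check concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the endpoint names, the sample response times, and the 10% tolerance are made up, and a real APM tool would supply the before and after samples for you.

```python
from statistics import median


def compare(before, after, tolerance=0.10):
    """Classify each endpoint by how its median latency moved after a change."""
    report = {}
    for endpoint in before:
        old, new = median(before[endpoint]), median(after[endpoint])
        change = (new - old) / old  # assumes the old median is greater than zero
        if change <= -tolerance:
            report[endpoint] = "improved"
        elif change >= tolerance:
            report[endpoint] = "regressed"
        else:
            report[endpoint] = "unchanged"
    return report


# Made-up response times in milliseconds, sampled before and after a fix.
before = {"/reports": [900, 950, 1000], "/login": [100, 110, 105], "/search": [200, 210, 190]}
after = {"/reports": [300, 320, 310], "/login": [102, 108, 104], "/search": [205, 195, 200]}
result = compare(before, after)
```

A report where the targeted endpoint reads "improved" and everything else reads "unchanged" is the kind of proof the team can hand to customer support.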

The Observer Effect

We’ve all heard it before: people tend to focus on and improve things that are being measured. This has a tendency to work against most attempts at measuring creative processes, such as software development. If you attempt to measure the productivity of a developer based on how many lines of code they commit, for example, you will likely see a change in behavior that favors that metric, to the detriment of the codebase. This is called the Observer Effect.

This effect can also be really positive, and it’s one of the best reasons to embrace APM.

First, create a simple framework for your engineers to start tracking whatever they want. Don’t put up gates or roadblocks to adding new metrics – it’s just a line of code. Create a fast-track release process for changes that only affect monitoring.
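What might such a framework look like? The sketch below is one possibility in Python, not a recommendation of any particular tool. The helper names (`increment`, `timed`, `snapshot`), the metric names, and the in-memory storage are all hypothetical; in practice the helpers would forward to whichever APM backend you choose.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory registry; a real setup would forward to an APM backend.
_timings = defaultdict(list)   # metric name -> list of durations in seconds
_counters = defaultdict(int)   # metric name -> running count


def increment(name, amount=1):
    """Count an event, e.g. increment("reports.exported")."""
    _counters[name] += amount


@contextmanager
def timed(name):
    """Record how long the wrapped block takes under the given metric name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _timings[name].append(time.perf_counter() - start)


def snapshot():
    """Return a copy of everything recorded so far."""
    return {
        "counters": dict(_counters),
        "timings": {name: list(values) for name, values in _timings.items()},
    }


def render_dashboard():
    # Hypothetical page handler: the instrumentation is the single `with` line.
    with timed("page.dashboard.render"):
        increment("page.dashboard.views")
        return "<html>dashboard</html>"
```

The point of the design is the low cost of entry: once the helpers exist, tracking something new really is one line at the call site.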

Developers will start adding metrics out of curiosity. At first they’ll be obvious things, like page render times. Suddenly though, they’ll see something jump out at them. Why does that one page take so long to render all the time? That outlier represents a bad user experience for your customers that may not have been reported. That errant metric will annoy your teams, and in the process of diagnosing the root cause they’ll end up adding metrics to track all kinds of things. Soon, they’ll have data points for database queries, file transfer times, and other points of interest. Now, when a problem occurs, they can quickly correlate those metrics with other data points, enabling them to rapidly find and fix problems as they become visible.
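As a rough illustration of that correlation step, the Python sketch below asks which layer’s timings move together with a slow page. The metric names and sample values are invented, and a plain Pearson correlation is only one simple way to compare series; an APM tool will typically do this visually on a dashboard.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series cannot explain any variation
    return cov / sqrt(var_x * var_y)


def most_correlated(page_ms, layers):
    """Return (layer name, correlation) for the layer that best tracks page time."""
    scores = {name: pearson(page_ms, series) for name, series in layers.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up samples: one value per request, all in milliseconds.
page_render_ms = [120, 130, 900, 125, 880, 140]
layer_ms = {
    "db.query": [30, 35, 780, 32, 760, 40],
    "filestore.read": [20, 22, 21, 19, 23, 20],
}
layer, score = most_correlated(page_render_ms, layer_ms)
```

With these samples the database series tracks the page spikes far more closely than the file-store series, which tells the team where to look first. Correlation is a lead, not a verdict, so the usual next step is to add finer-grained metrics inside the suspect layer.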

The end result is beautiful. Your engineers start to understand the application, not just architecturally, but in the context of how it is used by your customers – perhaps the most important thing that an engineer can know about an application. I’ve seen situations where the developer who wrote most of the code was shocked by what the metrics showed in production. Why is that? Because customers are resourceful and they will use your system in ways no one could predict, especially if your system has been deployed for a long time. Without APM, your engineers will have no way to know how their system is used by customers, and they’ll be trying to fix an application that doesn’t behave the way they think it behaves. APM can help highlight areas of your code that haven’t kept up with how customers use it, illuminating problems and enabling engineers to find the right solution.

More Insightful Load Tests

Having a suite of automated load tests is a luxury that many teams don’t have. They can be tricky to build and tend to be more brittle than other types of tests. However, the value of knowing the breaking point of your application makes it worth it.

When you add APM data to what you get from your load tests, the whole exercise becomes much more valuable. Now your team can see which component is the bottleneck in performance and scale. The load test shows how many concurrent users you need to break the app, and the APM data shows why the application broke. The key here is that you learn this before your customer experiences a problem.
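Here is a small Python sketch of that pairing, under stated assumptions: the load levels, the per-component timings, and the 500 ms limit are all made up, and the component names simply mirror the example stack from earlier in this article. The load test supplies the user counts; the APM data supplies the per-component breakdown.

```python
def first_breach(steps, limit_ms):
    """Return the first load step whose total response time exceeds the limit."""
    for step in steps:
        if sum(step["component_ms"].values()) > limit_ms:
            return step
    return None


def bottleneck(baseline, breached):
    """Name the component whose latency grew the most between two load steps."""
    growth = {
        name: breached["component_ms"][name] - baseline["component_ms"][name]
        for name in breached["component_ms"]
    }
    return max(growth, key=growth.get)


# Made-up numbers: average milliseconds per component at each load level.
load_steps = [
    {"users": 100, "component_ms": {"web": 40, "services": 60, "database": 50, "filestore": 30}},
    {"users": 500, "component_ms": {"web": 45, "services": 70, "database": 120, "filestore": 32}},
    {"users": 1000, "component_ms": {"web": 55, "services": 90, "database": 900, "filestore": 35}},
]

broken = first_breach(load_steps, limit_ms=500)
culprit = bottleneck(load_steps[0], broken) if broken else None
```

In this invented run the application crosses the limit at 1,000 users, and the database accounts for nearly all of the added latency. The first fact comes from the load test alone; the second is only available because the APM metrics were recorded alongside it.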

A Change in Perspective

The last change to highlight is that your engineering groups will start to look at your application and their jobs differently. The code becomes a way to influence metrics that directly tie to customer experience. When a teammate refactors code in a way that improves the metrics, then the whole team has something they can celebrate. Ego gets set aside in favor of something much more important.

Put another way: APM helps your team organically gel around a common goal that naturally lines up with business priorities. It helps your engineering team remember why they’re here. Writing code isn’t the WHAT, it’s the HOW. Their job is to create a great user experience for your customers, meeting their needs by creating the best software possible. APM creates a firm connection to that mission.

Why don’t we have this already?

If APM is so valuable and so great, then why hasn’t my engineering team already done this, you ask? First, people don’t know what they don’t know. This layer of monitoring is easy to miss, and not many shops do this well in my experience. Second, it’s possible your team wants to put this in place and just can’t find the time to do it. They’re too busy putting features out in production or fighting fires. Make sure your team has enough innovation time in their schedule to make these kinds of targeted improvements. There are a lot of seasoned engineers out there who know in the back of their minds that something like this would be nice, but finding the time to do it can seem impossible.

That said, I can tell you from experience that implementing this layer of metrics is not complicated. It usually doesn’t get done because people don’t understand the value it can bring. Starting an APM initiative can be done in just a few days – a small investment compared to the benefits it will bring to your company. The first time your engineering team proactively heads off a major systems outage or performance slowdown, APM will pay for itself and then some. Make the decision to start an APM initiative, and you’ll be glad you did.