Welcome to Part 2 of the series “Deploying APM in the Enterprise”. In Part 1 I provided some background on what this series is all about and why you should continue reading. Part 2 will provide more foundation and my take on APM maturity. This will not be one of those boring theoretical maturity models; instead, it comes from the reality of fighting fires and using APM tools in an enterprise setting. Whether you’re in a small shop or a giant Fortune 100, these same principles apply. The biggest difference should be the amount of budget you have to work with (sorry small shops, the large enterprises tend to have a lot more money in the budget).
The way I think about APM maturity is in terms of the questions and comments you hear at each level of capability. Here’s an example: when my kid was 3 years old he asked, “Where do babies come from?” That is a very obvious indicator of where he belongs in the “Life Maturity Model”. Anyone asking this question probably belongs at that same level. An important thing to remember about any maturity model is that you can be (and most organizations are) at multiple levels of maturity at any given point in time. So let’s take a look at my version of an APM maturity model!
Hirsch’s APM Maturity Model:
Level 0 – WTF Just Happened?
- We just got a bunch of phone calls that the Website/Application is slow. Really?
- CPU, memory, disk, and network all look great. Why is it still so slow?
- You start making phone calls or start/join a conference call asking:
  - Did you change anything?
  - Do you see anything in the log files?
  - Are we having network problems?
  - Can someone get the DBA on the line? It has to be the database!
- It just started working again. Did anyone change anything?
- Is it fixed?
- 3 AM phone call from the help desk... It’s broken again. Damn!
- How does a business transaction relate to IT infrastructure?
Level 1 – Ouch, too much information!
- Our new monitoring tool sure does provide a lot of data. Look at all these charts to spend hours digging through.
- It took a long time to set up all of those alert thresholds but I bet it will be worth it.
- Why so many alerts? Did everything break at the same time?
- Is anything really wrong? I don’t know, go test the site/app to find out.
- It worked great in dev/test/qa. What’s different about prod?
- We profiled our code in dev and it is still slowing down in prod. Why???
- It’s still slow for our customers? It looks fine from the office.
- Our APM tool is okay for testing but we wouldn’t dare use it in production.
- Does anyone know what dependencies exist between our applications?
- I heard about something called DevOps. Any idea what it is?
Level 2 – Whew, that’s getting better!
- We’re still getting a lot of alerts but now we know if apps are slow or broken.
- We don’t set alert thresholds very often; our tooling alerts us automatically when important metrics deviate from their baselines.
- Looks like some of the functions in our app are always slow. Let’s focus on optimizing the ones that are used the most or are the most important.
- We built a dashboard for our app to show when it gets slow or breaks.
- We can see everything going on in test and prod and know what’s different between environments.
- We know if any of our end users are impacted because we monitor every business transaction.
- Yep, the problem is on line 45 of the DoSomething method.
- We automatically deploy monitoring with our apps. It’s part of our build/release process.
- Our applications and their dependencies automatically get mapped by our tools. No need to guess what will break if we make a change.
- Wouldn’t it be cool if we could automatically react to that spike in workload so our site won’t slow down or crash?
- I wonder if the business felt any impact from that problem?
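To make the Level 2 idea of alerting on deviations from a baseline (instead of hand-set thresholds) a bit more concrete, here is a minimal sketch. It assumes the "baseline" is just the mean and standard deviation of recent samples of one metric, which is far simpler than what a real APM product does. The function name `deviates_from_baseline` and the default cutoff of 3 standard deviations are my own illustrative choices, not anything taken from a specific tool.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, latest, threshold=3.0):
    """Return True when `latest` is more than `threshold` standard
    deviations away from the mean of `history`.

    `history` is a list of recent samples for one metric (for example,
    response times in milliseconds). The 3.0 default is an illustrative
    assumption, not a recommendation.
    """
    if len(history) < 2:
        # Not enough data to form a baseline yet, so stay quiet.
        return False

    baseline = mean(history)
    spread = stdev(history)

    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != baseline

    return abs(latest - baseline) / spread > threshold
```

With a history of response times hovering around 100 ms, a new sample of 101 ms would not raise an alert, while a sample of 250 ms would. Nobody had to decide up front that "250 is bad"; the history did that for us.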
Level 3 – That’s Right, We Bad!
(That’s a reference to the movie Stir Crazy.)
- We built a business AND technology dashboard so that everyone could see if there was any impact at any given time.
- All of our monitoring tools are integrated and provide a holistic view of the health of each component as well as the entire application.
- Whenever there is an application slowdown (or when we predict there will be a slowdown) due to spikes in user activity, our tooling automatically adapts and spins up new instances until the spike ends.
- When any of our application nodes are not working properly, our tooling automatically removes the bad node and replaces it with a new functional node.
- The data derived from our APM toolset is used by many different functional groups within the organization spanning both technology and business.
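The two automation bullets above (scaling with the workload and swapping out bad nodes) boil down to a loop that compares what you have with what you need. Here is a rough sketch of one pass of such a loop. Everything in it is hypothetical: the `Node` class, the `reconcile` function, the `make_node` factory, and the idea that each node handles a fixed number of requests per second are simplifying assumptions for illustration, not the behavior of any particular platform.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def desired_node_count(requests_per_sec, capacity_per_node,
                       min_nodes=2, max_nodes=20):
    """Nodes needed for the current load, clamped to [min_nodes, max_nodes]."""
    needed = math.ceil(requests_per_sec / capacity_per_node)
    return max(min_nodes, min(max_nodes, needed))


def reconcile(nodes, requests_per_sec, capacity_per_node, make_node,
              min_nodes=2, max_nodes=20):
    """Return the node pool after one pass of the control loop.

    Unhealthy nodes are dropped, then the pool is grown with
    `make_node()` or trimmed until it matches the desired count.
    `make_node` is a caller-supplied factory (hypothetical here) that
    stands in for whatever actually provisions a new instance.
    """
    target = desired_node_count(requests_per_sec, capacity_per_node,
                                min_nodes, max_nodes)

    # Remove bad nodes first so they never count toward capacity.
    pool = [node for node in nodes if node.healthy]

    # Scale out: replace removed nodes and absorb any spike in load.
    while len(pool) < target:
        pool.append(make_node())

    # Scale in: once the spike ends, keep only what is needed.
    return pool[:target]
```

In practice you would run something like this on a schedule, feeding it the request rate and health status that your monitoring already collects. The point is that the Level 3 behavior depends on trustworthy monitoring data; the automation itself is the easy part.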
You can probably identify with one or more of the maturity levels described above, but what’s really important is figuring out how to advance your capabilities so that you keep progressing to higher levels. Software tools are obviously part of reaching higher levels of maturity, but good processes and well-trained people are also critical components of success.
In Part 3 of this series we will discuss how to get started down the path of deploying APM in the enterprise so that you can advance your monitoring maturity and realize significant value in your business. I’m sure you can think of other comments or questions you have heard that relate to the APM maturity levels I’ve listed above. I’d love to hear your feedback in the comments section.