Welcome to Part 2 of the series “Deploying APM in the Enterprise”. In Part 1 I provided some background on what this series is all about and why you should continue reading. Part 2 will provide more foundation and my take on APM maturity. This will not be one of those boring theoretical maturity models but instead will come from the reality of fighting fires and utilizing APM tools in an Enterprise situation. Whether you’re in a small shop or a giant Fortune 100 company, these same principles will apply. The biggest difference should be the size of your budget (sorry small shops, the large enterprises tend to have a lot more money to work with).
I think about APM maturity in terms of the questions and comments you hear at each level of capability. Here’s an example… When my kid was 3 years old he asked “Where do babies come from?”. This is a very obvious indicator of where he belongs in the “Life Maturity Model”. Anyone asking this question probably belongs at that same level. An important aspect to remember about any maturity model is that you can be (and most organizations are) associated with multiple levels of maturity at any given point in time. So let’s take a look at my version of an APM maturity model!
Hirsch’s APM Maturity Model:
Level 0 – WTF Just Happened?
- We just got a bunch of phone calls that the Website/Application is slow. Really?
- CPU, memory, disk, and network all look great. Why is it still so slow?
- You start making phone calls or start/join a conference call asking:
  - Did you change anything?
  - Do you see anything in the log files?
  - Are we having network problems?
  - Can someone get the DBA on the line? It has to be the database!
- It just started working again. Did anyone change anything?
- Is it fixed?
- 3 AM phone call from the help desk... It’s broken again. Damn!
- How does a business transaction relate to IT infrastructure?
Level 1 – Ouch, too much information!
- Our new monitoring tool sure does provide a lot of data. Look at all these charts to spend hours digging through.
- It took a long time to set up all of those alert thresholds but I bet it will be worth it.
- Why so many alerts? Did everything break at the same time?
- Is anything really wrong? I don’t know, go test the site/app to find out.
- It worked great in dev/test/qa. What’s different about prod?
- We profiled our code in dev and it is still slowing down in prod. Why???
- It’s still slow for our customers? It looks fine from the office.
- Our APM tool is okay for testing but we wouldn’t dare use it in production.
- Does anyone know what dependencies exist between our applications?
- I heard about something called DevOps. Any idea what it is?
Level 2 – Whew, that’s getting better!
- We’re still getting a lot of alerts but now we know if apps are slow or broken.
- We don’t set alert thresholds very often; our tooling alerts us automatically when important metrics deviate from their baselines.
- Looks like some of the functions in our app are always slow. Let’s focus on optimizing the ones that are used the most or are the most important.
- We built a dashboard for our app to show when it gets slow or breaks.
- We can see everything going on in test and prod and know what’s different between environments.
- We know if any of our end users are impacted because we monitor every business transaction.
- Yep, the problem is on line 45 of the DoSomething method.
- We automatically deploy monitoring with our apps. It’s part of our build/release process.
- Our applications and their dependencies automatically get mapped by our tools. No need to guess what will break if we make a change.
- Wouldn’t it be cool if we could automatically react to that spike in workload so our site won’t slow down or crash?
- I wonder if the business felt any impact from that problem?
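To make the “deviate from their baselines” idea a little more concrete, here is a minimal sketch of baseline alerting. It is not how any particular APM product does it; the function name `deviates_from_baseline`, the 3-sigma cutoff, and the sample response times are all assumptions I made up for illustration. Real tools typically use far more sophisticated baselines (time of day, day of week, seasonality).

```python
from statistics import mean, stdev


def deviates_from_baseline(history, latest, max_sigmas=3.0):
    """Return True when `latest` is more than `max_sigmas` sample standard
    deviations away from the mean of `history` (hypothetical helper)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline yet
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != baseline
    return abs(latest - baseline) > max_sigmas * spread


# Made-up response times (ms) for one business transaction.
response_times_ms = [120, 118, 125, 122, 119, 121, 124, 120]

print(deviates_from_baseline(response_times_ms, 123))  # close to the usual range
print(deviates_from_baseline(response_times_ms, 480))  # far above the usual range
```

The point of the sketch is that nobody had to pick “alert at 300 ms” by hand. The threshold falls out of how the metric normally behaves.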
Level 3 – That’s Right, We Bad!
(That’s a reference to the movie Stir Crazy.)
- We built a business AND technology dashboard so that everyone could see if there was any impact at any given time.
- All of our monitoring tools are integrated and provide a holistic view of the health of each component as well as the entire application.
- Whenever there is an application slowdown (or when we predict there will be a slowdown) due to spikes in user activity our tooling automatically adapts and spins up new instances until the spike ends.
- When any of our application nodes are not working properly our tooling automatically removes the bad node and replaces it with a new functional node.
- The data derived from our APM toolset is used by many different functional groups within the organization spanning both technology and business.
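For the “automatically adapts and spins up new instances” bullet, the core decision can be sketched in a few lines. This is only an illustration under my own assumptions: the function `desired_instances`, the requests-per-second numbers, and the instance limits are hypothetical, and a real setup would hand the result to whatever provisioning tooling you already use.

```python
import math


def desired_instances(current, load_per_instance, target_per_instance,
                      min_instances=1, max_instances=20):
    """Return how many instances are needed so that each one carries at most
    `target_per_instance` load, clamped to the allowed range (hypothetical)."""
    total_load = current * load_per_instance
    needed = math.ceil(total_load / target_per_instance)
    return max(min_instances, min(max_instances, needed))


# Spike: 4 nodes each seeing 90 req/s, but we want about 60 req/s per node.
print(desired_instances(4, 90, 60))

# Quiet period: the same 4 nodes each seeing 20 req/s, never below 2 nodes.
print(desired_instances(4, 20, 60, min_instances=2))
```

Clamping to a minimum and maximum keeps a bad metric from scaling you down to nothing or up to a very large bill.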
You can probably identify with one or more of the maturity levels described above, but what’s really important is figuring out how to advance your capabilities so that you can keep progressing to higher levels. Software tools are obviously part of reaching higher levels of maturity, but good processes and well-trained people are also critical components of success.
In Part 3 of this series we will discuss how to get started down the path of deploying APM in the enterprise so that you can advance your monitoring maturity and realize significant value in your business. I’m sure you can think of other comments or questions you have heard that relate to the APM maturity levels I’ve listed above. I’d love to hear your feedback in the comments section.
It’s interesting as a parent watching your kids grow up and learn how to do things for themselves. I have 2 boys ages 6 and 7 and they often amaze me and confound me. There are times when I think they are absolutely brilliant and other times where I think that significant amounts of remedial education may be in their futures.
Duct Tape, and Knives, and Hammers Oh My
I recall a time not long ago when my kids first discovered the wonders of duct tape. What a magical substance. They really thought that you could fix anything with duct tape. Broke a lamp? Duct Tape! Cut your finger? Duct Tape! Building a boat out of sticks? Duct Tape! (Seriously, my oldest boy actually tried this.)
After getting some less than stellar results with duct tape in many situations the next wonder tool discovery was the knife. Wow, knives were amazing! They could cut your pieces of duct tape, create arrows and marshmallow holders from sticks, and generally make holes in lots of things. But alas, they eventually discovered that knives didn’t produce the blunt force required for certain jobs … but hammers did!
I’m not going to get into the gory details but let’s just say that many things got bashed beyond recognition during the “hammer phase”. Thankfully no animals were injured during this time period and our dog was smart enough to go into hiding for a few weeks until the heat blew over.
Let’s Get Real
What do these little anecdotes have to do with deploying APM in the Enterprise, you ask? Great question. In most enterprise environments there is a wealth of monitoring tools that have either been built or bought. In many cases these tools are sitting around as shelf-ware or are only performing a limited subset of what they can really do. Part of the problem is that in most organizations there is little time or tolerance for learning from our mistakes. With this in mind I am going to write a series of blog posts that describe my experiences of taking a large enterprise organization from APM infancy to a level of maturity most organizations only dream about.
The next question I hope you’re asking is… Why should I listen to this guy?
Another excellent question, you’re on the ball today! If you’ve read my other blog posts you might already know that I worked for a large financial services company as a monitoring architect. I was brought into the Investment Banking division to help reduce the number of incidents that were impacting end users. In a few months I (with help from many people) was able to stop the bleeding and get the organization headed down the path of APM maturity. Within a couple of years our organization was proactively seeking out and fixing performance bottlenecks as well as dynamically adapting to changes in workload demands. All in all I worked in that role for about 5 years and departed with things running very smoothly. Many applications saw significant improvement in overall response times and the number of customer impact incidents decreased by about 90%.
I learned a lot of lessons during those years. There were many examples of success but also some failures. Those failures taught us valuable lessons that I will pass along to you over this series of blog posts. So here is a rough approximation of the topics that I plan on covering:
- APM Maturity (not the same old boring model)
- Where do you start?
- Deploying APM (It’s about more than just the software)
- Alerts done right!
- Spreading the love. (Getting high levels of adoption)
- Dashboards and Reports
- Staying relevant over time.
Don’t Stab Your Brother, You Could Kill Him!
In my effort to be a good father I let my kids explore their world and give them more guidance when I think they might hurt themselves or others. It’s okay if they cut the tip of their finger with the knife but completely different if they were to stab someone with the knife. I share my knowledge with them so they won’t end up having really bad experiences but the finer details are left up to them to figure out. The same goes with this series of posts. I want to share knowledge to prevent you from stabbing your company in an artery but there are many fine points that you will need to discover as you progress on your journey.
Join me on Thursday for the next installment in this riveting series (APM Maturity), and try not to stab anyone in the meantime.