In most organizations, managing the service level of critical applications is still a challenge. For some, the cause is a lack of strategic planning; for others, it's a failure to apply the proper tools and methodology to their everyday work. Regardless of the reason, there are steps every organization can take to avoid costly and damaging service disruptions.
We’ve Stopped Making Money
One day, while working at an investment bank, I got a phone call requesting my help (it was really more like a plea and an order at the same time) with troubleshooting a business-critical application. I had even heard chatter in the office about how this application had been unusable for days before I was asked to participate. My role at the bank was monitoring architect: I tested, reviewed, purchased, and on-boarded new tools, among other responsibilities. As a result, I was one of the people who got a phone call when a difficult problem had gone unsolved for too long, so that I could apply my tools and expertise.
This was a time of great instability in the stock market, and our traders were very active. It was also the time when the traders needed this particular application the most, and when the bank should have been earning a small fee on every transaction completed through it. Simply put, the bank was losing millions of dollars while this application performed so poorly.
I started my work with the development team by getting a breakdown of the problem, the conditions leading up to it, an overview of the technology, and a demonstration of the problem being recreated in a test environment. Next, I deployed application monitoring tools into their test environment (until then they had only basic OS monitoring and the data coming from their load test tool) and watched as they ran more load tests. I could see certain parts of their code degrading as load ramped up, which led me to ask a lot of questions about the logic behind those parts of the code.
I worked with the development team for 2 days, asking questions, watching the mental light bulbs go on in their eyes, and testing the new code they feverishly created each time a bottleneck was removed and a new one discovered. After all was said and done, the application was upgraded in production at the end of my 2 days of involvement. Now capable of handling 5 times the original throughput, it helped the traders do their jobs, and, most importantly, the bank was ringing the cash register again on every transaction.
Strategic Planning
The worst part is that the situation could have been completely avoided. By following a few key rules, the application team could have detected this problem in its infancy and minimized, or avoided entirely, the lengthy and embarrassing production impact.
- Where the rubber meets the road – Application performance monitoring IN PRODUCTION is a requirement for any business-, mission-, or revenue-critical application.
- Dev and QA monitoring – Using application monitoring tools in PRE-PRODUCTION will dramatically improve the quality of production releases.
- Feedback loop – Constantly apply the information gained in production to your pre-production environment. Use the production loading and performance patterns shown by your monitoring tools to prioritize development work and to create more realistic load tests (see the first sketch after this list).
- Collaboration is king – Development AND Operations personnel should have access to and use the same monitoring tools during load tests to gain the most benefit.
- Think strategic instead of tactical – Implement a well-thought-out monitoring and management strategy, starting with your most critical revenue-generating applications and working down from there (after rigorous testing, of course).
- Identify and fix small problems before they turn into big problems – Alerting should be based on deviation from normal (baseline) behavior in most situations. Minimize the number of static thresholds you use to trigger alerts and invest in analytics-driven monitoring platforms; static thresholds should mostly be reserved for identifying service level breaches (see the second sketch after this list).
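To make the feedback loop concrete, here is a minimal sketch of turning observed production traffic into a more realistic load test profile. It assumes a hypothetical CSV export from your production monitoring tool (prod_requests.csv with an endpoint column, one row per sampled request); the file name, column name, and target rate are illustrative assumptions, not any specific vendor's format.

```python
import csv
from collections import Counter

def build_load_profile(metrics_csv, target_rps):
    """Turn the production traffic mix into a load-test profile.

    Returns (endpoint, requests_per_second) pairs whose mix matches
    what production actually saw, scaled to a total of target_rps.
    """
    counts = Counter()
    with open(metrics_csv, newline="") as f:
        # Hypothetical export: one row per sampled production request.
        for row in csv.DictReader(f):
            counts[row["endpoint"]] += 1

    total = sum(counts.values())
    return [
        (endpoint, target_rps * count / total)
        for endpoint, count in counts.most_common()
    ]

if __name__ == "__main__":
    # Replay production's mix at well above observed peak to leave headroom.
    for endpoint, rps in build_load_profile("prod_requests.csv", target_rps=500):
        print(f"{endpoint}: {rps:.1f} req/s")
```

Driving your load tool with weights derived this way keeps pre-production tests honest as production usage patterns shift.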
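And here is a minimal sketch of baseline-deviation alerting using a simple rolling z-score, with a static threshold kept only for the hard service level ceiling. Real analytics-driven platforms model seasonality and trend far more carefully; the window size, sigma multiplier, and 2000 ms SLA value below are illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev

class BaselineAlerter:
    """Alert on deviation from recent (baseline) behavior rather than
    relying on fixed thresholds for everything."""

    def __init__(self, window=60, sigmas=3.0, sla_ceiling=None):
        self.samples = deque(maxlen=window)  # recent metric values
        self.sigmas = sigmas                 # how far from normal is "abnormal"
        self.sla_ceiling = sla_ceiling       # static threshold, SLA breaches only

    def observe(self, value):
        alerts = []
        # Static threshold reserved for hard service level breaches.
        if self.sla_ceiling is not None and value > self.sla_ceiling:
            alerts.append(f"SLA breach: {value:.0f} > {self.sla_ceiling:.0f}")
        # Dynamic baseline: compare against what has been normal lately.
        if len(self.samples) >= 10:
            mu, sd = mean(self.samples), stdev(self.samples)
            if sd > 0 and abs(value - mu) > self.sigmas * sd:
                alerts.append(f"Deviation: {value:.0f} vs baseline {mu:.0f}±{sd:.0f}")
        self.samples.append(value)
        return alerts

# Example: response times jump long before the 2000 ms SLA is broken,
# so the deviation alert fires while the problem is still small.
alerter = BaselineAlerter(window=60, sigmas=3.0, sla_ceiling=2000)
for ms in [100, 105, 98, 102, 101, 99, 103, 100, 97, 104, 350]:
    for alert in alerter.observe(ms):
        print(alert)
```

The point of the split is exactly the rule above: the baseline catches the 350 ms anomaly days before anyone would notice it against a 2000 ms static threshold.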
The reality of the 6 points outlined above is that it takes some initial effort to make the required organizational and process changes and to get the right tools in place. The fact remains, though, that the investment is well worth it for business-critical applications. I've seen so many groups decide they don't have enough time to invest in strategic initiatives, and then they constantly run around firefighting battles that should have been avoided in the first place. It's a vicious cycle that needs to end. Consider the tips listed above and break the cycle starting right now.