CAT | Agile & DevOps
One of my colleagues this week was consolidating the results from our recent Application Performance Management survey, and one interesting finding was that 40% of customers have at least one release cycle a month. Out of those respondents, one third experience a Severity-1 incident each month as well. That’s a pretty compelling pair of statistics, and they might explain the continued frustration and conflict between development and operations teams. It’s also perhaps the reason why this DevOps underground movement can no longer be ignored (even by Gartner). There is no doubt development organizations have become agile, but does deploying this frequent change make the business more or less agile? For example, if one in three releases creates a Severity-1 incident, then surely agile development becomes a risk to the business. We’re at the point where Operations either has to start managing change better or simply restrict the amount of change that can occur.
So why are Sev 1 incidents so common? Based on my experiences and customer interaction, I’d strongly argue that testing in development isn’t enough. At the very least, it’s certainly not an insurance policy for deploying an application in production. When a Formula 1 team designs a car in a wind tunnel and tests it on a simulator pre-season, they don’t assume that the performance they see in test will mirror the results they see in a race. Yet, that is pretty much what happens today in the application development lifecycle. Development teams build and test their apps in pre-production before handing it off to operations for deployment in production, and they assume everything will work just fine. This is probably the worst assumption IT has made over the last decade, because development and production environments differ significantly. It’s also a lame excuse for any development team to use when a production issue occurs: “Well it ran fine in test so you must have deployed it wrong.” Yes people make mistakes occasionally, but if one in every three releases has an issue, deployment error may not be the sole reason. If development never get to see how their baby runs in production, they’ll never learn how to build robust, scalable, and high-performance applications.
On Wednesday I delivered a keynote at WJAX in Munich. Everything went really well, but I was a little shocked at the response I got when I asked the audience “How many of you monitor the performance of your apps in production?” As I scanned the audience, I counted 9 out of ~950 developers had put their hands up, meaning about 1% had visibility of how their applications actually performed in production. I know what you’re thinking: “But isn’t application performance in production the responsibility of Operations?” Well, it is and it isn’t. Most organizations think that when an application has an issue, it’s related to the infrastructure it runs on. That’s like saying when a car crashes, it’s because a part failed on the car whereas in actual fact most accidents are caused by the driver. Yes, hardware fails occasionally, but application logic and configuration drives how infrastructure resource is used, which is why most issues today occur when new code is deployed in production.
The majority of us in IT are specialists, with the exception of a few VPs of engineering who are “special” in their own “special” world of being “special.” What I mean by this is that no single person has the skills or experience to do everything well in IT. IT is too big for me to explain or summarize in a few words, other than it requires a lot of different people with different skills to make it tick along. Despite applications being the living breathing entities of the business, a large portion of folk in IT have little context of how applications are built, how they execute, and how they consume resource across the IT infrastructure. Many people simply don’t care as their responsibilities are completely void of anything application related. That’s fine–but the reality is that everyone in IT should have one eye on the business. The whole reason IT exists is so the business can be more competitive and make more money. If this happens, IT gets more budget and is allowed to innovate more. IT and the business need each other to survive, which is why when applications slow down or break, both parties bitch at each other.
Operations need better visibility
Unfortunately for both the business and IT, the people (Operations) who manage the performance and availability of applications in production aren’t application experts. They are also not stupid either; their skills sets are wide and broad across many technologies and platforms that underpin applications. They manage a lot of things that application developers take for granted, like networks, databases, storage and virtualization. While Operations monitor the health of these infrastructure components, they often get bombarded with crap from the business when end users and business transactions are being impacted by slow performance, despite all system monitoring showing everything is fine. This lack of understanding between the Business and Operations is because both parties see things from different perspectives.