Applications were failing long before Cloud came along


I’m fed up with reading about Cloud outages, largely because all applications are created and managed by the most dangerous species on the planet: the human being. Failure is inevitable in everything the human being creates or touches, and for this reason alone I see no news in seeing the word “outage” in IT articles, with or without Cloud mentioned.

What gets me the most is that applications, infrastructure and data centers were slowing down and blowing up long before “Clouds” became fashionable. They just didn’t make the news every other week back when applications resided in “data centers” (ah, the good old days). Just ask anyone who works in operations or help desk/app support whether they’ve ever worked a 38-hour week; I’d guess the vast majority will either laugh or slap you. If everything worked according to plan, IT would be a really dull place to work, the help desk would be replaced with an OK desk, and we’d have nothing to talk about in the office or at the pub.

So why do application outages still occur?

Here are my top five reasons:

1. Building and supporting applications isn’t easy and humans make mistakes

2. IT Operations are responsible for High Availability of Infrastructure

3. Development are not responsible for High Availability of their application (that’s down to App Support).

4. Lack of fault tolerance in application design and architectures

5. Being reactive is an adrenaline rush

For me, planning for failure starts right at the beginning of application design, with high availability nailed firmly into the non-functional requirements (as boring and crap as that sounds). Remember when that guy with the beard and brown jumper lectured you on high availability, fault tolerance and disaster recovery during your computing degree at college? Those lectures turned out to be sound advice. It’s just that most people were either asleep, hungover or at the pub.

I watched a pretty scary high-availability technique in use a few years back. A sys admin for one of the UK’s largest mobile retailers basically watched a prstat console on his desktop and periodically restarted application servers if their CPU utilization stayed at 100% for more than 30 seconds. What was interesting on this occasion was that their application architecture didn’t implement any form of session replication, so the uber-keen sys admin was effectively killing the sessions of hundreds of users who were shopping on the retail website. This is a simple example of how a lack of fault tolerance can hurt your application and its business.
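To make the session-replication point concrete, here is a minimal sketch (the class and attribute names are mine, purely illustrative): container-managed session replication, such as a Tomcat cluster, can only copy a session to another node if everything stored in it is serializable. Get that right and a restarted server costs users nothing; get it wrong and every restart empties shopping carts.

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative shopping-cart state. For container-managed session
    // replication (e.g. a Tomcat cluster) every attribute stored in the
    // HttpSession must be serializable; otherwise it cannot be copied to
    // a surviving node and dies with the server that held it.
    public class ShoppingCart implements Serializable {

        private static final long serialVersionUID = 1L;

        private final List<String> skus = new ArrayList<>();

        public void add(String sku) {
            skus.add(sku);   // survives a node restart only if replicated
        }

        public List<String> items() {
            return new ArrayList<>(skus);
        }
    }

In a servlet you’d park this in the session with something like request.getSession().setAttribute("cart", cart); with replication configured, a killed app server becomes an inconvenience rather than a lost sale.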

So where am I going with this? Well, many organizations are thinking about moving their apps to the Cloud so they can leverage the business agility and cost benefits it promises. As attractive as this sounds, you need to consider an important risk with Cloud, and that is Quality of Service (QoS). Most Private and Public Cloud providers are very coy about what QoS they actually provide, and as we’ve seen over the past year, many providers can and will suffer periods of server and/or network unavailability. This might sound like the end of the world for Cloud, but actually it isn’t. Chances are your data center, with its many servers and networks, has suffered periods of downtime at some stage. The difference with Cloud is that someone else is responsible for putting it right. Therefore, planning for failure in the Cloud should be no different from planning for failure in your data center.

Planning for Failure

When you design, architect and develop an application, you’ve got to assume it WILL fail no matter where you deploy it. Building fault tolerance into both your application AND its infrastructure dependencies is critical to any high-availability need. Applications process and manage data on multiple servers, distributed and accessed across multiple networks, data centers, clouds and locations.
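Here is one small example of what “assume it WILL fail” looks like in code: a retry-with-exponential-backoff helper, sketched in Java. The names are mine and illustrative; the point is that a transient failure becomes a pause rather than an outage.

    import java.util.concurrent.Callable;

    // Minimal retry-with-backoff helper: a sketch of one fault-tolerance
    // building block, not a production library. Assumes maxAttempts >= 1.
    public final class Retry {

        public static <T> T withBackoff(Callable<T> call, int maxAttempts,
                                        long initialDelayMillis) throws Exception {
            long delay = initialDelayMillis;
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return call.call();    // success: hand the result back
                } catch (Exception e) {
                    last = e;              // remember the most recent failure
                    if (attempt == maxAttempts) {
                        break;             // out of attempts, give up
                    }
                    Thread.sleep(delay);   // back off before the next try
                    delay *= 2;            // double the wait each time
                }
            }
            throw last;                    // surface the final failure
        }
    }

A call such as Retry.withBackoff(() -> fetchPrices(), 3, 200), where fetchPrices is a stand-in for whatever remote call your app makes, rides out a brief network blip that would otherwise bubble up as a user-facing error.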

So what happens to your application when any of the following occurs:

  • A storage device fails
  • A server hangs or fails
  • A network fails
  • A data-center location fails
  • A cloud fails

Is your application smart enough to continue through any one of these failures? Does it have fault tolerance built in so it can recover by itself, or does it require IT staff to shout and work their butts off to get everything back to normal? Can your application really recognize and tolerate failure?
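For several of the failures above (a server, a network path, even a whole zone), the honest answer can be made “yes” with surprisingly little code, provided the application knows about more than one copy of each dependency. A hedged sketch, assuming the same service is reachable at several addresses (the endpoint list and names here are invented for illustration):

    import java.util.List;
    import java.util.function.Function;

    // Try each replica in turn and return the first successful answer.
    // The endpoint list might span servers, data centers or cloud
    // availability zones; the pattern is the same at every level.
    public final class Failover {

        public static <T> T firstHealthy(List<String> endpoints,
                                         Function<String, T> call) {
            RuntimeException last = null;
            for (String endpoint : endpoints) {
                try {
                    return call.apply(endpoint);  // healthy replica found
                } catch (RuntimeException e) {
                    last = e;                     // replica down, try the next
                }
            }
            throw new IllegalStateException("all replicas failed", last);
        }
    }

Combine this with the retry helper above and a storage, server or network failure degrades into a slightly slower request instead of a help-desk ticket.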

Cloud Providers aren’t the problem

In April of this year, one of Amazon’s availability zones had issues with its storage service. For customers who had mission-critical apps deployed in that specific availability zone, business stopped until Amazon fixed the problem. For customers who had built their applications with fault tolerance across multiple Amazon availability zones, their apps continued to run as normal. Amazon and the reputation of the public cloud ended up getting a kicking for this. What’s funny (and tragic, I guess) is that whilst this cloud outage was happening, many applications around the world were also crashing in perfectly normal on-premise data centers (or private clouds, as they are now called). Outages don’t just relate to Clouds: they’ve been part of applications for as long as applications have existed.

Application Performance Management solutions can monitor application QoS and rapidly detect and resolve application outages, but they don’t replace the need for organizations to become more proactive when it comes to planning for failure and building fault tolerance into their applications. As a famous English poet once said, “Hope for the best and plan for the worst.”

App Man.

 

  • Fabian Lange (http://twitter.com/CodingFabian)

    Great post. I recently came across the term “crash-only software”.
    It describes software that can recover itself fully automatically.
    The design goal of such software is extreme: expect that you cannot rely on anything, and assume you will crash seconds after startup.

    While this is very extreme, there are so many great options for good software that repairs itself; it’s a shame that so many apps are built that need a full operations team and a day to recover from minor issues.

  • Anonymous

    I think the cloud took on this image that the underlying technology is of a new order of advancement and engineering, and so is not susceptible to the complexities and challenges of large-scale computer infrastructure. It’s important (and healthy) for people who use the cloud to remember that the “cloud” they build their application on is a room full of servers, just like it has been for 25+ years.
