Planning for Failure with Application Performance Management (APM)

August 15, 2011
 


If you are running mission-critical applications and do not have a strategy to deal with failure, you are putting your whole organization at risk. You may think that your application cannot fail, but at some point everything fails. It may not be the software running your application that fails – it could be the hardware, the network, or even a natural disaster in your area that causes your application to go down. In case of such a failure, no matter how rare, your customers will still expect the same level of service, not to mention preservation of their data. Without a failover strategy and a tested backup infrastructure you will be out of service for an unknown period of time, which will lead to angry customers and loss of revenue.

Most of you have some failover strategy in place. I’m sure many of you have spent large amounts of time and money ensuring that your application is resilient to failure because you understand your app’s importance. But you may still be missing one key component, without which you are still at risk. That component is monitoring, or more specifically, Application Performance Management (APM). While mission-critical applications rely on an APM system to help monitor application performance and health, APM is often forgotten on the failover systems. APM needs to go hand-in-hand with any failure testing plan to ensure that your company’s strategy will work in case of a real emergency.

Now that you’ve agreed to incorporate APM into your failure testing plan, how do you actually make this work?  Your application already has an APM solution in place. But now you need to install APM on the backup system, which may not be running (or may not even exist yet). Which solution you choose depends on whether your application is in a data center or in the cloud. Let’s explore each of these scenarios.

Companies that run their applications in one or more data centers usually have a backup data center in a different geographical region. If the primary data center fails, the backup data center will handle the entire application load. But will your APM system continue to work as it did on the primary system? If you haven’t installed and configured your APM solution in your backup environment, it won’t. You will need to install agents from your APM vendor on all of the machines in the backup data center, and then configure them to match what you are monitoring on your primary system. That requires that you bring up all your backup systems and drive load through them, so you can see that activity in your APM system. Only then can you be sure the APM system is ready in case of failure.
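One simple way to confirm that every machine in the backup data center is covered is to compare your host inventory against the hosts your APM system has actually heard from while you drive load. The sketch below assumes you can export both lists somehow; the function name and the host names are hypothetical and not tied to any particular APM vendor.

```python
def missing_agents(expected_hosts, reporting_hosts):
    """Return the backup hosts that have not reported to the APM system.

    expected_hosts:  every machine that should be monitored (hypothetical
                     inventory export from the backup data center).
    reporting_hosts: hosts the APM system has seen data from during the
                     load test (hypothetical export from the APM system).
    """
    return sorted(set(expected_hosts) - set(reporting_hosts))


if __name__ == "__main__":
    backup_inventory = ["app-01", "app-02", "db-01"]
    seen_by_apm = ["app-01", "db-01"]
    for host in missing_agents(backup_inventory, seen_by_apm):
        print(f"No APM data from {host}: check the agent install and config")
```

An empty result only tells you the agents are reporting; you still need to check that what they monitor matches your primary system.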

For applications running in the cloud, things are a bit different. In most cases there isn’t a backup system already running. Instead, if a part of the cloud fails, the application must be architected to spin off new nodes in a different region of the cloud that is independent of the failure. As we all learned during the recent Amazon mega failure (#cloudfail), companies must be very careful to understand which parts of the cloud are interdependent. In the case of Amazon’s elastic cloud, moving to a different “availability zone” was not enough to prevent failure – instead, only those who could move to a different geographic region were safe.

Assuming the nodes you spin off in a different region use the same machine image as your primary system, your APM system should not require any further installation or configuration. But that’s the easy part. Problems arise when, during a cloud failure, you rapidly start spinning off new nodes in a different region. This is basically equivalent to a sudden massive burst of new nodes in your system. New nodes coming online create lots of overhead on the monitoring system, which struggles to register the existence of the new nodes and all of their related data. Companies running very large applications have found that most monitoring systems become unavailable for hours or even crash during such sudden bursts. Of course, a failure is exactly the time you really want your APM system up and working to monitor the success of your failover plan. This is something to keep in mind when choosing an APM vendor – make sure your vendor has a track record in large cloud applications.

Now you need to actually test all this to make sure everything works as expected. No amount of simulation or architectural review can substitute for a live test run. Be sure you test your failover plan on a regular basis. I’ve seen many companies spend weeks in failover emergency hell because their first attempt at failover didn’t work as expected. But even after you’ve resolved all your breakages in the failover, you are not done. Just because you are pushing load to the backup system, how do you know if it is handling that load successfully? Has the failover impacted application performance for your customers? Has the failover impacted reliability and availability? There’s only one way to really know, and that’s from your APM solution’s data. Don’t just fail over, but run all your customer load on the backup system, and not just for a night. Run your entire customer load on your backup system for at least a week to experience all the different workload patterns that vary over the day and week. Monitor your system with an APM solution and see if its performance is the same as you had before the failover. Carefully compare the data, find any disparities, then fix and retest. Make sure your APM system supports baselines so you can track your performance based on your system’s baseline performance for any time of the day or week.
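To make the baseline comparison concrete, here is a minimal sketch of the idea: bucket response times by hour of the week, then flag the hours where the backup system is slower than the primary baseline. It assumes you can export (hour of week, response time) samples from your APM system. The function names, the sample format, and the 10% tolerance are illustrative assumptions, not how any particular APM product computes baselines.

```python
from collections import defaultdict
from statistics import mean


def hourly_baseline(samples):
    """Average response time per hour of the week.

    samples: iterable of (hour_of_week, response_ms) pairs, where
             hour_of_week runs from 0 to 167 (assumed export format).
    """
    buckets = defaultdict(list)
    for hour, response_ms in samples:
        buckets[hour].append(response_ms)
    return {hour: mean(values) for hour, values in buckets.items()}


def find_regressions(primary_samples, backup_samples, tolerance=0.10):
    """Return {hour: (baseline_ms, backup_ms)} for hours where the backup
    system is more than `tolerance` slower than the primary baseline."""
    baseline = hourly_baseline(primary_samples)
    current = hourly_baseline(backup_samples)
    regressions = {}
    for hour, backup_ms in current.items():
        if hour in baseline and backup_ms > baseline[hour] * (1 + tolerance):
            regressions[hour] = (baseline[hour], backup_ms)
    return regressions


if __name__ == "__main__":
    primary = [(0, 100), (0, 100), (1, 200)]
    backup = [(0, 105), (1, 250)]
    for hour, (base, now) in sorted(find_regressions(primary, backup).items()):
        print(f"hour {hour}: baseline {base} ms, backup {now} ms")
```

Comparing hour by hour is the reason to run for a full week: a backup system that looks fine at 3 a.m. on Sunday may fall over during the Monday morning peak.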

You may be a bit skeptical at this point. A week seems like overkill, you think. Do I really need to run on the backup system for a week? You’d be surprised how many problems companies have when they run in a failover environment. Alerts are not set up, backups run on the wrong system, configuration files have not been maintained, software is outdated, and so on. Plenty of companies spend more than a week flushing out all these issues, so don’t be in a hurry to declare success and fail back.
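Outdated software and unmaintained configuration are easy to catch before a failover if you compare the two environments regularly. As a rough sketch, assuming you can collect a name-to-version mapping from each environment (the function name, package names, and versions below are made up for illustration):

```python
def version_drift(primary, backup):
    """Return {name: (primary_version, backup_version)} for every item
    whose version differs or that exists in only one environment.

    primary, backup: dicts mapping a package or config name to a version
                     string (hypothetical inventory from each system).
    """
    drift = {}
    for name in sorted(set(primary) | set(backup)):
        primary_version = primary.get(name)  # None if absent on primary
        backup_version = backup.get(name)    # None if absent on backup
        if primary_version != backup_version:
            drift[name] = (primary_version, backup_version)
    return drift


if __name__ == "__main__":
    primary = {"appserver": "2.4", "jdk": "1.6.0_26", "apm-agent": "3.1"}
    backup = {"appserver": "2.3", "jdk": "1.6.0_26"}
    for name, (p, b) in version_drift(primary, backup).items():
        print(f"{name}: primary={p} backup={b}")
```

A check like this will not find alerts that were never set up or backups pointed at the wrong system, which is why the week on the backup system still matters.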

By now you should see that testing for failure must include APM to be successful. Make sure you are using the right APM system – one that makes this testing easier, not harder. Your APM system should be easy to install and require little configuration so that changes in your application are automatically discovered and monitored, both on your primary system and on your backup. Failure of any system is never a good thing, but with proper planning and the right tools it should have no impact on your customers and your business.

Boris.

 

 

 

 

Sandy Mappic
