Migrating to the Cloud: Let APM Analytics Show You the Way

Transitioning distributed applications to the fast-changing environment of the cloud is a complex and risky process.  How do you move out of the safety of your data center, where you have successfully run for years (if not decades), without sacrificing the performance of your applications?

There is no easy way to predict your application’s performance on cloud resources.   With such technical names as “small,” “medium,” and “large,” how can you even begin to estimate capacity needs for your application?  The cloud is a mysterious place with computing resources that bear no resemblance to the systems in the data center, and a successful transition will require arduous analysis.

But if you use Analytics on top of your rich Application Performance Management (APM) data, you can do all this in a few straightforward steps.  By comparing performance data from your current application to that from your test environment in the cloud, you can accurately estimate your cloud capacity requirements.

There are some key prerequisites before moving forward.  First, you need access to performance data from your APM system (you do have an Application Performance Management system, don’t you?).  Second, you need to test your app with simulated load on the cloud environment that you intend to use.  Most importantly, your APM system must run on both the old and new environments and record the performance characteristics of each.

Our goal is to compare the capacity of a machine in our data center against the capacity of computing nodes in the cloud to discover just how many nodes we’ll need to match the performance of our data center. To do this, we compare the throughput of a single application running in our data center against the throughput for the new cloud environment.

The unit of measurement to use for this is RATU — the Response Acceptable Throughput per computing Unit.  Let me explain.

The graph below shows measured Response-Times at varying levels of load (Throughput). Once we decide what Response-Time is currently acceptable for our application, we can find the largest load that meets this limit within, say, 2 standard deviations.  Now simply divide this throughput number by the number of machines the application is running on. (Note: for simplicity I’m assuming the hardware is uniform, but if it’s not then you can compute a weighted average.)
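As a rough illustration of the calculation just described, here is a minimal Python sketch. It is not AppDynamics code: the function name `ratu`, the shape of the `samples` data, and the numbers are all hypothetical. It also assumes one particular reading of "within 2 standard deviations", namely that the mean response time plus two standard deviations must stay at or under the acceptable limit.

```python
from statistics import mean, stdev


def ratu(samples, max_response_ms, machines, k=2.0):
    """Response Acceptable Throughput per computing Unit for one transaction.

    samples: maps a throughput level (e.g. calls per minute across all
             machines) to the response times in ms measured at that level.
    max_response_ms: the response time we consider acceptable.
    machines: number of (uniform) machines serving the load.
    k: number of standard deviations of headroom (assumption: 2).
    """
    acceptable = []
    for throughput, times in samples.items():
        spread = stdev(times) if len(times) > 1 else 0.0
        # Assumed interpretation: mean + k * stdev must not exceed the limit.
        if mean(times) + k * spread <= max_response_ms:
            acceptable.append(throughput)
    if not acceptable:
        return 0.0
    # Largest acceptable load, divided evenly across the machines.
    return max(acceptable) / machines


# Hypothetical measurements: throughput level -> response times (ms).
samples = {
    1000: [100, 110, 120],
    2000: [150, 160, 170],
    3000: [300, 400, 500],
}
result = ratu(samples, max_response_ms=250, machines=4)
print(result)
```

With these made-up numbers, the 3,000 level breaches the 250 ms limit, so the largest acceptable load is 2,000, or 500 per machine.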

Now, as the application isn’t just one atomic unit, you’ll need to measure the throughput for each high-level transaction. Perform this calculation for every Business Transaction so that each business-facing part of your application is accounted for. Then calculate the RATU for each relevant Business Transaction and take the average. You can customize this process a bit by excluding some transactions, or by taking a weighted average based on the importance or volume of each Business Transaction.
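The averaging step might look something like the sketch below. The helper `average_ratu`, the transaction names, and the weights are all invented for illustration. Leaving a transaction out of `weights` is the assumed way of excluding it.

```python
def average_ratu(ratu_by_transaction, weights=None):
    """Average per-transaction RATU values, optionally weighted.

    ratu_by_transaction: maps a Business Transaction name to its RATU.
    weights: optional map of name -> weight (e.g. importance or volume).
             Transactions missing from weights are excluded (assumption).
    """
    if weights is None:
        # Plain average: every transaction counts equally.
        weights = {name: 1.0 for name in ratu_by_transaction}
    included = [name for name in ratu_by_transaction if name in weights]
    total_weight = sum(weights[name] for name in included)
    if total_weight <= 0:
        raise ValueError("no transactions with positive weight")
    weighted_sum = sum(ratu_by_transaction[name] * weights[name] for name in included)
    return weighted_sum / total_weight


# Hypothetical per-transaction RATU values.
per_transaction = {"Login": 400.0, "Search": 600.0, "Checkout": 200.0}

plain = average_ratu(per_transaction)
# Weight Search three times as heavily as Login and leave Checkout out.
weighted = average_ratu(per_transaction, {"Login": 1.0, "Search": 3.0})
print(plain, weighted)
```

A volume-based weighting keeps a rarely used transaction from pulling the average as hard as a heavily used one.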

Now run the same analysis (calculating the RATU) on your test environment in the cloud.

To get accurate numbers, you’ll want to take a measurement over a long enough time to cover at least one full workload cycle, if not more.  As a rule of thumb I’d recommend a month of data, as that should cover most variations in the application’s lifecycle.

You should now have two numbers: the RATU for the old data center environment and the RATU for the cloud. Simply divide the first by the second to see how many cloud units you’ll need for each data center machine to achieve the same performance as your current environment.

Let’s try an example using Amazon’s Elastic Compute Cloud (EC2). Say we are looking at an “Asset Tracking” application. We have 100 large computers in our data center running this application. We also ran a test of this same application in the cloud, on 20 “large” EC2 nodes. We’ve analyzed one month of data and found that our data center RATU is 10,000, while our RATU in the cloud is 1,000.  Therefore: 10,000 divided by 1,000 = 10.

This means we need 10 times as many cloud units as the data center hardware units we currently use to achieve satisfactory throughput. Since the application runs on 100 data center machines, we now know that we need 1,000 “large” nodes if we want to achieve the same throughput when we migrate to the cloud. (Make sure you have enough money for all those nodes before you start!)
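The arithmetic in this example is small enough to check by hand, but it can also be written as a short Python sketch. The function name `cloud_nodes_needed` is made up for illustration. Rounding up to a whole node is an added assumption, since you can’t provision a fraction of one.

```python
import math


def cloud_nodes_needed(datacenter_ratu, cloud_ratu, datacenter_machines):
    """Estimate how many cloud nodes match the data center's throughput."""
    if cloud_ratu <= 0:
        raise ValueError("cloud RATU must be positive")
    # Ratio of per-unit capacity, scaled by the current machine count,
    # rounded up to a whole node (assumption).
    return math.ceil(datacenter_ratu * datacenter_machines / cloud_ratu)


# Numbers from the Asset Tracking example above.
nodes = cloud_nodes_needed(datacenter_ratu=10_000, cloud_ratu=1_000, datacenter_machines=100)
print(nodes)  # 1000
```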

You may be saying to yourself, “Wow, this would be great, if only my APM system actually had analytics.” Lucky for you, here at AppDynamics we provide our own Analytics for this type of analysis. All you have to do is specify a few parameters and you’ll have your data. Give it a try!