Application Runbook Automation – A Detailed Walk Through


On Monday, AppDynamics announced a new feature called Application Runbook Automation (RBA). The response to this announcement has been great, and many people want to see the details of how we implement RBA within AppDynamics. If you attended one of our customer webinars for the AppDynamics 3.7 Sneak Peek, you got to see RBA in action during a live demo. Here is a link to the webinar recording in case you want to see for yourself. After all, a video is worth a million words. Otherwise, I’m going to walk you through it step by step in this blog post.

If you don’t already know WHY we built Application RBA please read “Don’t Be An “Also-Ran” – Application Runbook Automation for World Class IT”

Let’s jump right in…

I have an application that is tuned perfectly for normal load but every once in a while I get a massive rush of activity that causes my application to get really slow. My APM tool tells me this is happening because I am exhausting my database connection pool from the excessive load.

Instead of manually adjusting my connection pool size every time a major utilization spike occurs, I can create a Runbook and associate it with a policy so that it fires when the connection pool is exhausted. Here’s how we do it…

The RBA menus (Policies, Health Rules, and Actions) are found in the “Alert and Respond” section of the AppDynamics UI.

RBA Menus

We select the Policy menu item and click the “Create Policy” button which opens our Create Policy dialogue.

Create Policy

The first thing we want to do is provide a sensible name for our new policy. I chose “Extend Resource Pool”. Next, we need to select the event that acts as a trigger for our Runbook. In this case we choose the “Resource Pool Limit Reached” event and click on the “Next” button. This opens up the “Actions” dialogue shown below.

Policy Actions

Clicking on the green + icon allows us to add pre-existing actions or create new ones to use within our policy. In this case, we will click on the “Create Action” button to generate the proper actions required for remediation of this problem.

Create Action

In the Create Action dialogue we select the “Run a script or executable on problematic Nodes” radio button and hit the “OK” button to continue. This leads us to the “Create Remediation Script Action” dialogue. Here we provide a name for our action (“Increase Resource Pool Script”), the path to our script, the location where we want our log files saved, and a script timeout threshold. We also decide whether this action must be authorized by a human before it is executed.
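To make the script path, log location, and timeout settings concrete, here is a minimal Python sketch of a runner that executes a command, enforces a timeout, and appends the output to a log file. It is not how the AppDynamics agent runs remediation scripts; `run_remediation`, the log file name, and the stand-in command are all hypothetical and exist only for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_remediation(command, log_path, timeout_seconds):
    """Run a command, append its output to a log file, and report the outcome.

    Returns "ok", "failed", or "timeout". This is an illustrative runner,
    not the AppDynamics implementation.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed once the timeout threshold is exceeded.
        outcome, output = "timeout", ""
    else:
        outcome = "ok" if result.returncode == 0 else "failed"
        output = result.stdout + result.stderr

    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{outcome}: {' '.join(command)}\n{output}")
    return outcome


if __name__ == "__main__":
    # Hypothetical log location; a real action would use the configured path.
    log_file = Path(tempfile.gettempdir()) / "increase_resource_pool.log"
    # Stand-in for the real script: a child Python process that prints a message.
    stand_in = [sys.executable, "-c", "print('pool size increased')"]
    print(run_remediation(stand_in, log_file, timeout_seconds=30))
```

Keeping the timeout in the runner rather than in the script means a hung remediation cannot block later actions, which is presumably why the dialogue asks for a threshold up front.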

Remediation Script

Once we click OK the next dialogue is important and powerful. This is where we determine if the remediation action will be executed on all of the impacted nodes, a percentage of impacted nodes, or a defined number of nodes. You probably don’t want to run Thread Dumps on all of your impacted nodes at the same time so this is a great way to limit the scope of your remediation action if needed. In our case we want every node repaired right away so we have selected 100% of impacted nodes.
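The three scoping options (all impacted nodes, a percentage, or a fixed number) can be sketched in a few lines of Python. This is only an illustration of the idea; `select_nodes` is a hypothetical helper, and both the rounding up for percentages and the choice of the first nodes in the list are assumptions, not documented AppDynamics behavior.

```python
import math


def select_nodes(impacted_nodes, mode, value=None):
    """Return the subset of impacted nodes a remediation action should run on.

    mode is "all", "percent" (value is 0-100), or "count" (value is a node count).
    """
    total = len(impacted_nodes)
    if mode == "all":
        count = total
    elif mode == "percent":
        if not 0 <= value <= 100:
            raise ValueError("percent must be between 0 and 100")
        # Assumption: round up so a small percentage still touches one node.
        count = math.ceil(total * value / 100)
    elif mode == "count":
        if value < 0:
            raise ValueError("count must not be negative")
        count = min(value, total)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return impacted_nodes[:count]


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(1, 9)]
    print(select_nodes(nodes, "percent", 100))  # every impacted node, as in this walkthrough
    print(select_nodes(nodes, "count", 2))      # limited scope, e.g. for thread dumps
```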

Configure Action

We save our action and notice that the new action is now shown in the “Extend Resource Pool” Policy actions box. We can add as many actions to an individual policy as are required to gather data, remediate, and alert. When we are done adding actions we save our work and our new policy is shown in the AppDynamics UI list of Policies.

New Policy With Action

So what’s the end result of our work? Our application is currently running under load. In the top right corner of the application flow map is the Events panel. We see 1 event in there and it is categorized as a “Code Problem”.

Code Problems

Clicking on that event, we launch into the events workspace. We see a description of the event and that our remediation script was executed (we increased the size of the database connection pool). We can explore the event further if we choose to, but for this blog post we will just jump to the actual results of our action.

Resource Limit Reached

By looking at the chart shown below we can see that as our load increased the average response time (blue line) of our transactions was steadily increasing to almost 10 seconds. Meanwhile the transaction throughput (green and orange bars) remained low during the period where our connection pool was a bottleneck. You can see the point at 8:17 AM where the remediation runbook automatically kicked in and increased the size of the connection pool for us. This alleviated our resource contention and throughput increased dramatically while response time improved to around 1 second.

Review of results

This is just one simple but powerful example of what you can do with Application Runbook Automation from AppDynamics. Request your free trial of AppDynamics Pro today and see what we can do for your applications.

  • Henry Steinhauer

    How detailed do I need to make these actions? Are there keywords that are passed to the action so I know which pool is having the problem and therefore which one needs to be increased?

    Living in both worlds (JAVA Pool Sizes and the DB Total Pool) there can be impacts if the JAVA pool size is now larger than the DB Total Pool size.

    Are there considerations that once this is done, it should not be done again for a certain period of time? Or that others should be called into the action if the rules would determine that this needs to be done yet again within a short period of time?

    I have hundreds of pools when I look at the applications in my domain. Do I need hundreds of rules to address each of these pools? Is there some scripting that can be done to make this general purpose?

    • Jim Hirschauer

      Hi Henry, great questions! Here are my answers…

      Q – “How detailed do I need to make these actions? Are there keywords that are passed to the action so I know which pool is having the problem and therefore which one needs to be increased?”

      A – There are several environment variables that are exported to the script’s environment, e.g. event type, affected entity, etc.

      Q – “Living in both worlds (JAVA Pool Sizes and the DB Total Pool) there can be impacts if the JAVA pool size is now larger than the DB Total Pool size.

      Are there considerations that once this is done, it should not be done again for a certain period of time? Or that others should be called into the action if the rules would determine that this needs to be done yet again within a short period of time?”

      A – That could happen, so you want to make sure you size the pools so you don’t overflow. The events fire every minute until the problem is remediated. In a production environment you would probably want to increase the pool more stepwise… e.g. increase by x connections at a time.

      Q – “I have hundreds of pools when I look at the applications in my domain. Do I need hundreds of rules to address each of these pools? Is there some scripting that can be done to make this general purpose?”

      A – You can make the script general purpose.

      Hopefully this reply answers your questions in a meaningful way. I’d suggest you request a free 30-day trial of our 3.7 Pro version when it is released, as that is how you can put everything through its paces for your exact environment and needs.
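The answers above (event details passed through environment variables, stepwise increases, and one general-purpose script) can be combined in a rough Python sketch. Everything named here is an assumption for illustration: the variable names `EVENT_TYPE` and `AFFECTED_ENTITY`, the pool name `orders-db`, the step and ceiling values, and the five-minute cooldown are all hypothetical, and the actual variable names exported by AppDynamics should be taken from the product documentation.

```python
import os
import time

# Hypothetical per-pool settings; a real script would load these from its own config.
# The ceiling keeps the application pool below the database's total connection limit.
POOL_LIMITS = {
    "orders-db": {"step": 10, "ceiling": 100},
}
DEFAULT_LIMITS = {"step": 5, "ceiling": 50}
COOLDOWN_SECONDS = 300  # assumed quiet period between resizes


def read_event_context(environ):
    """Pull the event details out of the environment (variable names are assumed)."""
    return {
        "event_type": environ.get("EVENT_TYPE", ""),
        "pool": environ.get("AFFECTED_ENTITY", ""),
    }


def plan_resize(pool, current_size, last_resize_at, now):
    """Decide the next pool size and what happened.

    Returns (size, "cooldown" | "escalate" | "resized").
    """
    limits = POOL_LIMITS.get(pool, DEFAULT_LIMITS)
    if last_resize_at is not None and now - last_resize_at < COOLDOWN_SECONDS:
        # The event fires repeatedly; skip if we resized very recently.
        return current_size, "cooldown"
    if current_size >= limits["ceiling"]:
        # Already at the limit: a human should look at this instead.
        return current_size, "escalate"
    new_size = min(current_size + limits["step"], limits["ceiling"])
    return new_size, "resized"


if __name__ == "__main__":
    context = read_event_context(os.environ)
    # The current size and last resize time would come from the application
    # server and a state file; fixed values keep this sketch runnable.
    print(context["pool"], plan_resize(context["pool"], 20, None, time.time()))
```

Because the pool name arrives with the event, a single script and a single policy can cover many pools, and unknown pools fall back to conservative defaults instead of needing their own rule.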


Copyright © 2014 AppDynamics. All rights Reserved.