CAT | Automation
Jean-Pierre (JP) Garbani of the analyst firm Forrester recently wrote a research paper titled “Technology Spotlight: Automate Application Performance Management - Automate Your Incident Management Process With Run Book Automation”. In this paper, JP argues that IT must embrace automation to be successful moving forward. He goes on to describe Run Book Automation (RBA) and how it applies to Application Performance Management (APM).
One of the most interesting parts of the research note is a graphical depiction of the intersection between APM and RBA. While I can’t show that image in this blog post, I can say that it is a spot-on depiction of how RBA works within AppDynamics. For those who are not familiar, AppDynamics decided to lead the APM industry in a new direction by listening to our customers and making RBA an important and fully integrated part of our product.
If you take nothing else away from this blog post or from JP’s research paper, you need to understand this key insight: APM-based RBA works so much better than traditional RBA because APM understands exactly which application nodes are impacted at any given time and can perform run book remediation on those nodes without any input from a user.
The Old Way
Traditional RBA requires that you write tests to act as triggers for run book workflows. You run these tests against pre-defined sets of infrastructure and application components when there is a problem so that your runbooks can fix any known issues. If your application and infrastructure change, you need to manually modify the list of components that are getting tested. This manual update process has been the downfall of RBA and other technologies (CMDB anyone?) as people have come to realize the time investment required to keep everything current. This problem is amplified by today’s dynamic application technologies like virtualization and cloud computing.
The New Way
Forrester and AppDynamics agree that the answer to the traditional RBA problem described above is to use APM to dynamically track the current state of application and infrastructure components and to identify the problems that trigger run book workflows for resolution. Identifying issues from within the application, in real time, is a giant step forward from the pre-determined interval testing of traditional RBA. And when you combine this capability with real-time business metrics, you get a new capability that enables the business to react immediately to problems that have nothing to do with IT.
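To make the contrast concrete, here is a minimal sketch of the idea in Python. It is not AppDynamics code: the `NodeHealth` shape, the thresholds, and the `remediate` callback are all assumptions made up for illustration. The point is only that the list of target nodes is computed from live health data each time, instead of being maintained by hand.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class NodeHealth:
    """Health snapshot for one application node (hypothetical shape)."""
    node: str
    avg_response_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def impacted_nodes(
    snapshots: Iterable[NodeHealth],
    max_response_ms: float = 2000.0,  # assumed threshold, tune per application
    max_error_rate: float = 0.05,     # assumed threshold, tune per application
) -> List[str]:
    """Return the nodes that currently breach either threshold."""
    return [
        s.node
        for s in snapshots
        if s.avg_response_ms > max_response_ms or s.error_rate > max_error_rate
    ]


def run_workflow(
    snapshots: Iterable[NodeHealth],
    remediate: Callable[[str], str],
) -> Dict[str, str]:
    """Run the remediation step only on the nodes that are impacted right now."""
    return {node: remediate(node) for node in impacted_nodes(snapshots)}
```

Because the targets come from the monitoring data, a node that was added yesterday is covered without anyone editing a component list, and healthy nodes are left alone.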
Application run book automation can be used by any organization large or small. If there are issues within your environment that have known fixes then you can use application RBA to automatically detect and remediate those problems within seconds. Stop wasting your time doing repetitive tasks and try AppDynamics for free today. If you’d like to read the Forrester research paper in its entirety you can download it for free by clicking here.
Today we are blogging something a little different from our normal fare. I’m Jim Hirschauer, the Operations Guy, and this is my esteemed colleague Dustin Whittle, the Developer. In this blog post we’re going to discuss how we would take an application from inception through development, testing, QA, and into production. We’ll each comment on the different stages and provide our perspective on the tools that we need to use at each stage and how they help with automation, testing, and monitoring. Along the way we’ll call out the potential collaboration points to identify the areas where the DevOps approach provides the most value.
The software development loop looks like this:
Inception and working with a product team
From an operational perspective, my first instinct is to understand the application architecture so that I can start thinking about the proper deployment model for the infrastructure components. Here are some of my operational questions and considerations for this stage:
- Are we using a public or private cloud?
- What is the lead time for spinning up each component and ensuring they comply with my company’s regulations?
- When do I need to provide a development environment to my dev team or will they handle it themselves?
- Does this application perform functions that other applications or services already handle? Operations should have high-level visibility into the application and service portfolio.
From a development perspective, my first milestone is to make sure the ops team fully understands the application and what it takes to deploy it to a pre-production environment. This is where we the developers sync with the product and ops team and make sure we are aligned.
Planning for the product team:
- Is the project scope well defined? Is there a product requirements document?
- Do we have a well defined product backlog?
- Are there mocks of the user experience?
Planning for the ops team:
- What tools will we use for deployment and configuration management?
- How will we automate the deployment process and does the ops team understand the manual steps?
- How will we integrate our builds with our continuous integration server?
- How will we automate the provisioning of new environments?
- Capacity Planning – Do we know the expected production load?
There’s not a ton of activity at this stage for the operations team. This is really where the DevOps synergy comes into play. DevOps is simply operations working together with engineers to get things done faster in an automated and repeatable way. When it comes to scaling, the more automation in place, the easier things will be in the long run.
Development and scoping production
This should start with a conversation between the dev and ops teams to agree on domain ownership. Depending on your organization and your peers’ strengths, this is a good time to decide who will be responsible for automating the provisioning and deployment of the application. The ops questions for deploying complex web applications:
- How do you provision virtual machines?
- How do you configure network devices and servers?
- How do you deploy applications?
- How do you collect and aggregate logs?
- How do you monitor services?
- How do you monitor network performance?
- How do you monitor application performance?
- How do you alert and remediate when there are problems?
During the development phase the operations focused staff normally make sure the development environment is managed and are actively working to set up the test, QA and Prod environments. This can take a lot of time if automation tools aren’t used.
Here are some tools you can use to automate server build and configuration:
Meanwhile, the operations staff should also make sure that the developers have access to tools which will help them with release management and application monitoring and troubleshooting. Here are some of those tools:
Application + Network Performance Management:
Testing and Quality Assurance
Once developers have built unit and functional tests, we need to ensure the tests run after every commit and that we don’t allow regressions in our promoted environments. In theory, developers should do this before they commit any code, but often problems don’t show up until you have production traffic running on production infrastructure. The goal of this step is to simulate, as much as possible, everything that can go wrong and find out what happens and how to remediate it.
The next step is to do capacity planning and load testing to be confident the application doesn’t fall over when it is needed most. There are a variety of tools for load testing:
- Apica Load Test - Cloud-based load testing for web and mobile applications
- Soasta – Build, execute, and analyze performance tests on a single, powerful, intuitive platform.
- Bees with Machine Guns – A utility for arming (creating) many bees (micro EC2 instances) to attack (load test) targets (web applications).
- MultiMechanize – Multi-Mechanize is an open source framework for performance and load testing. It runs concurrent Python scripts to generate load (synthetic transactions) against a remote site or service. Multi-Mechanize is most commonly used for web performance and scalability testing, but can be used to generate workload against any remote API accessible from Python.
- Google PageSpeed Insights - PageSpeed Insights analyzes the content of a web page, then generates suggestions to make that page faster. Reducing page load times can reduce bounce rates and increase conversion rates.
The last step of testing is discovering all of the possible failure scenarios and coming up with a disaster recovery plan. For example, what happens if we lose a database or a data center, or have a 100x surge in traffic?
During the test and QA stages operations needs to play a prominent role. This is often overlooked by ops teams but their participation in test and QA can make a meaningful difference in the quality of the release into production. Here’s how.
If the application is already in production (and monitored properly), operations has access to production usage and load patterns. These patterns are essential to the QA team for creating a load test that properly exercises the application. I once watched a functional test where 20+ business transactions were tested manually by the application support team. Directly after the functional test I watched the load test that ran the same 2 business transactions over and over again. Do you think the load test was an accurate representation of production load? No way! When I asked the QA team why there were only 2 transactions they said “Because that is what the application team told us to model.”
The development and application support teams usually don’t have time to sit with the QA team and give them an accurate assessment of what needs to be modeled for load testing. Operations teams should work as the middle man and provide business transaction information from production or from development if this is an application that has never seen production load.
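As a rough illustration of how production data can drive the load model, the sketch below turns observed business transaction counts into a virtual-user mix for a load test. The function name, the transaction names, and the counts are hypothetical, and the rounding approach (largest remainder) is just one reasonable choice, not a feature of any particular load testing tool.

```python
from typing import Dict


def load_test_mix(production_counts: Dict[str, int], virtual_users: int) -> Dict[str, int]:
    """Split virtual users across transactions in proportion to production volume."""
    total = sum(production_counts.values())
    if total <= 0:
        raise ValueError("production_counts must contain at least one call")

    # Exact (fractional) share of users for each business transaction.
    exact = {
        name: virtual_users * count / total
        for name, count in production_counts.items()
    }
    # Start from the rounded-down shares.
    mix = {name: int(share) for name, share in exact.items()}

    # Hand the leftover users to the transactions with the largest remainders
    # so the mix always adds up to exactly virtual_users.
    leftover = virtual_users - sum(mix.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - mix[name], reverse=True)
    for name in by_remainder[:leftover]:
        mix[name] += 1
    return mix
```

Feeding this the full list of production transactions, rather than the two that someone remembered, is what keeps the load test representative.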
Here are some of the operational tasks during testing and QA:
- Ensure monitoring tools are in place.
- Ensure environments are properly configured.
- Participate in functional, load, stress, leak, and other tests and provide analysis and support.
- Provide guidance to the QA team.
Production is traditionally the domain of the operations team. For as long as I can remember, the development teams have thrown applications over the production wall for the operations staff to deal with when there are problems. Sure, some problems like hardware issues, network issues, and cooling issues are purely on the shoulders of operations–but what about all of those application specific problems? For example, there are problems where the application is consuming way too many resources, or when the application has connection issues with the database due to a misconfiguration, or when the application just locks up and has to be restarted.
I recall getting paged in the middle of the night for application-related issues and thinking how much better each release would be if the developers had to support their applications once they made it to production. It was really difficult back in those days to say with any certainty that the problem was application related and that a developer needed to be involved. Today’s monitoring tools have changed that and allow for problem isolation in just minutes. Since developers in financial services organizations are not allowed access to production servers, it makes having the proper tools all the more important.
Production devops is all about:
- deploying code in a fast, repeatable, scalable manner
- rapidly identifying performance and stability problems
- alerting the proper team when a problem is detected
- rapidly isolating the root cause of problems
- automatic remediation of known problems and rapid manual remediation of new problems (runbooks and runbook automation)
Your application must always be available and operating correctly during business hours (this may be 24×7 for your specific application).
In case of failures alerting tools are crucial to notify the ops team of serious issues. The operations team will usually have a runbook to turn to when things go wrong. A best practice is to collaborate on incident response plans.
Finally we’ve made it to the last major category of the SDLC, maintenance. As an operations guy my mind focuses on the following tasks:
- Capacity planning – Do we have enough resources available to the application? If we use dynamic scaling, this is not an issue but a task to ensure the scaling is working properly.
- Patching – are we up to date with patches on the infrastructure and application components? This is supposed to help with performance and/or security and/or stability but it doesn’t always work out that way.
- Support – are we current with our software support levels (aka, have we paid and are we on supported versions)?
- New releases (application updates) – New releases always made me cringe since I assumed the release would have issues the first week. I learned this reaction from some very late nights immediately following those new releases.
As a developer, the biggest issue during the maintenance phase is working with the operations team to deploy new versions and make critical bug fixes. The other primary concern is troubleshooting production problems. Even when no new code has been deployed, sometimes failures happen. If you have a great process, application performance monitoring, and a DevOps mentality, collaborating with ops to resolve the root cause of failures becomes easy.
As you can see, the dev and ops perspectives are pretty different, but that’s exactly why those two sides of the house need to tear down the walls and work together. DevOps isn’t just a set of tools, but a philosophical shift that requires buy-in from everyone involved to really succeed. It’s only through a high level of collaboration that things will change for the better. AppDynamics can’t change the mindset of your organization, but it is a great way to foster collaboration across all of your organizational silos. Sign up for your free trial today and make a difference for your organization.
Automation sets apart organizations at the top of their game from the rest of the pack. The limiting factor in most organizations is that they are usually too busy putting out fires and keeping up with all of their other obligations to expend the effort required to envision and build out their automation strategy. With that in mind I have created a small list of the automation tasks that I feel provide the most value to an organization. Along with these tasks I explain the type of effort involved and reward associated with each one. All of the information presented is based upon my 15 years of troubleshooting within enterprise operations and applications environments.
1. Collect troubleshooting metrics – To me this is the no-brainer of automation tasks. For each particular type of problem you always want certain information to help resolve that issue. This is also the easiest of my top three to implement but may provide less value than the other two. Here are some examples…
- Hung/Unresponsive JVM/CLR – Initiate and store thread dump, restart application on offending node.
- Slow transactions plus high server CPU utilization – Collect process listing to determine CPU contributors and spin up extra instance of application.
- Transactions throwing excessive errors – Search log files for errors and send the list to the appropriate personnel; based upon error type, possibly probe individual components deeper (see #2 below).
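One simple way to organize the examples above is a lookup table from problem type to an ordered list of actions. The sketch below is only an illustration: the problem-type keys and the action functions are hypothetical stand-ins that return a description instead of touching a real server, so you would replace them with your own scripts.

```python
from typing import Callable, Dict, List

Action = Callable[[str], str]


# Hypothetical stand-ins: each one only describes what a real script would do.
def capture_thread_dump(node: str) -> str:
    return f"thread dump stored for {node}"


def restart_application(node: str) -> str:
    return f"restart requested for {node}"


def collect_process_listing(node: str) -> str:
    return f"process listing collected from {node}"


def add_application_instance(node: str) -> str:
    return f"extra instance requested alongside {node}"


def search_logs_for_errors(node: str) -> str:
    return f"error lines extracted from logs on {node}"


# Order matters: evidence is collected before anything that destroys it.
RUNBOOK: Dict[str, List[Action]] = {
    "hung_runtime": [capture_thread_dump, restart_application],
    "slow_and_high_cpu": [collect_process_listing, add_application_instance],
    "excessive_errors": [search_logs_for_errors],
}


def handle_problem(problem_type: str, node: str) -> List[str]:
    """Run every action registered for the problem type, in order."""
    actions = RUNBOOK.get(problem_type)
    if actions is None:
        raise KeyError(f"no runbook entry for {problem_type!r}")
    return [action(node) for action in actions]
```

Keeping the thread dump ahead of the restart in the table is deliberate, since restarting first would throw away the information you wanted to collect.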
2. Probe application components – This one is really useful for figuring out difficult application problems but requires more effort to set up than #1. The basic concept is that you need to find out from the application support team what steps they would take manually to troubleshoot their application if it were slow or broken. The usual responses are things like “Check this log file for this word”, “Run this query against this database and if the output is -3 then I know what the problem is”, “Hit this URL to see what the response looks like. If the page does not return properly I know there is a problem with this component”, etc…
It may seem like a lot of work to set up this type of automated probing at first but once it is set up it becomes an invaluable troubleshooting tool. Imagine that the application response times get slow so these troubleshooting measures are automatically invoked and you get an email with the exact root cause within minutes of performance degradation. With this type of automation there is usually a known resolution based upon the probing results so that could be automated too. How handy is that?
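Here is a minimal sketch of what two of those manual checks could look like once scripted, using only the Python standard library. The function names, probe names, paths, and URLs are assumptions for illustration; the database query check is left out because it depends entirely on your database driver.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def log_contains(path: str, word: str) -> bool:
    """'Check this log file for this word.'"""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return any(word in line for line in handle)


def url_responds(url: str, timeout: float = 5.0) -> bool:
    """'Hit this URL to see what the response looks like.'"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts, and HTTP error statuses all land here.
        return False


def run_probes(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each probe and return the names of those that found a problem.

    Each probe is a zero-argument callable that returns True when it has
    detected its problem.
    """
    return [name for name, found_problem in probes.items() if found_problem()]
```

A probe set might then look like `{"payment page down": lambda: not url_responds(payment_url)}`, where `payment_url` is whatever endpoint your application team would check by hand. The list of names that comes back is what you would put in the email, or hand to a follow-up remediation step.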
3. Alert the business to changing conditions – This is where the IT staff has the ability to really make an impression and impact on the business. One of the most overlooked aspects of monitoring and automation is the ability to gather, baseline, alert, and act based upon pure business metrics. Here’s an example scenario…
Bob’s business has an e-commerce website that sells many different products. They use AppDynamics Pro to track the quantity of each item sold along with the total revenue of all sales throughout the business day (using the information point functionality). One day Bob gets an alert that the latest and greatest widgets are selling way below their typical volume. The automation engine has already searched the prices of major competitors and found that one competitor lowered prices and is undercutting the business. With this actionable information Bob is able to immediately match the pricing of the competition, and sales rates return to normal before too much damage is done.
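The “selling way below typical volume” part of that scenario comes down to comparing a current business metric against a baseline. The sketch below shows one simple way to do that with a mean and standard deviation. It is an assumption for illustration and not a description of how AppDynamics computes its baselines; the function name and the three-sigma default are made up.

```python
from statistics import mean, stdev
from typing import Sequence


def is_below_baseline(history: Sequence[float], current: float, sigmas: float = 3.0) -> bool:
    """Return True when the current value is unusually low compared to history.

    history holds the metric for comparable past periods, for example units
    of one product sold during the same hour on previous days.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical values to form a baseline")
    baseline = mean(history)
    spread = stdev(history)
    return current < baseline - sigmas * spread
```

With hourly widget sales of 100, 104, 98, 102, and 96 on previous days, a current value of 60 would trip the alert while 95 would not. The competitor price lookup from the scenario would be the action you attach to that alert.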
Obviously business operations automation can be the most complex but also the most rewarding. Reaching out to the business and having a conversation about their activities and processes can seem daunting but I can tell you from personal experience that the business is more than willing to participate in initiatives that save them time and money. This type of conversation also normally leads to the business asking about other ways to collaborate to make them more effective so it is a great way to improve the overall level of communication with the business.
AppDynamics Pro enables a new level of automation because it knows exactly what is going on inside of your applications from both a business and IT perspective. Your level of automation is limited only by your imagination. I recommend that you start out small with a single automation use case and build outward from there. You can use AppDynamics Pro for free to try out our application runbook automation functionality on your specific use case by clicking here.