Today we are blogging something a little different from our usual fare. I’m Jim Hirschauer the Operations Guy, and this is my esteemed colleague Dustin Whittle the Developer. In this blog post we’re going to discuss how we would take an application from inception through development, testing, QA, and into production. We’ll each comment on the different stages, provide our perspective on the tools we need at each stage, and explain how they help with automation, testing, and monitoring. Along the way we’ll call out the potential collaboration points to identify the areas where the DevOps approach provides the most value.
The software development loop runs from inception through development, testing, and QA, and into production and ongoing maintenance.
Inception and working with a product team
From an operational perspective, my first instinct is to understand the application architecture so that I can start thinking about the proper deployment model for the infrastructure components. Here are some of my operational questions and considerations for this stage:
- Are we using a public or private cloud?
- What is the lead time for spinning up each component and ensuring it complies with my company’s regulations?
- When do I need to provide a development environment to my dev team or will they handle it themselves?
- Does this application perform functions that other applications or services already handle? Operations should have high-level visibility into the application and service portfolio.
From a development perspective, my first milestone is to make sure the ops team fully understands the application and what it takes to deploy it to a pre-production environment. This is where we, the developers, sync with the product and ops teams and make sure we are aligned.
Planning for the product team:
- Is the project scope well defined? Is there a product requirements document?
- Do we have a well defined product backlog?
- Are there mocks of the user experience?
Planning for the ops team:
- What tools will we use for deployment and configuration management?
- How will we automate the deployment process and does the ops team understand the manual steps?
- How will we integrate our builds with our continuous integration server?
- How will we automate the provisioning of new environments?
- Capacity Planning – Do we know the expected production load?
There’s not a ton of activity at this stage for the operations team. This is really where the devops synergy comes into play. DevOps is simply operations working together with engineers to get things done faster in an automated and repeatable way. When it comes to scaling, the more automation in place the easier things will be in the long run.
Development and scoping production
This should start with a conversation between the dev and ops teams to sort out domain ownership. Depending on your organization and your peers’ strengths, this is a good time to decide who will be responsible for automating the provisioning and deployment of the application. The ops questions for deploying complex web applications:
- How do you provision virtual machines?
- How do you configure network devices and servers?
- How do you deploy applications?
- How do you collect and aggregate logs?
- How do you monitor services?
- How do you monitor network performance?
- How do you monitor application performance?
- How do you alert and remediate when there are problems?
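As a rough illustration of the “How do you monitor services?” question above, service monitoring can start with something as simple as a health-check poller. This is a minimal sketch, not a production monitor; the health URL and the injectable `fetch` parameter are assumptions made for the example.

```python
import urllib.request


def check_service(url, timeout=5, fetch=urllib.request.urlopen):
    """Return (healthy, detail) for a single service endpoint.

    `fetch` defaults to urllib but is injectable so the check can be
    exercised without real network access.
    """
    try:
        with fetch(url, timeout=timeout) as resp:
            healthy = 200 <= resp.status < 300
            return healthy, f"HTTP {resp.status}"
    except Exception as exc:
        # Any network or protocol failure counts as unhealthy.
        return False, str(exc)
```

In practice you would run a check like this on a schedule and feed the results into whatever alerting pipeline answers the “how do you alert and remediate” question.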
During the development phase the operations focused staff normally make sure the development environment is managed and are actively working to set up the test, QA and Prod environments. This can take a lot of time if automation tools aren’t used.
Here are some tools you can use to automate server build and configuration:
Meanwhile, the operations staff should also make sure that the developers have access to tools which will help them with release management and application monitoring and troubleshooting. Here are some of those tools:
Application + Network Performance Management:
Testing and Quality Assurance
Once developers have built unit and functional tests, we need to ensure the tests run after every commit and that we don’t allow regressions into our promoted environments. In theory, developers should do this before they commit any code, but oftentimes problems don’t show up until you have production traffic running on production infrastructure. The goal of this step is to simulate, as much as possible, everything that can go wrong, find out what happens, and learn how to remediate it.
The next step is to do capacity planning and load testing to be confident the application doesn’t fall over when it is needed most. There are a variety of tools for load testing:
- Apica Load Test - Cloud-based load testing for web and mobile applications
- Soasta – Build, execute, and analyze performance tests on a single, powerful, intuitive platform.
- Bees with Machine Guns – A utility for arming (creating) many bees (micro EC2 instances) to attack (load test) targets (web applications).
- MultiMechanize – Multi-Mechanize is an open source framework for performance and load testing. It runs concurrent Python scripts to generate load (synthetic transactions) against a remote site or service. Multi-Mechanize is most commonly used for web performance and scalability testing, but can be used to generate workload against any remote API accessible from Python.
- Google PageSpeed Insights - PageSpeed Insights analyzes the content of a web page, then generates suggestions to make that page faster. Reducing page load times can reduce bounce rates and increase conversion rates.
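To make the load-testing idea concrete, here is a minimal sketch of a concurrent load generator in the spirit of the tools above. The `request_fn` callable is a stand-in for whatever HTTP call your application serves; a real test would drive a realistic mix of business transactions rather than a single request type.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(request_fn, total_requests=100, concurrency=10):
    """Fire total_requests calls to request_fn across a thread pool and
    return simple latency statistics in milliseconds."""
    def timed_call(_):
        start = time.perf_counter()
        request_fn()
        return (time.perf_counter() - start) * 1000

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))

    latencies.sort()
    return {
        "requests": len(latencies),
        "mean_ms": statistics.mean(latencies),
        "p95_ms": latencies[int(len(latencies) * 0.95) - 1],
    }
```

A sketch like this is useful for smoke-level capacity checks; the dedicated tools listed above add ramp profiles, distributed agents, and reporting.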
The last step of testing is discovering all of the possible failure scenarios and coming up with a disaster recovery plan. For example, what happens if we lose a database or a data center, or see a 100x surge in traffic?
During the test and QA stages operations needs to play a prominent role. This is often overlooked by ops teams but their participation in test and QA can make a meaningful difference in the quality of the release into production. Here’s how.
If the application is already in production (and monitored properly), operations has access to production usage and load patterns. These patterns are essential to the QA team for creating a load test that properly exercises the application. I once watched a functional test where 20+ business transactions were tested manually by the application support team. Directly after the functional test I watched the load test that ran the same 2 business transactions over and over again. Do you think the load test was an accurate representation of production load? No way! When I asked the QA team why there were only 2 transactions they said “Because that is what the application team told us to model.”
The development and application support teams usually don’t have time to sit with the QA team and give them an accurate assessment of what needs to be modeled for load testing. Operations teams should work as the middle man and provide business transaction information from production or from development if this is an application that has never seen production load.
Here are some of the operational tasks during testing and QA:
- Ensure monitoring tools are in place.
- Ensure environments are properly configured.
- Participate in functional, load, stress, leak, and other tests, providing analysis and support.
- Provide guidance to the QA team.
Production is traditionally the domain of the operations team. For as long as I can remember, the development teams have thrown applications over the production wall for the operations staff to deal with when there are problems. Sure, some problems like hardware issues, network issues, and cooling issues are purely on the shoulders of operations–but what about all of those application specific problems? For example, there are problems where the application is consuming way too many resources, or when the application has connection issues with the database due to a misconfiguration, or when the application just locks up and has to be restarted.
I recall getting paged in the middle of the night for application-related issues and thinking how much better each release would be if the developers had to support their applications once they made it to production. It was really difficult back in those days to say with any certainty that the problem was application related and that a developer needed to be involved. Today’s monitoring tools have changed that and allow for problem isolation in just minutes. Since developers in financial services organizations are not allowed access to production servers, it makes having the proper tools all the more important.
Production devops is all about:
- deploying code in a fast, repeatable, scalable manner
- rapidly identifying performance and stability problems
- alerting the proper team when a problem is detected
- rapidly isolating the root cause of problems
- automatic remediation of known problems and rapid manual remediation of new problems (runbooks and runbook automation)
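The last bullet above, automatic remediation of known problems, can be sketched as a runbook lookup: known alert types map to scripted fixes, and everything else escalates to a human. This is an illustrative skeleton; the alert shape and runbook contents are assumptions for the example.

```python
def remediate(alert, runbook, escalate):
    """Look up an automated fix for a known alert type; escalate otherwise.

    `runbook` maps alert-type strings to remediation callables,
    `escalate` is called with the alert when no automated fix exists.
    """
    action = runbook.get(alert["type"])
    if action is None:
        escalate(alert)          # new problem: page a human with context
        return "escalated"
    action(alert)                # known problem: run the scripted fix
    return "auto-remediated"
```

The point of encoding the runbook this way is that every manual 3 a.m. fix becomes a candidate for a scripted entry, shrinking the set of problems that wake anyone up.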
Your application must always be available and operating correctly during business hours (this may be 24×7 for your specific application).
In case of failures alerting tools are crucial to notify the ops team of serious issues. The operations team will usually have a runbook to turn to when things go wrong. A best practice is to collaborate on incident response plans.
Finally we’ve made it to the last major category of the SDLC, maintenance. As an operations guy my mind focuses on the following tasks:
- Capacity planning – Do we have enough resources available to the application? If we use dynamic scaling, this is not an issue but a task to ensure the scaling is working properly.
- Patching – are we up to date with patches on the infrastructure and application components? This is supposed to help with performance and/or security and/or stability but it doesn’t always work out that way.
- Support – are we current with our software support levels (aka, have we paid and are we on supported versions)?
- New releases (application updates) – New releases always made me cringe since I assumed the release would have issues the first week. I learned this reaction from some very late nights immediately following those new releases.
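The capacity planning task above, verifying that dynamic scaling does the right thing, ultimately reduces to a decision function you can test in isolation. A minimal sketch, with invented thresholds and node limits:

```python
def scaling_decision(cpu_samples, current_nodes, high=0.75, low=0.25,
                     min_nodes=2, max_nodes=20):
    """Return the desired node count based on mean CPU utilization.

    Scale out one node above `high`, scale in one node below `low`,
    otherwise hold steady; always respect the min/max bounds.
    """
    mean_cpu = sum(cpu_samples) / len(cpu_samples)
    if mean_cpu > high and current_nodes < max_nodes:
        return current_nodes + 1
    if mean_cpu < low and current_nodes > min_nodes:
        return current_nodes - 1
    return current_nodes
```

Keeping the policy pure like this means the ops team can unit test the scaling behavior instead of discovering surprises during a traffic surge.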
As a developer, the biggest issue during the maintenance phase is working with the operations team to deploy new versions and make critical bug fixes. The other primary concern is troubleshooting production problems. Even when no new code has been deployed, failures sometimes happen. If you have a great process, application performance monitoring, and a devops mentality, collaborating with ops to resolve the root cause of failures becomes easy.
As you can see, the dev and ops perspectives are pretty different, but that’s exactly why those two sides of the house need to tear down the walls and work together. DevOps isn’t just a set of tools, but a philosophical shift that requires buy-in from everyone involved to really succeed. It’s only through a high level of collaboration that things will change for the better. AppDynamics can’t change the mindset of your organization, but it is a great way to foster collaboration across all of your organizational silos. Sign up for your free trial today and make a difference for your organization.
Last week I flew into Las Vegas for #Interop fully suited and booted in my big blue costume (no joke). I’d been invited to speak in a vendor debate on User eXperience (UX): Monitor the Application or the Network? NetScout represented the Network, AppDynamics (and me) represented the Application, and “Compuware dynaTrace Gomez” sat on the fence representing both. Moderating was Jim Frey from EMA, who did a great job introducing the subject, asking the questions and keeping the debate flowing.
At the start each vendor gave their usual intro and company pitch, followed by their own definition on what User Experience is.
Defining User Experience
So at this point you’d probably expect me to blabber on about how application code and agents are critical for monitoring the UX? Wrong. For me, users experience “Business Transactions” – they don’t experience applications, infrastructure, or networks. When a user complains, they normally say something like “I can’t log in” or “My checkout timed out.” I can honestly say I’ve never heard them say, “The CPU utilization on your machine is too high” or “I don’t think you have enough memory allocated.”
Now think about that from a monitoring perspective. Do most organizations today monitor business transactions? Or do they monitor application infrastructure and networks? The truth is the latter, normally with several toolsets. So the question “Monitor the Application or the Network?” is really the wrong question for me. Unless you monitor business transactions, you are never going to understand what your end users actually experience.
Monitoring Business Transactions
So how do you monitor business transactions? The reality is that both application and network monitoring tools are capable of it, but most solutions have been designed not to, providing instead a more technical view for application developers and network engineers. This is wrong, very wrong, and a primary reason why IT never sees what the end user sees or complains about. Today, SOA means applications are more complex and distributed, so a single business transaction could traverse multiple applications that potentially share services and infrastructure. If your monitoring solution doesn’t have business transaction context, you’re basically blind to how application infrastructure is impacting your UX.
The debate then switched to how monitoring the UX differs from an application and network perspective. Simply put, application monitoring relies on agents, while network monitoring relies on sniffing network traffic passively. My point here was that you can either monitor user experience with the network or you can manage it with the application. For example, with network monitoring you only see business transactions and the application infrastructure, because you’re monitoring at the network layer. In contrast, with application monitoring you see business transactions, application infrastructure, and the application logic (hence why it’s called application monitoring).
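To illustrate the agent-based side of that comparison, here is a toy sketch of timing named business transactions (like “Login” or “Checkout”) from inside the application rather than inferring them from packets. Real agents instrument code automatically; this hand-rolled context manager just shows the idea.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# transaction name -> list of observed durations in milliseconds
timings = defaultdict(list)


@contextmanager
def business_transaction(name):
    """Record wall-clock time for a named business transaction."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record even when the transaction raises, so failures are timed too.
        timings[name].append((time.perf_counter() - start) * 1000)
```

Usage is simply `with business_transaction("Checkout"): ...` around the relevant code path; the resulting per-transaction timings are exactly the user-centric view the network layer cannot see.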
Monitor or Manage the UX?
Both application and network monitoring can identify and isolate UX degradation, because they see how a business transaction executes across the application infrastructure. However, you can only manage UX if you can understand what’s causing the degradation. To do this you need deep visibility into the application run-time and logic (code). Operations telling a Development team that their JVM is responsible for a user experience issue is a bit like FedEx telling a customer their package is lost somewhere in Alaska. Identifying and isolating pain is useful, but one could argue it’s pointless without being able to manage and resolve the pain (by finding the root cause).
NetScout made the point that with network monitoring you can identify common bottlenecks in the network that are responsible for degrading the UX. I have no doubt you could, but if you look at the most common reason for UX issues, it’s related to change – and if you look at what changes the most, it’s application logic. Why? Because Development and Operations teams want to be agile, so that their applications and business remain competitive in the marketplace. Agile release cycles mean application logic (code) constantly changes. It’s therefore not unusual for an application to change several times a week, and that’s before you count hotfixes and patches. So if applications change more than the network, one could argue application monitoring is the more effective way to monitor and manage the end user experience.
UX and Web Applications
We then debated which monitoring concept was better for web-based applications. Obviously, network monitoring is able to monitor the UX by sniffing HTTP packets passively, so it’s possible to get granular visibility on QoS in the network and application. However, the recent adoption of Web 2.0 technologies (Ajax, GWT, Dojo) means application logic is now moving from the application server to the user’s browser. This means browser processing time becomes a critical part of the UX. Unfortunately, network monitoring solutions can’t monitor browser processing latency (because they monitor the network), unlike application monitoring solutions, which can use techniques like client-side instrumentation or web-page injection to obtain browser latency for the UX.
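Web-page injection can be as simple as the server inserting a timing-beacon script into outgoing HTML before it reaches the browser. A hedged sketch: the `/beacon.js` path is invented for illustration, and real products perform this rewriting automatically in the agent.

```python
def inject_beacon(html, beacon_js='<script src="/beacon.js"></script>'):
    """Insert a timing-beacon script tag just before </head>, if present.

    Pages without a <head> close tag pass through unchanged.
    """
    marker = "</head>"
    if marker in html:
        # Only patch the first occurrence; leave the rest of the page intact.
        return html.replace(marker, beacon_js + marker, 1)
    return html
```

The injected script would measure browser-side render and processing time and report it back, which is exactly the latency a packet sniffer sitting on the server network never observes.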
The C Word
We then got to the Cloud and which made more sense for monitoring UX. Well, network monitoring solutions are normally hardware appliances which plug directly into a network tap or span port. I’ve never asked, but I’d imagine the guys in Seattle (Amazon) and Redmond (Windows Azure) probably wouldn’t let you wheel a network monitoring appliance into their data-centre. More importantly, why would you need to if you’re already paying someone else to manage your infrastructure and network for you? Moving to the Cloud is about agility, and letting someone else deal with the hardware and pipes so you can focus on making your application and business competitive. It’s actually very easy for application monitoring solutions to monitor UX in the cloud. Agents can piggyback with application code libraries when they’re deployed to the cloud, or cloud providers can embed and provision vendor agents as part of their server builds and provisioning process.
What’s interesting also is that Cloud is highlighting a trend towards DevOps (or NoOps for a few organizations) where Operations become more focused on applications vs infrastructure. As the network and infrastructure becomes abstracted in the Public Cloud, then the focus naturally shifts to the application and deployment of code. For private clouds you’ll still have network Ops and Engineering teams that build and support the Cloud platform, but they wouldn’t be the people who care about user experience. Those people would be the Line of Business or application owners which the UX impacts.
In reality most organizations today already monitor the application infrastructure and network. However, if you want to start monitoring the true UX, you should monitor what your users experience, and that is business transactions. If you can’t see your users’ business transactions, you can’t manage their experience.
What are your thoughts on this?
I did have an hour spare at #Interop after my debate to meet and greet our competitors, before flying back to AppDynamics HQ. It was nice to see many of them meet and greet the APM Caped Crusader.
App Man.
Round Two – Last time I wrote a blog comparing APM versus network-based APM tools, which I still consider NPM at its core regardless of what some critics and competitors claim. Let me make one thing clear though: NPM is great for equipping IT network administrators to see how fast or slow data is traveling through the pipes of their application. Unfortunately, network-based APM tools simply cannot give App Ops granular visibility into the application runtime when isolating bottlenecks beyond the system level, all the way to their final destination – the end user’s browser.
I find several of the blogs and YouTube clips from such NPM vendors quite comical as they try to throw punches at APM companies. Their arguments are centered primarily against agent-based approaches being an inadequate APM solution due to today’s fickle and distributed application architectures. It’s not like I haven’t heard it before.
The amusing thing about it…they’re completely right! In fact, we couldn’t agree more, and that’s why Jyoti Bansal founded AppDynamics to address these perennial shortcomings legacy APM vendors have been ignoring. From the smallest businesses to the largest enterprises, companies have complex applications that have outpaced their App Ops teams’ current set of monitoring tools. That’s why AppDynamics is reinventing and reigniting the application performance management space by enabling IT operations to monitor complex, modern applications running in the cloud or the data center. So let me respond to those claims they’ve made.
“Agents have high deployment and ongoing maintenance burden.”
Legacy APM: TRUE
AppDynamics: FALSE. No manual instrumentation required. It’s automatic.
“Agents are invasive which can perturb the systems being monitored.”
Legacy APM: TRUE
AppDynamics: FALSE. Our customers see less than 1-2% overhead in production.
All AppDynamics. The next-gen of APM.
I drew a parallel in my previous post that using NPM concepts to monitor application performance is like inspecting UPS packages en route to figure out why operations at a hub came to a screeching halt. Remember, even if the package contents are visible from afar, that doesn’t explain why the hub conveyors, which electronically guide packages to their appropriate destination chutes, are broken, nor can it identify why cargo operations have stalled. In other words, good luck trying to gather anything beyond the scope of the application’s infrastructure. Using network monitoring tools to collect even the most basic system health metrics such as CPU utilization, memory usage, thread pool consumption and thrashing? Time to throw in the towel.
And what about End User Monitoring?
What’s becoming just as important as monitoring server-side processing and network time is the ability to monitor end user performance. When NPM tools can only see the last packet sent from the server, how does that help you understand the browser’s performance? It doesn’t, since once again this kind of analysis is only feasible higher up the stack at the Application Layer. And just to clarify, when I say Application Layer, I mean application execution time, not “network process time to application” as defined by OSI Layer 7.
On top of that, what about those customers running their applications in a public cloud? Are you going to convince your cloud provider to install a network appliance into their infrastructure? I highly doubt it. With AppDynamics, we have partnerships with cloud providers such as Amazon EC2, Azure, RightScale and Opsource allowing developers and operations to easily deploy AppDynamics with a flick of a switch and monitor their applications in production 24/7.
Once again, next-gen APM triumphs over NPM-based application performance monitoring not just on the server side, but also in the browser. AppDynamics is embracing this and is fully aware of the technical and business significance of monitoring end user performance. We’re delighted to offer this kind of end-to-end visibility to our customers, who will now be able to monitor application performance from the end user’s browser to the backend application tiers (databases, mainframes), all through a single pane of glass.
Another three-letter acronym I see frequently mixed in with APM is NPM, which stands for Network Performance Management. At first glance they look very similar. The distinction appears subtle, just a one-letter difference, but that difference speaks volumes because their core technologies and approaches to monitoring application performance are fundamentally different.
Application Performance Management tools typically use agents that live in the application run-time which capture performance details of how application logic computes across the application infrastructure.
In contrast, Network Performance Monitoring tools are agent-less appliances that sit on the network analyzing content and traffic by capturing packets flowing through the network. They can measure response times and find errors for your applications by understanding the wire protocols across the application tiers. Like agent based APM solutions, they can automatically discover your application topology, but NPM tools lack deep code-level diagnostics which is paramount to helping App Ops and Dev teams solve problems. Here’s why.
Imagine your application’s architecture is FedEx providing package delivery services to its customers in a timely manner. If you were to apply the concept of network monitoring to this analogy, using an NPM tool would help track how long it takes for packages to travel in transit from one hub to the next, examining their contents and the expected arrival time at their final destination.
Now let’s say that a package was being sent from Paris to Los Angeles but hit a snag at the London hub along the way. Operations at London came to a screeching halt for some inexplicable reason, and your task is to figure out why and how to fix it ASAP so operations are running smoothly again. In this case, APM is the favorable solution over NPM. APM tools not only empower you with the ability to see how long it takes for your package to travel to different locations, but also how the package is processed at each facility and why operations came to a standstill. At this juncture, you probably wouldn’t even care to analyze the package’s contents, since you already know what’s inside and it’s still in London.
When applying this concept to the world of business applications, NPM tools won’t provide you with visibility into application logic inefficiencies, especially when experiencing a performance issue or service level breach to key business transactions. NPM performance metrics are derived from captured packets and the network protocols the servers support, but if you’re making inefficient, iterative business logic calls in your code, how will capturing and analyzing packets help you triage the problem? It’s the same story for identifying the root cause of memory leaks or thread synchronization issues; execution and visibility like this occur at the application layer, not at the network layer. Thus, I would argue that NPM complements APM but doesn’t serve as a viable replacement.
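The “inefficient, iterative business logic calls” point is easy to demonstrate: a per-function call counter is the kind of data an in-process agent sees and a packet capture never will. A toy sketch, not a real APM agent:

```python
import functools
import time

# function name -> {"calls": n, "total_ms": accumulated time}
call_stats = {}


def traced(fn):
    """Count calls and accumulate execution time per function, agent-style."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            stats = call_stats.setdefault(fn.__name__,
                                          {"calls": 0, "total_ms": 0.0})
            stats["calls"] += 1
            stats["total_ms"] += (time.perf_counter() - start) * 1000
    return wrapper
```

If a request makes the same database lookup 500 times in a loop, these stats expose it immediately, while on the wire the same pathology is just a blur of individually fast packets.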
Now before I enumerate my theories as to why I see this trend, let me just say: while I respect their hustle, nice try, but not quite. While scouring the web I found many NPM companies trying to crash the popular APM party by marketing their network performance monitoring utilities as application monitoring and management solutions. It can be confusing for a first-time buyer evaluating APM tools when NPM companies use phrases like “network-based APM” or “gather data already on your network” in their product messaging.
One of the reasons for this messaging is that NPM is an immensely crowded market. Here’s a list of NPM tools compiled by some IT folks at Stanford. There are over 300 network monitoring tools, and that’s only going back five years to 2007! Take your pick. Second, applications and the tools to monitor their performance and availability speak to customers at a higher level, both technically and from a business perspective. If I’m in charge of some transaction-intensive application, my customers are directly interfacing with the Application Layer in order to complete business transactions, which takes place at the highest layer of the stack. Customers aren’t 5-6 levels deep at the Network Layer interacting with packets or the routers and switches they pass through.
By the way, have you seen Gartner’s APM spend analysis? It’s a $2 billion market and growing. This most certainly is a valid reason why NPM solutions are now positioning themselves as an APM solution to customers.
If you have absolutely nothing to monitor any part of your application’s infrastructure, then something is better than nothing. However, at some point you’ll need better visibility at the application layer, regardless of how fast packets are traveling through the pipes. NPM won’t necessarily provide the range of visibility your App Ops and Dev teams need to diagnose, troubleshoot, and identify problem root causes.
Those who live day in and day out in the networking world will naturally gravitate to what they’re familiar with, but if you haven’t ventured out into the evolving APM market, I highly recommend it. Even though you can achieve some level of performance monitoring with NPM, you’ll find yourself running into limitations in application visibility pretty quickly. But as the old saying goes, “If all you have is a hammer, then everything looks like a nail.”
I’m curious to know what your thoughts are. Which class of tools would you choose? Why?