TAG | DevOps

My favorite TV channel at the moment is Speed TV. Last week we had the Monaco Formula 1 Grand Prix where RedBull (fizzy drinks company) beat Ferrari – the most prestigious name in motorsport history! Formula 1 is all about finding performance through innovation and change. F1 teams have up to a hundred engineers working flat out on every aspect of the car from aero to engine maps hoping to find a few tenths of a second to out-qualify, race and beat the competition.

The development of a F1 car is relentless with testing sessions only allowed during pre-season in order to reduce costs. Teams can only test new ideas and car updates at race weekends when they’re head to head with competitors, and limited to just 6 hours of track time. There is no such thing as a test environment, teams can’t just reboot or replace their car when a driver hits a wall or engine blows up. A component failure or crash means less track time to monitor and find the optimal car setup. They basically operate in a live production environment. With some teams bringing up to 35 different car updates for every race, it’s down to the engineers and support teams to continuously monitor and figure out which updates deliver the most improvement, pace and reliability. Not finishing a race or being slow has a massive impact on prize money and sponsorship loyalty.

Read the Full Post…

Link to this post:

, , , , , ,

App Man

Agile Operations

Greg Howard

Gartner on Application Uptime & Availability

Attendees of the 2010 Gartner Data Center Conference exhibited a lot of interest around Application Performance Management, as evidenced by the packed room where Gartner analyst Donna Scott discussed application availability in her session “Uptime All the Time.”

Read the Full Post…

Link to this post:

, ,

San Francisco’s QCon was expecting a smaller crowd, but ended up bursting at the seams: the event sold out weeks ahead of time and in many sessions it was standing room only.

Targeted at architects and operations folks as much as developers, QCon was heavy on the hot topics of the day: SOA, agile, and DevOps. But if there was a consistent trend throughout the three days, it was “No more theory. Show us the practice.”

At Jesper Boeg’s talk for example—“Raising the Bar: Using Kanban and Lean to Super Optimize Your Agile Implementation”—the talk was peppered with some good sound bites (“If it hurts, do it more often and bring the pain forward”). But it also stressed the meat: Boeg demonstrated a “deployment pipeline” that represented an automated implementation of the build, deploy, test, and release process—a way to find and eliminate bottlenecks in agile delivery.

Similarly, John Allspaw started high in his talk—sharing his ideas on the areas of ownership and specialization between Ops and Dev, a typical DevOps presentation—but backs up the theory with code-level discussions of how logging, metrics, and monitoring works at Etsy.  (His blog entry on the subject and complete Qcon slides can be found on his blog, Kitchen Soap.)

Adrian Cockroft, who is leading a trailblazing public cloud deployment of production-level applications at Netflix, also wrapped theory around juicy substance. He “showed the code” and the screenshots of his company’s app scaling and monitoring tools (you can find his complete slide presentation here).

Not everyone took the time to drill down, though. Tweets from QCon attendees showed that the natives got restless in talks that stayed too high level:

“OK, just because you can draw a block diagram out of something doesn’t mean it makes sense.”

“Ok, we get it. Your company is very interesting, now get to the nerd stuff.”

“These sessions are high-level narratives. Show me the code, guys! Devil’s in the details.”

At the same time, they would shower plaudits and congratulations on speakers who gave them what they wanted: something new to learn.

When the Twitter stream started to compare QCon’s activities with an event happening concurrently in the city, Cloud Expo, the nature of the attendees was draw into sharp relief:

“At #cloudexpo people used laptops during sessions to check email… At #qconsf they are writing code.”

When it comes to agile, SOA, DevOps, and other problems of the day, people are ready for answers.

Link to this post:

, , , ,

In my last post, I wrote about how using Business Transactions as a management unit is critical for managing modern-day applications efficiently. Sticking to this train of thought, I will focus on how this applies to various aspects of application management. The first area I want to cover is monitoring and troubleshooting.

In a highly distributed system, there are 100s of CPUs, 100s of JVMs/CLRs, and millions of lines of code running. Now, if you want to attach service levels to every component of that system, you would either be eyeballing dashboards most of the time or trying to maintaining the configuration associated with alerting.

As I mentioned in the last post, if the business grows, it will require more capacity and more infrastructure—which means newer pieces (and ones that are moving around rapidly). Add the configuration attached with lines of code and you can pick either “daunting” or “impossible” to describe the task at hand. Long testing and staging cycles have become a thing of the past.

At the same time, the DevOps tribe is adapting to and embracing the new application landscape rapidly. Their biggest need is the need for speed, which translates into efficiency driven by intelligence in every aspect of application management in production. This is where using Business Transactions can make monitoring more efficient and easier to accomplish.

But wait a minute – am I really suggesting that you watch Business Transaction service levels (which are your business’ bottom line) instead of getting overwhelmed with alerts based on 1000s of key metrics? Doesn’t that mean you are ignoring things that can be going wrong by not giving them enough attention?

Au contraire! In fact, you can actually focus on more with this approach versus traditional monitoring techniques. Let me explain.

Traditional monitoring has all been about averages. You look at some service or some method and its average over time, then set up alerts associated with it. Doing so is great for catching slow performance degradation or systemic outages. But it won’t help you catch outliers where there is no pattern associated with errors or slow requests.

Let’s look at a couple of examples to understand this better. For brevity I will talk only about slow requests (but the same argument can also be applied to errors).

1) A frequent cause of slow requests, resulting in a poor user experience, pertains to a transaction associated with user input.  For example, in a shopping cart application, the user might add a particular item to his or her cart—which results in the application slowing down.

2) Here’s another one. Sometimes one particular node, which is part of a big cluster of say 50 nodes, has an issue in servicing requests but might be doing ok on CPU and memory usage.

In both of these cases, averages would hide the problem. Here’s an example.

Scenario
Over a period of time, an online checkout application experienced 5,000 Total Checkouts. Of those checkouts, 4,800 were Normal Checkouts and 200 were Slow Checkouts.

In cases like these, it is very likely that the average response time of the normal transaction—along with that of the bad transactions—averages out to a very normal rate. But the bottom line is that 200 users had a bad checkout, and that needs to be fixed. By focusing on business transactions only and reducing the points of monitoring, you are actually able to identify and address more performance concerns than you might have otherwise.

So what exactly are we watching here?

a) Response time for a transaction – this is averaged over all requests for this transaction. We don’t need to watch the key methods being executed in the request since the overall response time for the request is the single indicator. A method-level drill down is needed only when the response time is slow.

b) Number of slow requests over time – If we set up thresholds over the transaction and watch every request, we are able to identify exactly how many requests were outliers. Of course, for this to be useful, the system needs to be able to collect diagnostic information as slow requests happen and not afterwards. This also ensures that when the request is being serviced by a problematic node, it is represented appropriately.

We’re not the only ones who believe that monitoring Business Transactions is critical to better performance in production. The customers and prospects we speak to are serious about having a system in place that can watch all requests instead of just watching averages in their system. It’s becoming a real-world demand with real-world benefits.

The last thing I want to mention before signing off, is that AppDynamics does all of the above. We follow all requests to monitor SLAs—not just averages, but actual numbers—and we do so at an extremely low overhead, enabling us to jump in and get diagnostic data as bad requests happen!

That wasn’t too much of a plug, right? Until next time…

Link to this post:

, , ,

This summer, I had a chance to sit on a panel at DevOps Days with a few industry colleagues to discuss the role of monitoring, testing, and performance in solving DevOps issues. It was a lively discussion on the teams, tools, and techniques that play a role in managing application and network performance.

Read the Full Post…

Link to this post:

, , ,

Steve Roop

AppDynamics Lite Turns 5000!

It’s a big day for us here at AppDynamics as we mark (and pass!) the 5000th download of AppDynamics Lite! Released just a few months ago, AppDynamics Lite is our free APM tool for troubleshooting Java performance in production.  We hear every day from companies who have deployed Lite in production environments and are using it to solve real problems.  That’s why we built it, and it’s very rewarding to hear about IT Operations professionals and developers getting real value.

We’re very pleased with the wave of enthusiasm that the tool has been able to generate in such a short time.   Here’s just one example: one of our sales reps was speaking to a company this morning, and he asked how they had heard of us.  The prospect said, “I downloaded and used your Lite tool, which my team thought was phenomenal. One of our business requirements was that our potential vendors could give us immediate access to their product and demonstrate their unique approach to application performance management.  You met this requirement–no one else we talked to did. And therefore, you immediately made our short list.”

This conversation represents another goal of ours: removing the friction that typically exists between software companies and evaluators. With Lite, companies can download it whenever they want and put it immediately into action.  No waiting, no negotiating…just instant access.  This strategy works for us because, as a startup, we believe that if someone becomes familiar with our solution, they’ll see that it’s considerably better than the legacy offerings on the market.

The success of the tool continues to demonstrate that there’s unquestionably a need from both dev and ops teams to have greater insight and understanding of performance as they deploy applications across cloud, virtual and physical environments. Companies big and small are taking advantage of AppDynamics Lite already; you might even recognize some of the names (Ikea, Nokia, MasterCard, Dell, FranklinCovey and Samsung among others).

The great thing about AppDynamics Lite (besides the fact that it takes two minutes to download and it’s free), is that with thousands of users worldwide we’ve been able to gather a lot of product feedback in just a few months. Lite has now been deployed on every conceivable combination of Java application servers and application stacks. This means our core agent technology has been battle-tested in many environments, and we’ve been able to incorporate all that feedback into our commercial product. This is quite different than the usual slow feedback loop that exists in traditional enterprise software go-to-market approach. The benefit to our customers and partners is that our technology is mature beyond its years.

As we’ve always promised, we’ll continue to offer AppDynamics Lite for free. Our goal is to empower developers and IT operations teams to combat one-off performance issues as they occur, while giving them a first-hand look at the AppDynamics approach to application performance management.
Link to this post:

, , , , , ,

This post is part of our series on the Top Application Performance Challenges.

Over the past couple of weeks we helped a few DevOps groups troubleshoot performance problems in their production Java-based applications. While the applications themselves bore little resemblance to one another, they all shared one common problem. They were full of runtime exceptions.

When I brought the exceptions to the attention of some of the teams I got a variety of different answers.

  • “Oh yeah, we know all about that one, just some old debug info … I think”
  • “Can you just tell the APM tool to just ignore that exception because it happens way too much”
  • “I have no idea where that exception comes from, I think it’s the vendor/contractor/off shore team’s/someone else’s code”
  • “Sometimes that exception indicates an error, sometimes we use it for flow control”

The response was the same – because the exception did not affect the functionality of (read: did not break) the application it was deemed unimportant. But what about performance?

In one high-volume production application (100’s of calls/second) we observed an NPE (Null Pointer Exception) rate of more than 1000 exceptions per minute (17,800/15 min). This particular NPE occurred in the exact same line of code and propagated some 20 calls up in the stack before a catch clause handled it. The pseudo code for the line read if the string equaled null and the length of the trimmed string was zero then do x otherwise just exit. When we read the line out loud the problem was immediately obvious – you cannot trim a null string. While the team agreed that they needed to fix the code, they remained skeptical that simply changing ‘=’ to ‘!’ would really improve performance.

Using basic simulations we estimate this error rate imposes a 3-4% slow down on basic response time at a minimum. You heard me, 3-4% slow down at a minimum.

And these response issues get compounded in multi-threaded application server environments. When multiple pooled threads slow down to cope with error handling then the application has fewer resources available to process new requests. At the JVM level the errors cause more IO, more objects created in the heap, more garbage collection. In a multi-threaded environment the error threatens more than response times, it threatens load capacity.

And if the application is distributed across multiple JVM’s … well you get the picture.

The web contains many articles and discussions on the performance impact of Java’s exception handling. The examples in these posts provide sufficient data to show that a real performance penalty exists for throwing and catching exceptions. Doing the fix really does improve the performance!

Runtime Exceptions happen. When they occur frequently, they do appreciably slow down your application. The slowness becomes contagious to all transactions being served by the application. Don’t mute them. Don’t ignore them. Don’t dismiss them. Don’t convince yourself they are harmless. If you want a simple way to improve your applications performance, start by fixing up these regular occurring exceptions. Who knows, you just might help everyone’s code run a little faster.

Link to this post:

, , , , , , ,

<< Latest posts

Older posts >>