TAG | BTM

Last week I flew into Las Vegas for #Interop fully suited and booted in my big blue costume (no joke). I’d been invited to speak in a vendor debate on User eXperience (UX): Monitor the Application or the Network? NetScout represented the Network, AppDynamics (and me) represented the Application, and “Compuware dynaTrace Gomez” sat on the fence representing both. Moderating was Jim Frey from EMA, who did a great job introducing the subject, asking the questions and keeping the debate flowing.

At the start each vendor gave their usual intro and company pitch, followed by their own definition of what User Experience is.

Defining User Experience

So at this point you’d probably expect me to blabber on about how application code and agents are critical for monitoring the UX? Wrong. For me, users experience “Business Transactions”–they don’t experience applications, infrastructure, or networks. When a user complains, they normally say something like “I can’t Login” or “My checkout timed out.” I can honestly say I’ve never heard them say “The CPU utilization on your machine is too high” or “I don’t think you have enough memory allocated.”

Now think about that from a monitoring perspective. Do most organizations today monitor business transactions? Or do they monitor application infrastructure and networks? The truth is the latter, normally with several toolsets. So the question “Monitor the Application or the Network?” is really the wrong question for me. Unless you monitor business transactions, you are never going to understand what your end users actually experience.

Monitoring Business Transactions

So how do you monitor business transactions? The reality is that both Application and Network monitoring tools are capable, but most solutions have been designed not to–just so they provide a more technical view for application developers and network engineers. This is wrong, very wrong and a primary reason why IT never sees what the end user sees or complains about. Today, SOA means applications are more complex and distributed, meaning a single business transaction could traverse multiple applications that potentially share services and infrastructure. If your monitoring solution doesn’t have business transaction context, you’re basically blind to how application infrastructure is impacting your UX.

The debate then switched to how monitoring the UX differs from an application and network perspective. Simply put, application monitoring relies on agents, while network monitoring relies on sniffing network traffic passively. My point here was that you can either monitor user experience with the network or you can manage it with the application. For example, with network monitoring you only see business transactions and the application infrastructure, because you’re monitoring at the network layer. In contrast, with application monitoring you see business transactions, application infrastructure, and the application logic (hence why it’s called application monitoring).

Monitor or Manage the UX?

Both application and network monitoring can identify and isolate UX degradation, because they see how a business transaction executes across the application infrastructure. However, you can only manage UX if you can understand what’s causing the degradation. To do this you need deep visibility into the application run-time and logic (code). Operations telling a Development team that their JVM is responsible for a user experience issue is a bit like FedEx telling a customer their package is lost somewhere in Alaska. Identifying and isolating pain is useful, but one could argue it’s pointless without being able to manage and resolve the pain (through finding the root cause).

NetScout made the point that with network monitoring you can identify common bottlenecks in the network that are responsible for degrading the UX. I have no doubt you could, but if you look at the most common reason for UX issues, it’s related to change–and if you look at what changes the most, it’s application logic. Why? Because Development and Operations teams want to be agile, so their applications and business remain competitive in the marketplace. Agile release cycles mean application logic (code) constantly changes. It’s therefore not unusual for an application to change several times a week, and that’s before you count hotfixes and patches. So if applications change more than the network, then one could argue application monitoring is the more effective way to monitor and manage the end user experience.

UX and Web Applications

We then debated which monitoring concept was better for web-based applications. Obviously, network monitoring is able to monitor the UX by sniffing HTTP packets passively, so it’s possible to get granular visibility on QoS in the network and application. However, the recent adoption of Web 2.0 technologies (Ajax, GWT, Dojo) means application logic is now moving from the application server to the user’s browser. This means browser processing time becomes a critical part of the UX. Unfortunately, network monitoring solutions can’t monitor browser processing latency (because they monitor the network), unlike application monitoring solutions that can use techniques like client-side instrumentation or web-page injection to obtain browser latency for the UX.

The C Word

We then got to the Cloud and which approach made more sense for monitoring UX there. Well, network monitoring solutions are normally hardware appliances which plug directly into a network tap or span port. I’ve never asked, but I’d imagine the guys in Seattle (Amazon) and Redmond (Windows Azure) probably wouldn’t let you wheel a network monitoring appliance into their data-centre. More importantly, why would you need to if you’re already paying someone else to manage your infrastructure and network for you? Moving to the Cloud is about agility, and letting someone else deal with the hardware and pipes so you can focus on making your application and business competitive. It’s actually very easy for application monitoring solutions to monitor UX in the cloud. Agents can piggyback on application code libraries when they’re deployed to the cloud, or cloud providers can embed and provision vendor agents as part of their server builds and provisioning process.

What’s also interesting is that Cloud is highlighting a trend towards DevOps (or NoOps for a few organizations) where Operations become more focused on applications vs infrastructure. As the network and infrastructure become abstracted in the Public Cloud, the focus naturally shifts to the application and deployment of code. For private clouds you’ll still have network Ops and Engineering teams that build and support the Cloud platform, but they wouldn’t be the people who care about user experience. Those people would be the Line of Business or application owners whom the UX impacts.

In reality most organizations today already monitor the application infrastructure and network. However, if you want to start monitoring the true UX, you should monitor what your users experience, and that is business transactions. If you can’t see your users’ business transactions, you can’t manage their experience.

What are your thoughts on this?

AppDynamics is an application monitoring solution that helps you monitor business transactions and manage the true user experience. To get started, sign up for a 30-day free trial here.

I did have an hour spare at #Interop after my debate to meet and greet our competitors, before flying back to AppDynamics HQ. It was nice to see many of them meet and greet the APM Caped Crusader.

App Man.


The most enjoyable part of my job at AppDynamics is to witness and evangelize customer success. What’s slightly strange is that for this to happen, an application has to slow down or crash.

It’s a bittersweet feeling when End Users, Operations, Developers and many Businesses suffer application performance pain. Outages cost the business money, but sometimes they cost people their jobs–which is truly unfortunate. However, when people solve performance issues, they become overnight heroes with a great sense of achievement, pride, and obviously relief.

To explain the complexity of managing application performance, imagine your application is 100 haystacks that represent tiers, and somewhere a needle is hurting your end user experience. It’s your job to find the needle as quickly as possible! The problem is, each haystack has over half a million pieces of hay, and they each represent lines of code in your application. It’s therefore no surprise that organizations can take days or weeks to find the root cause of performance issues in large, complex, distributed production environments.

End User Experience Monitoring, Application Mapping and Transaction profiling will help you identify unhappy users, slow business transactions, and problematic haystacks (tiers) in your application, but they won’t find needles. To do this, you’ll need x-ray visibility inside haystacks to see which pieces of hay (lines of code) are holding the needle (root cause) that is hurting your end users. This X-Ray visibility is known as “Deep Diagnostics” in application monitoring terms, and it represents the difference between isolating performance issues and resolving them.

For example, AppDynamics has great End User Monitoring, Business Transaction Monitoring, Application Flow Maps and very cool analytics all integrated into a single product. They all look and sound great (honestly they do), but they only identify and isolate performance issues to an application tier. This is largely what Business Transaction Management (BTM) and Network Performance Management (NPM) solutions do today. They’ll tell you what and where a business transaction slows down, but they won’t tell you the root cause so you can resolve the issues.

Why Deep Diagnostics for Production Monitoring Matters

A key reason why AppDynamics has become very successful in just a few years is that our Deep Diagnostics, behavioral learning, and analytics technology is 18 months ahead of the nearest vendor. A bold claim? Perhaps, but it’s backed up by bold customer case studies such as Edmunds.com and Karavel, who compared us against some of the top vendors in the application performance management (APM) market in 2011. Yes, End User Monitoring, Application Mapping and Transaction Profiling are important–but these capabilities will only help you isolate performance pain, not resolve it.

AppDynamics has the ability to instantly show the complete code execution and timing of slow user requests or business transactions for any Java or .NET application, in production, with incredibly small overhead and no configuration. We basically give customers a metal detector and X-Ray vision to help them find needles in haystacks. Locating the exact line of code responsible for a performance issue means Operations and Developers solve business pain faster, and this is a key reason why AppDynamics technology is disrupting the market.

Below is a small collection of needles that customers found using AppDynamics in production. The simple fact is that complete code visibility allows customers to troubleshoot in minutes as opposed to days and weeks. Monitoring with blind spots and configuring instrumentation are a thing of the past with AppDynamics.

Needle #1 – Slow SQL Statement

Industry: Education
Pain: Key Business Transaction with 5 sec response times
Root Cause: Slow JDBC query with full-table scan

Needle #2 – Slice of Death in Cassandra

Industry: SaaS Provider
Pain: Key Business Transaction with 2.5 sec response times
Root Cause: Slow Thrift query in Cassandra

Needle #3 – Slow & Chatty Web Service Calls

Industry: Media
Pain: Several Business Transactions with 2.5 min response times
Root Cause: Excessive Web Service Invocation (5+ per trx)

Needle #4 – Extreme XML processing

Industry: Retail/E-Commerce
Pain: Key Business Transaction with 17 sec response times
Root Cause: XML serialization over the wire.

Needle #5 – Mail Server Connectivity

Industry: Retail/E-Commerce
Pain: Key Business Transaction with 20 sec response times
Root Cause: Slow Mail Server Connectivity

Needle #6 – Slow ResultSet Iteration

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 30+ sec response times
Root Cause: Querying too much data

Needle #7 – Slow Security 3rd Party Framework

Industry: Education
Pain: All Business Transactions with > 3 sec response times
Root Cause: Slow 3rd party code

Needle #8 – Excessive SQL Queries

Industry: Education
Pain: Key Business Transactions with 2 min response times
Root Cause: Thousands of SQL queries per transaction

Needle #9 – Commit Happy

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 25+ sec response times
Root Cause: Unnecessary use of commits and transaction management.

Needle #10 – Locking under Concurrency

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 5+ sec response times
Root Cause: Non-Thread safe cache forces locking for read/write consistency

Needle #11 – Slow 3rd Party Search Service

Industry: SaaS Provider
Pain: Key Business Transaction with 2+ min response times
Root Cause: Slow 3rd Party code

Needle #12 – Connection Pool Exhaustion

Industry: Financial Services
Pain: Several Business Transactions with 7+ sec response times
Root Cause: DB Connection Pool Exhaustion caused by excessive connection pool invocation & queries

Needle #13 – Excessive Cache Usage

Industry: Retail/E-Commerce
Pain: Several Business Transactions with 50+ sec response times
Root Cause: Cache Sizing & Configuration

If you want to manage and troubleshoot application performance in production, you should seriously consider AppDynamics. We’re the fastest growing on-premise and SaaS based APM vendor in the market right now. You can download our free product AppDynamics Lite or take a free 30-day trial of AppDynamics Pro – our commercial product.

Now go find those needles that are hurting your end users!

App Man.


App Man

Code Deadlock – A Usual Suspect

Imagine you’re an operations guy and you’ve just received a phone call or alert notifying you that the application you’re responsible for is running slow. You bring up your console, check all related processes, and notice one java.exe process isn’t using any CPU but the other Java processes are. The average sys admin at this point would just kill and restart the Java process, cross their fingers, and hope everything returns to normal (this actually does work most of the time). An experienced sys admin might perform a kill -3 on the Java process, capture a thread dump, and pass this back to dev for analysis. Now suppose your application returns to normal–end users stop complaining, you pat yourself on the back and beat your chest, and basically resume what you were doing before you were rudely interrupted.

The story I’ve just told may seem contrived, but I’ve witnessed it several times with customers over the years. The stark reality is that no one in operations has the time or visibility to figure out the real business impact behind issues like this. Therefore, little pressure is applied to development to investigate data like thread dumps so that root causes can be found and production slowdowns can be avoided in future. It’s true restarting a JVM or CLR will solve a fair few issues in production, but it’s only a temporary fix that masks the real problems within the application logic and configuration.

Now imagine for one minute that operations could actually figure out the business impact of production issues, along with identifying the root cause, and communicate this information to Dev so problems could be fixed rapidly. Sounds too good to be true, right? Well, a few weeks ago an AppDynamics customer did just that and the story they told was quite compelling.

Code Deadlock in a distributed E-Commerce Application

The customer application in question was a busy e-commerce retail website in the US. The architecture was heavily distributed with several hundred application tiers that included JVMs, LDAP servers, CMS server, message queues, databases and 3rd party web services. Here is a quick glimpse of what that architecture looked like from a high level:

Detecting Code Deadlock

If we look at the AppDynamics problem pane (right) as the customer saw things, it shows the severity of their issues. During the day the application was experiencing just over 4,000 business transactions per minute, which works out at just under 1 million transactions a day. Approximately 2.5% of these transactions were impacted by the slowdown, which was the result of the 92 code deadlocks you see here that occurred during peak hours.

AppDynamics is able to dynamically baseline the performance of every business transaction type before classifying each execution as normal, slow, very slow or stalled depending on its deviation from its unique performance baseline. This is critical for understanding the true business impact of every issue or slowdown because operations can immediately see how many user requests were impacted relative to the total requests being processed by the application.

From this pane, operations were able to drill down into the 92 code deadlocks and see the events that took place as each code deadlock occurred. As you can see from the screenshot (below left), the sys admins during the slowdown kept restarting the JVMs (as shown) to try and make the issues go away. Unfortunately, this didn’t work given that the application was experiencing high concurrency under peak load.

By drilling into each Code Deadlock event, operations were able to analyze the various thread contentions and locate the root cause of the issue. The root cause of the slowdown turned out to be an application cache which wasn’t thread-safe. If you look at the screenshot below, showing the final execution of the threads in deadlock accessing the cache, you can see that one thread was trying to remove an item, another was trying to get an item, and the last thread was trying to put an item. Three threads were trying to do a put, get and remove at the same time! This caused a deadlock on cache access, which hung the related JVM until those threads were released via a restart.
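To make the fix concrete, here is a minimal sketch of a cache whose get, put and remove can’t trample each other. It’s written in Python purely for illustration (the customer’s application was Java), and the names `SafeCache` and `hammer` are hypothetical, not taken from the customer’s code. The idea is simply that one lock serializes all three operations on the underlying map.

```python
import threading


class SafeCache:
    """Hypothetical cache: a single lock serializes get, put and remove."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            return self._items.get(key, default)

    def put(self, key, value):
        with self._lock:
            self._items[key] = value

    def remove(self, key):
        with self._lock:
            return self._items.pop(key, None)


def hammer(cache, worker_id, rounds=1000):
    # Each worker mixes put, get and remove: the same trio seen in the deadlock.
    for i in range(rounds):
        key = (worker_id, i)
        cache.put(key, i)
        cache.get(key)
        cache.remove(key)


cache = SafeCache()
threads = [threading.Thread(target=hammer, args=(cache, n)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every worker removed what it put, so the cache should end up empty.
remaining = len(cache._items)
```

A single coarse lock is the simplest option and trades some throughput for safety. In Java the usual starting points would be a concurrent map such as `ConcurrentHashMap` or a synchronized wrapper, but which one suits a given cache depends on its access pattern.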

Analyzing Thread Dumps

Below you can see the thread dump that AppDynamics collected for one of the code deadlocks, which clearly shows where each thread was deadlocked. By copying the full thread dumps to clipboard, operations were able to see the full stack trace of each thread, thus identifying which business transactions, classes, and methods were responsible for cache access.

The root cause for this production slowdown may have been identified and passed to dev for resolution, but the most compelling conclusion from this customer story was related to them identifying the real business impact that occurred. The application was clearly running slow, but what did the end user experience during the slowdown and what impact would this have had on the business?

What was the Actual Business Impact?

The screenshot below shows all business transactions that were executing on the e-commerce web application during the five hour window before, during, and after the slowdown occurred.

Here are some hard hitting facts for the two most important business transactions inside this e-commerce application:

  • 46,463 Checkouts processed
    • 482 returned an error, 1,325 were slow, 576 were very slow and 111 stalled
  • 3,956 Payments processed
    • 12 returned an error, 242 were slow, 96 were very slow and 79 stalled

Error – transaction failed with an exception. Slow – the business transaction deviated from its baseline by more than 3 standard deviations. Very Slow – the business transaction deviated from its baseline by more than 4 standard deviations. Stalled – the transaction timed out.
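The classification rules above can be sketched in a few lines. This is an illustration only, not AppDynamics’ actual algorithm: the function name `classify`, its parameters, and the sample baseline figures are all assumptions, and the baseline is reduced to a plain mean and standard deviation.

```python
def classify(response_ms, baseline_mean_ms, baseline_std_ms,
             timed_out=False, raised_exception=False):
    """Label one transaction execution using the thresholds quoted above."""
    if raised_exception:
        return "error"
    if timed_out:
        return "stalled"
    # How many standard deviations above its own baseline did this run land?
    # (Assumes baseline_std_ms is greater than zero.)
    deviations = (response_ms - baseline_mean_ms) / baseline_std_ms
    if deviations > 4:
        return "very slow"
    if deviations > 3:
        return "slow"
    return "normal"


# Made-up baseline for a checkout transaction: mean 800 ms, std dev 100 ms.
samples = [850, 1150, 1250]
labels = [classify(ms, 800, 100) for ms in samples]
print(labels)  # ['normal', 'slow', 'very slow']
```

Because each business transaction type is measured against its own baseline, a 1,150 ms checkout can count as slow while a 1,150 ms report, with a different baseline, might be perfectly normal.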

If you take these raw facts and assume the average revenue per order is $100, then the potential revenue risk/impact of this slowdown was easily into six digits when you consider the end user experience for checkout and payment. Even if you take the 482 Errors and 111 Stalls relating to the Checkout business transaction alone – this still equates to around $60,000 of revenue at risk. And that’s a fairly conservative estimate!
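For anyone who wants to check the sums, here is the arithmetic as a short Python sketch. The counts come from the figures listed above and the $100 average order value is the assumption already stated. The variable names are mine, and treating every impacted execution as a separate order at risk is a simplification, since a Checkout and a Payment may belong to the same order.

```python
AVERAGE_ORDER_VALUE = 100  # dollars, the assumption made in the text

checkout = {"error": 482, "slow": 1325, "very slow": 576, "stalled": 111}
payment = {"error": 12, "slow": 242, "very slow": 96, "stalled": 79}

# Conservative view: only Checkout executions that failed outright or timed out.
checkout_failed = checkout["error"] + checkout["stalled"]
conservative_risk = checkout_failed * AVERAGE_ORDER_VALUE

# Wider view: every impacted Checkout and Payment execution.
impacted = sum(checkout.values()) + sum(payment.values())
wider_risk = impacted * AVERAGE_ORDER_VALUE

print(f"Checkout errors + stalls: {checkout_failed} -> ${conservative_risk:,}")
print(f"All impacted executions: {impacted} -> ${wider_risk:,}")
```

That gives 593 failed or stalled Checkouts, or $59,300 (the “around $60,000” above), and 2,923 impacted executions in total, or $292,300, which is where the six-digit figure comes from.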

If you add up all the errors, slow, very slow and stalled transactions you see in the screenshot above, you start to picture how serious this issue was in production. The harsh reality is that incidents like this happen every day in production environments, but no one has visibility into the true business impact of them, meaning little pressure is applied to development to fix “glitches.”

Agile isn’t about Change, It’s about Results

If development teams want to be truly agile, they need to forget about constant change and focus on what impact their releases have on the business. The next time your application slows down or crashes in production, ask yourself one question: “What impact did that just have on the business?” I guarantee just thinking about that answer will make you feel cold. If development teams found out more often the real business impact of their work, they’d learn pretty quickly how fast, reliable and robust their application code really is.

I’m pleased to say no developers were injured or fired during the making of this real-life customer story; they were simply educated on what impact their non-thread-safe cache had on the business. Failure is OK–that’s how we learn and build better applications.

App Man.


We recently finished conducting our annual Application Performance Management survey. Over 250 IT professionals participated, and they shared insights such as:
- Many Ops and Dev teams are anticipating growth in their applications by 20% or more
- Over 50% are planning to move to the cloud, and are architecting brand-new applications to be cloud-ready
- Most teams are using log files to monitor application performance, rather than an Application Performance Management (APM) tool.

We’ll release the full report soon, but here’s an infographic that summarizes some of the main findings:

AppDynamics Infographic - Storm Clouds in 2012


What I found personally surprising was the heavy reliance on log files. When you’re troubleshooting distributed architectures, time is of the essence–and there’s no way to cut your MTTR down when you’re relying on log files to identify root cause.

In fact, there’s only one guy who ever made using a log file look cool:

And I think we can all agree that’s a pretty unique use case.

We’ll have the full survey results available soon.

 

 


People in our industry always talk about IT complexity and cost. Cost is pretty easy to calculate, because IT budgets are allocated and audited every year. Complexity is very different–we know it exists, but we can’t really see or measure it. We usually notice complexity when our brain tries to understand something and stalls in the process, trying to make sense of information it has never seen before.

Well, this happened to a few of us in AppDynamics last week. A customer was kind enough to share how a single login business transaction flowed across their entire infrastructure. You might be thinking: “How can a login transaction be complex? That’s just a simple call to an LDAP or SiteMinder tier”–which is pretty much what we all thought it was. However, the screenshot that graced us was one of shock, beauty and amazement. In fact, I’m looking at it right now before I scrub the customer details, and I’m still thinking “Hmmmm, this is bonkers.”

Without delaying further, here is that very screenshot showing the Login Business Transaction:

Scary huh? What you see is the flow and timing of a Customer Login business transaction as it executes across a well governed, regulated, SOA environment consisting of many services (denoted by the Java Tiers). The Customer Login transaction begins at the Java node to the right marked “START” and propagates across the entire SOA environment using a combination of sync/async JMS messages, HTTP and RMI communication to notify other Services that a customer is now active and logged in. You can also see many services writing to a database as a result of this transaction. These invocations are simply auditing the customer login to satisfy the legal regulations that this organization has to comply with. So if you ever wonder what impact Governance and Legislation have on IT, this is a perfect example of the complexity storm they create. What’s interesting is that the Logout business transaction for this application was just as complex!

The screenshot above unfortunately reflects the enormous complexity that many IT departments have to deal with every day, especially when a user complains that their business transaction is slow. The problem for 95% of IT departments is they don’t have this type of visibility in production. They can feel pain, but they can’t see it. A slow business transaction may take 25 seconds to complete and touch many infrastructure tiers along the way. Unless IT sees this end-to-end journey they’ll always struggle to troubleshoot and manage it.

The good news is you’re 30 minutes away from getting this visibility in production by evaluating a next generation application monitoring solution like AppDynamics Pro. AppDynamics will auto-discover your business transactions, map their specific flows across your infrastructure, and give you a latency breakdown across and inside every tier the business transaction touches.

To manage and master IT complexity you have to visualize and see it.  Seeing how your business actually runs across IT is completely different to guessing how your business runs across IT. Next time a user complains that their business transaction is slow, what will you do? Bury your head in a log file, or visualize how that business transaction executed using an application performance monitoring solution like AppDynamics?

Isn’t it about time you mapped your app?

App Man.


Peter Drucker proclaimed: “If you can’t measure it, you can’t manage it.” Do you know what’s “normal” for your mission-critical application? Actually, wait a second–with Halloween having just finished up,  maybe the following Young Frankenstein reference is more appropriate. Whenever I focus on the word “normal,” the first thing that pops into my head (pardon the pun) is that famous scene from Young Frankenstein:

DR. FREDERICK FRANKENSTEIN: Abby Normal?

IGOR: I’m almost sure that was the name.

DR. FREDERICK FRANKENSTEIN: [chuckles] Are you saying that I put an abnormal brain into a seven and a half foot long, fifty-four inch wide GORILLA?

[grabs Igor and starts throttling him]

DR. FREDERICK FRANKENSTEIN: Is that what you’re telling me?


Hugh Brien

Software that “Just Works”

This is my first blog. I’ve been a sales engineer for three application performance management (APM) products over the last 7 years (CA Wily Introscope, SpringSource/Hyperic, and now AppDynamics). I hadn’t really considered myself much of a “blogger” because I have always thought that actions speak louder than words. So you may be wondering why I would start now. I guess you could say I was inspired by a recent experience at a customer site. It was quite a bit different from what I’ve been used to in my earlier APM career.

We recently had an experience with a customer who called and inquired about AppDynamics for monitoring several of their mission critical applications. As always our sales team kicked into high gear and had a conversation with the customer. In less than 15 minutes we agreed to start a Proof of Concept for AppDynamics running on a critical application. Later that day, we set up an online conference with the customer and commenced an installation that took about 15 minutes.


The majority of us in IT are specialists, with the exception of a few VPs of engineering who are “special” in their own “special” world of being “special.” What I mean by this is that no single person has the skills or experience to do everything well in IT. IT is too big for me to explain or summarize in a few words, other than it requires a lot of different people with different skills to make it tick along. Despite applications being the living breathing entities of the business, a large portion of folk in IT have little context of how applications are built, how they execute, and how they consume resource across the IT infrastructure. Many people simply don’t care as their responsibilities are completely void of anything application related. That’s fine–but the reality is that everyone in IT should have one eye on the business. The whole reason IT exists is so the business can be more competitive and make more money. If this happens, IT gets more budget and is allowed to innovate more. IT and the business need each other to survive, which is why when applications slow down or break, both parties bitch at each other.

Operations need better visibility

Unfortunately for both the business and IT, the people (Operations) who manage the performance and availability of applications in production aren’t application experts. They are not stupid either; their skill sets are wide and broad across many technologies and platforms that underpin applications. They manage a lot of things that application developers take for granted, like networks, databases, storage and virtualization. While Operations monitor the health of these infrastructure components, they often get bombarded with crap from the business when end users and business transactions are being impacted by slow performance, despite all system monitoring showing everything is fine. This lack of understanding between the Business and Operations is because both parties see things from different perspectives.


In May last year we launched AppDynamics Lite 1.0, the first free application performance management (APM) solution to monitor and troubleshoot a production JVM. 18 months and 50,000+ users later I’m pleased to announce version 2 of AppDynamics Lite is here and the innovation hasn’t stopped. In fact, I would say Lite 2.0 gives many legacy (or vintage) APM vendors a run for their money. Five years ago the standard for any APM solution was the ability to perform Byte Code Instrumentation for logging application response times along with some JMX metric collection to monitor JVM and container resource. Lite 2.0 does all this, albeit limited to a single JVM, which is one more JVM than any other APM vendor will give you for free.

Here is an example of what AppDynamics Lite 2.0 looks like today when you monitor a single JVM.


Steve Waterworth

Agent Intelligence

How intelligent is your monitoring agent?

To keep its impact on application performance minimal, the agent should not do too much processing locally, and it should use the smallest CPU and memory footprint possible. On the other hand, offloading some processing to the agent results in less network traffic and better scalability for the monitoring management server.
