There are many technical articles/blogs on the web that jump straight into areas of .NET code you can instantly optimize and tune. Before we get to some of those areas, it’s good to take a step back and ask yourself, “Why am I here?” Are you interested in tuning your app, which is slow and keeps breaking, or are you looking to prevent these things from happening in the future? When you start down the path of Application Performance Management (APM), it is worth asking yourself another important question – what is success? This is especially important if you’re looking to tune or optimize your application. Knowing when to stop is as important as knowing when to start.
A single code or configuration change can have a dramatic impact on your application’s performance. It’s therefore important that you only change or tune what you need to – less is often more when it comes to improving application performance. I’ve been working with customers in APM for over a decade and it always amazes me how dev teams will browse through packages of code and rewrite several classes/methods at the same time with no real evidence that what they are changing will actually make an impact. For me, I learned the most about writing efficient code in code reviews with peers, despite how humbling it was. What I lacked the most as a developer, though, was visibility into how my code actually ran in a live production environment. Tuning in development and test is not enough if the application still runs slow in production. When manufacturers design and build cars they don’t just rely on simulation tests – they actually monitor their cars in the real world. They drive them for hundreds of thousands of miles to see how their cars will cope in all conditions they’ll encounter. It should be the same with application performance. You can’t simulate every use case or condition in dev and test, so you must understand your application performance in the real world.
With this in mind, here are a few tips for you:
Tip 1 – Understand how your Application runs in Production
Demand this visibility, because it’s the best data you can find for really understanding how your application performs and utilizes resources. By visibility, I mean the ability to physically see your application and how its performance (latency) is broken down across all its components and dependencies. You can’t manage what you can’t see; you must see the bigger picture as well as the small picture if you want to really understand application performance effectively.
For example, suppose you had the following performance metrics for .NET CLRs that related to your application in production:
This is the classic ops view of the world, giving you KPI metrics on how the infrastructure is performing. From this data one might assume that everything looks OK and healthy. Now imagine you had the following visualization of the application instead:
Two different perspectives of application performance tell you two very different stories. The key point here is that your application isn’t just made up of .NET CLR instances. It’s made up of other tiers like LDAP servers, database servers, message queues and remote web services. All these tiers can and will affect your application performance at some time, so visibility beyond the CLR is key. A high-level view like this means you can easily visualize application performance and make important decisions about where (and where not) to optimize and tune. Starting at a low level (e.g. class/method invocations) is often where most people go wrong because they can’t see the forest for trees and weeds.
Tip 2 – Know how Application Performance impacts your business
While it’s important to know how fast your code runs, it’s equally important to understand what impact your code has on business transactions and the users that invoke them. You only have finite time and resources which means you have to prioritize where and what you optimize to improve your application performance. You might find that a particular namespace, class or method is taking a few hundred milliseconds to execute in your test environment. However, if that code is barely invoked in production, you can tune it till the cows come home and it will still have a minimal impact on application performance and the business as a whole.
Breaking down your application’s performance by business transaction dramatically helps you prioritize your efforts. You want to spend your time optimizing the application components that matter. When you look at application performance through the lens of a profiler in dev, you see things from a CLR runtime perspective. I guarantee you will always find “interesting things” to tune and analyze, whether it’s blocks of bloated code or scary nested SQL statements. The bottom line is that without a business context you won’t know whether your efforts are in vain or serve a just cause. Why waste 80% of your time tuning something that will have a 0.5% impact on application performance in production? Business transaction context lets you focus on the right things regarding your application performance. For example, imagine you had the following view, which shows the performance of every business transaction in your production application:
- Errors being thrown for “Orders Queue” business transactions
- “Submit Orders” business transaction has several slow requests
So, before diving into any code, first prioritize what and where you’re going to optimize as well as the baseline you’re working against. If you can’t measure success, you can’t manage it.
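One rough way to put numbers on that prioritization is to weight each business transaction's average response time by how often it is invoked. The sketch below is in Python purely for brevity, and every figure in it is hypothetical, as are the “Monthly Report” and “Search Catalog” transaction names. It is an illustration of the idea, not output from any APM tool.

```python
from dataclasses import dataclass


@dataclass
class TransactionStats:
    name: str
    calls_per_hour: int
    avg_response_ms: float

    @property
    def total_ms_per_hour(self) -> float:
        # Total time users spend waiting on this transaction each hour.
        return self.calls_per_hour * self.avg_response_ms


def rank_by_total_time(stats):
    """Order transactions by total time consumed, largest first."""
    return sorted(stats, key=lambda s: s.total_ms_per_hour, reverse=True)


# Hypothetical production figures, for illustration only.
stats = [
    TransactionStats("Submit Order", 12_000, 3_000.0),
    TransactionStats("Monthly Report", 2, 45_000.0),
    TransactionStats("Search Catalog", 90_000, 120.0),
]

ranked = rank_by_total_time(stats)
grand_total = sum(s.total_ms_per_hour for s in stats)
for s in ranked:
    share = 100 * s.total_ms_per_hour / grand_total
    print(f"{s.name:15} {share:5.1f}% of total time")
```

In this made-up data set the slowest transaction per call (“Monthly Report”) ends up last in the ranking because it is barely invoked, which is exactly the trap of tuning by response time alone.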
Tip 3 – Does latency impact some or all business transactions?
While using average response times is good for isolating where the majority of time is spent in your application, the next step is to understand what makes up the average. Therefore, you need visibility into individual requests or executions of business transactions. For example, if 9 “Submit Order” transactions took 100ms and 1 “Submit Order” transaction took 10 seconds, then the average response time would be 1.09 seconds, a figure that describes none of the ten requests. Relying on the average can therefore be misleading. Here is a view showing multiple executions of the “Submit Order” business transaction:
We see the lowest response time is 952ms, with the majority of transactions taking around 3 seconds. From this information, we can conclude that the “Submit Order” business transaction can benefit from tuning.
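To see how an average can hide an outlier, here is a minimal Python sketch using the ten timings from the averaging example earlier in this tip (nine requests at 100ms and one at 10 seconds). The 1-second threshold for “slow” is my own assumption for illustration.

```python
import statistics

# Nine fast "Submit Order" requests and one very slow one, in milliseconds.
response_times_ms = [100] * 9 + [10_000]

mean_ms = statistics.mean(response_times_ms)
median_ms = statistics.median(response_times_ms)

# Assumed threshold: anything over 1 second counts as a slow request.
slow_requests = [t for t in response_times_ms if t > 1_000]

print(f"mean:   {mean_ms} ms")
print(f"median: {median_ms} ms")
print(f"slow requests: {len(slow_requests)} of {len(response_times_ms)}")
```

The mean lands above one second while the median stays at 100ms, so the two numbers together tell you that a small number of requests are responsible for the latency rather than all of them.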
Tip 4 – Instrument Code Execution of Slow Business Transactions
Now that you know which business transactions impact your application performance, the next step is to understand how their code actually executes. A key problem today is that application code/logic is often distributed and split across multiple CLRs, services and tiers. A single business transaction may start with some simple ASP.NET/MVC logic before making several remote SOAP, WCF or ADO.NET calls to other CLRs, services and tiers for data. Therefore, the only way to understand business transaction latency is to use an APM tool to get a breakdown of latency for each step in its journey. For example, here is a view that shows the call stack of a “Submit Order” business transaction as it executes within a CLR.
This type of information gives you great visibility into which namespaces, classes and methods are responsible for latency and poor application performance. We can see from the screenshot that the areas to optimize for this transaction are the two Windows Communication Foundation (WCF) calls, which take 577ms and 1872ms respectively. We can also ignore the other three WCF calls and the ADO.NET database call, which takes 31ms to execute. For remote distributed calls like WCF, it’s important to get visibility into how these services execute in their respective CLRs. For example, here is how the above 1872ms WCF call executed in its CLR:
We see that nearly all the latency spent in the WCF service call is actually related to a remote web service call (shown at the bottom of the call stack). As an application developer, this information is both good and bad. It’s good in the sense that no code changes need to be made locally in the application, and bad that changes or investigations need to be made by the service provider of the web service (which is especially bad if it’s a 3rd party service provider). This scenario is often common in SOA environments where applications share logic and rely on applications and services being provided by other teams or providers.
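If you don’t have an APM tool to hand, the underlying idea of breaking a transaction into timed steps can be sketched manually. The Python snippet below is only an illustration: the `SegmentTimer` class and the step names are hypothetical, the `time.sleep` calls stand in for real work, and this is not how any particular APM product instruments code.

```python
import time
from contextlib import contextmanager


class SegmentTimer:
    """Records how long each named step of a transaction takes."""

    def __init__(self):
        self.segments = []  # list of (name, elapsed_ms) tuples

    @contextmanager
    def segment(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the timing even if the step raises an exception.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.segments.append((name, elapsed_ms))

    def slowest(self):
        return max(self.segments, key=lambda s: s[1])


timer = SegmentTimer()

# Hypothetical steps; the sleeps simulate local logic, a remote call and a query.
with timer.segment("validate order"):
    time.sleep(0.001)
with timer.segment("remote pricing call"):
    time.sleep(0.05)
with timer.segment("database insert"):
    time.sleep(0.002)

for name, elapsed_ms in timer.segments:
    print(f"{name:22} {elapsed_ms:8.1f} ms")
print("slowest step:", timer.slowest()[0])
```

Manual timing like this only covers a single process. It cannot follow a call into another CLR or a third-party service, which is where the distributed view described above earns its keep.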
Tip 5 – Understand Data Access Latency
The majority of processing that takes place in your application will be done inside a database like SQL Server. You might not know this if you’re used to just invoking one or more ad-hoc queries. The reality is that business transactions either store or retrieve data. As data volumes increase over time, the latency associated with data retrieval increases. This is why DBAs have a full-time job ensuring that their databases are optimally configured and tuned. Round trips to the database are expensive, as they normally involve making a remote call along with retrieving data from disk (which is slow). Controlling concurrency to the database is therefore key, which is why most applications use some form of connection pooling. The database is probably the last place you want contention or inefficiency, so be careful when adjusting your connection pool settings. Use of an APM solution is key to understanding how often and for how long your application is accessing the database. For example, the below screenshot shows the latency of application code (ADO.NET) accessing the database, which takes 78ms.
Seeing the SQL text of the query is helpful, especially when you want to understand what the query is doing relative to the time it’s spending in the database. If you ever see a SELECT *, be sure to slap the person responsible with a wet fish. Also, watch out for statements with high query counts that hit the database multiple times per business transaction execution. For example, a query might take 5ms to execute, but if a business transaction invokes this query 500 times per execution, then there are 2.5 seconds of latency spent going backwards and forwards to the database. Performing a single database hit could reduce that 2.5 seconds by more than a factor of 10. The database is a precious resource: treat it like a genie and only ask questions when you need to.
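The difference between 500 small queries and one batched query can be sketched as follows. This Python example uses an in-memory SQLite database as a stand-in for SQL Server, and the `order_items` table and both helper functions are hypothetical. Because nothing crosses a network here, the point is the round-trip count rather than the measured time.

```python
import sqlite3

# In-memory SQLite stands in for a real database server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany(
    "INSERT INTO order_items (id, price) VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 501)],
)


def fetch_one_by_one(conn, ids):
    """One query per item: as many round trips as there are ids."""
    prices, round_trips = {}, 0
    for item_id in ids:
        row = conn.execute(
            "SELECT price FROM order_items WHERE id = ?", (item_id,)
        ).fetchone()
        round_trips += 1
        prices[item_id] = row[0]
    return prices, round_trips


def fetch_batched(conn, ids):
    """A single parameterized IN query: one round trip for all ids."""
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, price FROM order_items WHERE id IN ({placeholders})",
        list(ids),
    ).fetchall()
    return dict(rows), 1


ids = list(range(1, 501))
slow_prices, slow_trips = fetch_one_by_one(conn, ids)
fast_prices, fast_trips = fetch_batched(conn, ids)

# Using the 5ms-per-query figure from the text as the assumed cost per round trip.
print(f"one by one: {slow_trips} round trips, ~{slow_trips * 5} ms at 5 ms each")
print(f"batched:    {fast_trips} round trip")
```

Both functions return the same data. Very long IN lists have their own limits and costs, so in practice a join or a chunked batch may be the better fit.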
Tip 6 – Resolve Exceptions with Business Context
It may be normal practice to ignore exceptions and errors, especially when it involves trawling through system events or log files. The reality is that unhandled exceptions create fog when “real” exceptions are thrown, and they can also cause CPU spikes depending on their size and frequency. It is also no secret that throwing exceptions can be an expensive operation. So the two important questions here are: how often are exceptions being thrown in the application, and where are they being thrown from? Using an APM tool it’s possible to get answers to these questions. For example, look at the following screenshot:
Here we see 127,335 exceptions were thrown during the last day. The next step is to find where these exceptions are being thrown. The below screenshot shows one of those 127,335 exceptions. The details reveal that a connection was terminated while the business transaction “Orders Queue” was reading data.
From the business context provided, we can conclude that the application cannot read or process orders from the message queue. Business context is therefore critical to understanding the severity and business impact of exceptions in your application.
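Answering “how often and from where” comes down to grouping exception events by the business transaction that raised them. Here is a minimal Python sketch of that grouping. The event records and exception names are hypothetical, and in practice they would come from your logs or APM tool rather than a hard-coded list.

```python
from collections import Counter

# Hypothetical (business transaction, exception type) events.
events = [
    ("Orders Queue", "ConnectionTerminated"),
    ("Orders Queue", "ConnectionTerminated"),
    ("Submit Order", "Timeout"),
    ("Orders Queue", "ConnectionTerminated"),
    ("Submit Order", "ValidationFailed"),
]


def exceptions_by_transaction(events):
    """Count exceptions per (business transaction, exception type) pair."""
    return Counter(events)


counts = exceptions_by_transaction(events)
for (transaction, exception), count in counts.most_common():
    print(f"{count:3}  {transaction:14} {exception}")
```

Sorting by count puts the noisiest transaction and exception pair at the top, which gives you a starting point for deciding which exceptions matter to the business and which are just fog.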
Tip 7 – Understand how your application consumes system resources
You have a finite amount of resources that can service a finite number of business transactions at any given time. As transaction concurrency increases in applications, so does the amount of resources that are used. This causes transaction throughput to increase until all the resources are exhausted. However, CLR configuration can often limit how an application can consume system resources, and thus has a direct impact on application throughput and performance. Being able to trend CLR and system resource metrics over time lets you understand the correlation between application performance and system resource utilization.
For example, the following screenshot compares the following application and system resource metrics:
- % CPU Utilization of Server
- Avg Response Time of Application
- # of Thread Contentions
- # of Physical CLR Threads
- % Garbage Collection Time Spent
- Processor Queue Length
At 2am, application response time was around 80ms (green) with CPU Utilization at 40% (red). Then, around 3am, thread contention (purple) started to occur in the application with response time increasing to over 200ms and CPU utilization hitting 80%, before returning to normal around 8am. We can conclude from this that the application is very sensitive to thread contention, which results in increased application response time and server CPU utilization. From this data you might choose to investigate which business transactions and code are accessing shared resources or using synchronization mechanisms. Reducing this could have a dramatic impact on response time and the CPU utilization of the application.
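If you can export these metrics as time series, a correlation coefficient gives a quick sanity check on what the chart suggests visually. The Python sketch below computes a Pearson correlation between thread contentions and response time. The hourly samples are hypothetical values loosely shaped like the scenario described above, not data read from the screenshot.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hourly samples from 2am to 8am.
thread_contentions = [0, 40, 55, 60, 50, 35, 0]
response_time_ms = [80, 210, 230, 240, 220, 200, 85]

r = pearson(thread_contentions, response_time_ms)
print(f"correlation between contention and response time: {r:.2f}")
```

A value close to 1 supports the idea that the two metrics move together, but it does not prove that contention causes the slowdown. It tells you where to start looking in the code.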
In summary, monitoring in production is critical to any effective APM strategy. While development and simulated testing can help you tune and iron out performance defects early on in the dev cycle, they simply cannot cover every use case or load scenario, which can often bite you in production. The secret to managing application performance is to see the bigger picture: you must use a business transaction context to focus your priorities, and analyze the application components you need to (code, SQL, exceptions, CLR config) in order to understand where latency occurs. With latency isolated, you’ll have a performance baseline against which to measure future results. You should then set a goal for when to stop optimizing. Once you’ve reached your goals and deployed your fixes, you should then be able to verify the gains and impact your work had in production. APM is a constant process that can help application development and support teams ensure superior levels of application performance and availability.