The last couple articles presented an introduction to Application Performance Management (APM) and identified the challenges in effectively implementing an APM strategy. This article builds on these topics by reviewing five of the top performance metrics to capture to assess the health of your enterprise PHP application.
Specifically this article reviews the following:
1. Business Transactions
Business Transactions provide insight into real-user behavior: they capture real-time performance that real users are experiencing as they interact with your application. As mentioned in the previous article, measuring the performance of a business transaction involves capturing the response time of a business transaction holistically as well as measuring the response times of its constituent tiers. These response times can then be compared with the baseline that best meets your business needs to determine normalcy.
If you were to measure only a single aspect of your application I would encourage you to measure the behavior of your business transactions. While container metrics can provide a wealth of information and can help you determine when to auto-scale your environment, your business transactions determine the performance of your application. Instead of asking for the CPU usage of your application server you should be asking whether or not your users are able to complete their business transactions and if those business transactions are behaving normally.
As a little background, business transactions are identified by their entry-point, which is the interaction with your application that starts the business transaction. In the case of a PHP application, this is usually the HTTP request. There may be some exceptions, such as PHP CLI, in which case the business transaction could be a PHP script executed by a cron job from the command line. In the case of a PHP worker server, the business transaction could potentially be the job that the PHP application executes that it picked up from a queue server. Alternatively, you may choose to define multiple entry-points for the same web request based on a URL parameter or for a service call based on the contents of its body. The point is that the business transaction needs to be related to a function that means something to your business.
Once a business transaction is identified, its performance is measured across your entire application ecosystem. The performance of each individual business transaction is evaluated against its baseline to assess normalcy. For example, we might determine that if the response time of the business transaction is slower than two standard deviations from the average response time for this baseline that it is behaving abnormally, as shown in figure 1.
Figure 1 Evaluating BT Response Time Against its Baseline
The baseline used to evaluate the business transaction is evaluated is consistent for the hour in which the business transaction is running, but the business transaction is being refined by each business transaction execution. For example, if you have chosen a baseline that compares business transactions against the average response time for the hour of day and the day of the week, after the current hour is over, all business transactions executed in that hour will be incorporated into the baseline for next week. Through this mechanism an application can evolve over time without requiring the original baseline to be thrown away and rebuilt; you can consider it as a window moving over time.
In summary, business transactions are the most reflective measurement of the user experience so they are the most important metric to capture.
2. Internal Dependencies
Your PHP application may be utilizing a backend database, a caching layer, or possibly even a queue server as it offloads I/O intensive blocking tasks onto worker servers to process in the background. Whatever the backend your PHP application interfaces with, the latency to these backend services can affect the performance of your PHP application performance. The various types of internal exit calls may include:
In some environments, your PHP application may be interfacing with an obscure backend or messaging/queue server. For example, you may have an old message broker serving as an interface between your PHP application and other applications. While this message broker may be outdated, it is nevertheless part of an older architecture and is part of the ecosystem in which your distributed applications communicate with. Even if your APM solution does not autodiscover the messaging server, you may build additional instrumentation to detect the messaging server and measure the latency. If you want to draw correlation so the business transaction does not become fragmented, you may also build additional instrumentation so that the correlation is autocompleted when the request passes through the messaging server to the applications on the other side of the queue.
From a business transaction perspective, we can identify and measure internal dependencies as being in their own tiers. Sometimes we need to configure the monitoring solution to identify methods that really wrap external service calls, but for common protocols, such as HTTP and CLI, internal dependencies can be automatically detected. Similar to business transactions and their constituent application tiers, external dependency behavior should be baselined and response times evaluated against those baselines.
However your PHP application communicates with internal services, the latency in waiting for the response can potentially impact the performance of your application and your customer experience. Measuring and optimizing the response time if your communications can help solve for these bottlenecks.
3. External Calls
In addition to your internal services and backends, exit calls may also include remote third-party web service APIs that your application makes calls to in real-time. For example, if your customer is attempting to purchase items in a shopping cart and in order for the transaction to complete your application must charge their credit card so that you can display a confirmation or error page. This is an example of a blocking exit call because your entire transaction is now dependent on the response of that call being made to your third-party merchant provider that is charging the credit card. If the customers credit card was charged successfully, the user is presented with a confirmation page and a receipt. If the credit card was declined, the user is presented with an error message to try again. Regardless, the customer is waiting on the application that is dependent on the third party merchant provider. The latency of the exit call will have an immediate impact on the performance of this particular instance.
External dependencies can come in various forms and are systems with which your application interacts. We do not necessarily have control over the code running inside external dependencies, but we often have control over the configuration of those external dependencies, so it is important to know when they are running well and when they are not. Furthermore, we need to be able to differentiate between problems in our application and problems in dependencies.
Business transactions provide you with the best holistic view of the performance of your application and can help you triage performance issues, but external dependencies can significantly affect your applications in unexpected ways unless you are watching them.
4. Caching Strategy
It is always faster to serve an object from memory than it is to make a network call to retrieve the object from a system like a database; caches provide a mechanism for storing object instances locally to avoid this network round trip. But caches can present their own performance challenges if they are not properly configured. Common caching problems include:
Loading too much data into the cache
Not properly sizing the cache
When measuring the performance of a cache, you need to identify the number of objects loaded into the cache and then track the percentage of those objects that are being used. The key metrics to look at are the cache hit ratio and the number of objects that are being ejected from the cache. The cache hit count, or hit ratio, reports the number of object requests that are served from cache rather than requiring a network trip to retrieve the object. If the cache is huge, the hit ratio is tiny (under 10% or 20%), and you are not seeing many objects ejected from the cache then this is an indicator that you are loading too much data into the cache. In other words, your cache is large enough that it is not thrashing (see below) and contains a lot of data that is not being used.
The other aspect to consider when measuring cache performance is the cache size. Is the cache too large, as in the previous example? Is the cache too small? Or is the cache sized appropriately?
A common problem when sizing a cache is not properly anticipating user behavior and how the cache will be used. Let’s consider an APC cache configured to use 32M of memory. If 32M is not enough for APC to be used as an opcode cache, then you run the risk of exchanging data in memory for data on the disk. Considering disk I/O is much slower than reading from memory, this defeats the purpose of your in-memory cache and significantly slows down your application performance. The result is that we’re spending more time managing the cache rather than serving objects: in this scenario the cache is actually getting in the way rather than improving performance.
When you size a cache too small and the aforementioned behavior occurs, we say that the cache is thrashing and in this scenario it is almost better to have no cache than a thrashing cache. Figure 2 attempts to show this graphically.
Figure 2 Cache Thrashing
In this situation, the application requests an object from the cache, but the object is not found. It then queries the external resource across the network for the object and adds it to the cache. Finally, the cache is full so it needs to choose an object to eject from the cache to make room for the new object and then add the new object to the cache.
5. Application Topology
The final performance component to measure in this top-5 list is your application topology. Because of the advent of the cloud, applications can now be elastic in nature: your application environment can grow and shrink to meet your user demand. Therefore, it is important to take an inventory of your application topology to determine whether or not your environment is sized optimally. If you have too many virtual server instances then your cloud-hosting cost is going to go up, but if you do not have enough then your business transactions are going to suffer.
It is important to measure two metrics during this assessment:
Business Transaction Load
Business transactions should be baselined and you should know at any given time the number of servers needed to satisfy your baseline. If your business transaction load increases unexpectedly, such as to more than two times the standard deviation of normal load then you may want to add additional servers to satisfy those users.
The other metric to measure is the performance of your containers. Specifically you want to determine if any tiers of servers are under duress and, if they are, you may want to add additional servers to that tier. It is important to look at the servers across a tier because an individual server may be under duress due to factors like garbage collection, but if a large percentage of servers in a tier are under duress then it may indicate that the tier cannot support the load it is receiving.
Because your application components can scale individually, it is important to analyze the performance of each application component and adjust your topology accordingly.
This article presented a top-5 list of metrics that you might want to measure when assessing the health of your application. In summary, those top-5 items were:
In the next article we’re going to pull all of the topics in this series together to present the approach that AppDynamics took to implementing its APM strategy. This is not a marketing article, but rather an explanation of why certain decisions and optimizations were made and how they can provide you with a powerful view of the health of a virtual or cloud-based application.