Reduce time to detect with AppDynamics Cloud Log Analytics

April 03 2023
 

How machine learning in AppDynamics Cloud accelerates log analysis and reduces mean time to detect.


Site reliability engineers (SREs) need to investigate unknown problems reported in production. The common approach is to search and filter log files to find the root cause, and anyone who has done it knows how painful sifting through log contents can be. It’s like finding a needle in a haystack. A machine learning approach is essential to help SREs identify the root cause quickly.

Searching millions of log entries is tedious and time-consuming. SREs typically draw on past experience to compose search criteria and usually start with the most severe errors in the log file, working toward the reported problem. However, this approach can miss the obvious signal that the software has hung: “Info” level logs stop appearing. Finding log patterns and identifying abnormalities in those patterns is a task that machine learning algorithms handle well.
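To make the "missing Info logs" signal concrete, here is a minimal sketch that looks for unusually long silences between consecutive Info-level timestamps. The function name `find_info_gaps`, the two-minute threshold and the sample timestamps are all assumptions made for illustration; this is not how AppDynamics Cloud detects the condition.

```python
from datetime import datetime, timedelta


def find_info_gaps(timestamps, max_gap):
    """Return (start, end) pairs where consecutive timestamps are more than max_gap apart."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_gap:
            gaps.append((earlier, later))
    return gaps


# Hypothetical Info-level timestamps: one entry every 30 seconds, then a long silence.
base = datetime(2023, 4, 3, 12, 0, 0)
info_times = [base + timedelta(seconds=30 * i) for i in range(5)]
info_times.append(base + timedelta(minutes=12))

# Assumed threshold: any silence longer than two minutes is suspicious.
gaps = find_info_gaps(info_times, max_gap=timedelta(minutes=2))
for start, end in gaps:
    print(f"No Info logs between {start} and {end}")
```

A search that filters on "Error" severity would never surface this interval, because the symptom is the absence of entries rather than the presence of one.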

In this blog post, we will discuss how machine learning can accelerate log analysis to reduce SRE time to detect.

Quick refresher: why applications generate logs

Before we dive into where machine learning can be applied to log analysis, let’s not forget why applications generate logs in the first place. Logs record what the application does while it is running, and SREs use them for post-mortem analysis: the information in the log files lets an SRE reconstruct what happened and why the application did not perform as expected. Although the log file format varies from application to application, at a minimum it contains three fields: timestamp, log level and log content. Given how frequently logs are written, an automated tool should analyze them to discover log content patterns and the log-level distribution.
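As an illustration of the three-field structure, the sketch below parses lines into timestamp, log level and log content, then counts how often each level occurs. The line layout (`<timestamp> <LEVEL> <content>`), the regular expression and the sample lines are assumptions; real applications use many different formats.

```python
import re
from collections import Counter

# Assumed layout: "<timestamp> <LEVEL> <content>", separated by whitespace.
LINE_RE = re.compile(r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<content>.*)$")


def parse_line(line):
    """Split one log line into its three fields, or return None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


def level_distribution(lines):
    """Count how many parsed lines were written at each log level."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record["level"]] += 1
    return counts


# Hypothetical sample lines, including one that does not follow the layout.
sample = [
    "2023-04-03T12:00:00Z INFO Request 1001 served in 12 ms",
    "2023-04-03T12:00:01Z INFO Request 1002 served in 9 ms",
    "2023-04-03T12:00:02Z WARN Retry 1 for connection to db-7",
    "not a log line",
]
distribution = level_distribution(sample)
print(distribution)
```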

How AppDynamics Cloud Log Analytics saves time

Fig. 1 shows logs for a specific Service Instance; markers above some bars indicate outliers.

Coarse-grain noise reduction with content pattern detection

When an application generates terabytes of log content daily, it takes time to index and search that content. Once the application is running at a steady state, however, its log content patterns also reach a steady state: new log content is still written to disk at terabytes per day, but no new patterns are detected. We have observed six million log entries reduced to four hundred patterns. The log pattern machine learning algorithm significantly shrinks the log content to a manageable amount of information.
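The idea of collapsing many entries into a few patterns can be sketched with a simple masking rule: replace the variable parts of each message (here, numbers and hexadecimal values) with placeholders and count what remains. This is a deliberately naive stand-in, not the pattern detection algorithm used in AppDynamics Cloud; the placeholder names and sample messages are assumptions.

```python
import re
from collections import Counter

# Assumed variable tokens: hexadecimal literals and plain or decimal numbers.
HEX_RE = re.compile(r"\b0x[0-9a-fA-F]+\b")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def to_pattern(content):
    """Mask variable tokens so that similar messages map to the same pattern."""
    content = HEX_RE.sub("<HEX>", content)
    return NUMBER_RE.sub("<NUM>", content)


# Hypothetical log contents.
contents = [
    "Request 1001 served in 12 ms",
    "Request 1002 served in 9 ms",
    "Request 1003 served in 15.5 ms",
    "Cache miss for key 0x1f3a",
    "Cache miss for key 0x2b00",
]
patterns = Counter(to_pattern(c) for c in contents)
for pattern, count in patterns.most_common():
    print(count, pattern)
```

Five entries collapse into two patterns here. At steady state the entry count keeps growing while the set of patterns stops changing, which is what makes the reduced view manageable.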

Fig. 2 shows the Logs Exploration page, where the histogram highlights outliers for quicker orientation when something unusual starts to happen, and individual log messages are grouped into ranked patterns.

Fine-grain noise reduction with log-level distribution

As mentioned above, the log file also contains the timestamp and log level. When an application is running as expected, its log-level distribution mainly shows info, trace and debug severity. Once in a while it also shows warning or error messages, which should not trigger an alert because the application has self-healing built in. The machine learning algorithm treats these warning or error messages as noise. When the log output deviates from the normal state, AppDynamics Cloud applies both change point detection and anomaly detection to narrow down the problematic time frame and correlate it with the patterns. It then ranks the log patterns by relevance weight. The ranking machine learning algorithm takes the guesswork out of log search and presents the log patterns in priority order.
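The two steps described above, flagging the problematic time frame and then ranking patterns, can be approximated with a rough sketch. It uses a z-score against a baseline window in place of change point and anomaly detection, and ranks patterns by how much their count grew between two equal-length windows. Both choices, along with the function names, threshold and sample data, are assumptions for illustration and do not reflect the models or weights used by AppDynamics Cloud.

```python
from statistics import mean, pstdev


def find_outlier_buckets(counts, baseline_size, threshold=3.0):
    """Flag bucket indexes after the baseline whose count exceeds the baseline mean by more than `threshold` standard deviations."""
    baseline = counts[:baseline_size]
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1.0  # fall back to 1.0 when the baseline is flat
    return [
        i
        for i, count in enumerate(counts)
        if i >= baseline_size and (count - mu) / sigma > threshold
    ]


def rank_patterns(normal_counts, anomalous_counts):
    """Order patterns by how much their count grew in the anomalous window (largest growth first)."""
    names = set(normal_counts) | set(anomalous_counts)
    scores = {p: anomalous_counts.get(p, 0) - normal_counts.get(p, 0) for p in names}
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical warning/error counts per time bucket: occasional noise, then a jump.
error_counts = [2, 3, 1, 2, 3, 2, 40, 38]
outliers = find_outlier_buckets(error_counts, baseline_size=6)

# Hypothetical pattern counts in a normal window and in the flagged window.
normal = {
    "Request <NUM> served in <NUM> ms": 500,
    "Retry <NUM> for connection to db-<NUM>": 2,
}
anomalous = {
    "Request <NUM> served in <NUM> ms": 480,
    "Retry <NUM> for connection to db-<NUM>": 35,
    "Connection pool exhausted": 40,
}
ranked = rank_patterns(normal, anomalous)
print(outliers)
print(ranked)
```

The occasional warnings in the baseline buckets stay below the threshold and are treated as noise; only the buckets with the jump are flagged, and the patterns that grew most in that window rise to the top of the list.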

Fig. 3 shows that to investigate a pattern’s details, the user clicks on its row and is redirected to the second tab to see the raw log messages that belong to the selected pattern. Other exploration tools are also available, such as filtering the logs, viewing the details of each log message and adding columns to the table.

AppDynamics Cloud is built by Cisco AppDynamics engineers who understand the problems of the observability domain and help teams save time with modern machine learning algorithms.

See firsthand how SREs can save time finding relevant log content with a demo of AppDynamics Cloud.

Linda Zhou is the Director of Data Science Engineering at Cisco AppDynamics. Her team is responsible for delivering machine learning capabilities and MLOps infrastructure for full-stack observability. She has in-depth knowledge of machine learning, observability, big data analytics, IT service management and compliance archiving. Prior to joining Cisco, she held business and technical positions at Western Digital, Silicon Graphics, EMC, Hewlett Packard and BMC Software, and ran a development services company in the data management space.
