Site reliability engineers (SREs) need to investigate unknown problems reported in production. The common approach is to search and filter log files to find the root cause, and anyone who has done it knows how painful sifting through log content can be. It’s like finding a needle in a haystack. Machine learning can help SREs identify the root cause much faster.
Searching millions of log entries is tedious and time-consuming. SREs typically draw on past experience to compose search criteria, and usually start with the most severe errors in the log file and work toward the reported problem. However, this approach can miss an obvious signal that the software has hung: “Info” level logs simply stop appearing. Finding log patterns and spotting abnormalities in those patterns are tasks that machine learning algorithms handle well.
In this blog, we will discuss how machine learning can accelerate log analysis and reduce the time SREs need to detect a problem.
Quick refresher: why applications generate logs
Before we dive into where machine learning can be applied to log analysis, let’s not forget why applications generate logs in the first place. Logs record what the application does while it is running, and SREs use them for post-mortem analysis. An SRE uses the information in the log files to reconstruct what happened and why the application did not perform as expected. Although the log file format varies from application to application, each entry contains at least three fields: timestamp, log level and log content. Given how frequently logs are written, an automated tool should be used to analyze them and discover the log content patterns and the log-level distribution.
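To make the three fields concrete, here is a minimal sketch of splitting one log line into timestamp, log level and log content. The line layout, the level names, and the names LogEntry, LINE_RE and parse_line are assumptions chosen for illustration; real applications format their logs differently, so the regular expression would need to be adapted.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: datetime
    level: str
    content: str


# Assumed layout (hypothetical): "2023-01-15 10:32:01,123 INFO message text"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+"
    r"(?P<level>TRACE|DEBUG|INFO|WARN|ERROR)\s+"
    r"(?P<content>.*)$"
)


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the three fields of a log line, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    # Milliseconds are dropped here to keep the sketch short.
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
    return LogEntry(ts, match.group("level"), match.group("content"))
```

Lines that do not match, such as stack trace continuation lines, come back as None so the caller can decide whether to skip them or attach them to the previous entry.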
How AppDynamics Cloud log analytics saves time
Coarse-grain noise reduction with content pattern detection
When an application generates terabytes of log content daily, it takes time to index and search that content. Once the application is running at a steady state, its log content patterns also reach a steady state: no new patterns are detected. At this stage, new log content is still written to disk at a rate of terabytes per day, but it produces no new patterns. In one case, we observed six million log entries reduced to four hundred patterns. The log pattern machine learning algorithm significantly shrinks the log content to a manageable amount of information.
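The idea of collapsing many entries into a few patterns can be sketched with a simple stand-in: mask the parts of each message that vary (numbers, hex identifiers) and count the templates that remain. This is not the AppDynamics Cloud algorithm, which the post does not describe; the masking rules and the names MASKS, to_pattern and count_patterns are assumptions for illustration only.

```python
import re
from collections import Counter

# Masking rules are applied in order: hex identifiers first, then numbers
# (including dotted ones such as IP addresses or version strings).
MASKS = [
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(?:\.\d+)*\b"), "<NUM>"),
]


def to_pattern(content: str) -> str:
    """Replace the variable parts of a log message with placeholder tokens."""
    for regex, token in MASKS:
        content = regex.sub(token, content)
    return content


def count_patterns(contents):
    """Map each distinct pattern to the number of messages that produced it."""
    return Counter(to_pattern(c) for c in contents)
```

With these rules, "Connected to db-7 in 35 ms" and "Connected to db-12 in 4 ms" both collapse to "Connected to db-<NUM> in <NUM> ms". At steady state the set of keys in the counter stops growing even though the counts keep rising, which is the behavior described above.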
Fine-grain noise reduction with log-level distribution
As mentioned above, each log entry also contains a timestamp and a log level. When an application is running as expected, its log-level distribution mainly shows info, trace and debug severity. Once in a while, it also shows warning or error messages; these should not trigger an alert, because the application has self-healing built in and recovers on its own. The machine learning algorithm treats these occasional warning or error messages as noise. When the log output deviates from the normal state, AppDynamics Cloud applies both change point detection and anomaly detection to narrow down the problematic time frame and correlate it with the log patterns. It then ranks the log patterns by a relevance weight. This ranking takes the guesswork out of log search and presents the log patterns in priority order.
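As a rough illustration of narrowing down the problematic time frame, the sketch below buckets entries per minute, computes the share of warning and error entries in each bucket, and flags buckets whose share sits far above a baseline taken from the first few minutes. It is a simple z-score check used as a stand-in, not the change point or anomaly detection that AppDynamics Cloud uses; the function names, the five-minute baseline, the threshold of 3.0 and the 0.01 floor on the standard deviation are all assumptions.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def level_distribution(entries):
    """entries: iterable of (timestamp, level). Returns {minute: Counter of levels}."""
    buckets = defaultdict(Counter)
    for ts, level in entries:
        buckets[ts.replace(second=0, microsecond=0)][level] += 1
    return dict(buckets)


def error_ratio(counts):
    """Share of WARN and ERROR entries in one time bucket."""
    total = sum(counts.values())
    return (counts["ERROR"] + counts["WARN"]) / total if total else 0.0


def anomalous_windows(buckets, baseline_size=5, threshold=3.0):
    """Return the minutes whose error ratio is far above the baseline."""
    minutes = sorted(buckets)
    ratios = [error_ratio(buckets[m]) for m in minutes]
    baseline = ratios[:baseline_size]
    mu = mean(baseline)
    # Floor the spread so a perfectly quiet baseline does not divide by zero.
    sigma = max(pstdev(baseline), 0.01)
    return [m for m, r in zip(minutes, ratios) if (r - mu) / sigma > threshold]
```

An isolated warning in an otherwise healthy minute stays below the threshold and is treated as noise, while a minute dominated by errors is flagged. A fuller version would also flag minutes in which expected “Info” entries disappear, since that silence is the hang signal mentioned earlier.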
AppDynamics Cloud is built by Cisco AppDynamics engineers who understand the problems of the observability domain, and it helps teams save time by leveraging modern machine learning algorithms.
See firsthand how SREs can save time finding relevant log content with a demo of AppDynamics Cloud.