I’ve got many years of performance geekery under my belt and I’ve learned many lessons during that time. One of the most important lessons is paying close attention to the distinction between data and information. Let’s take a look at how the dictionary defines each term:
Data – facts and statistics collected together for reference or analysis.
Information – facts provided or learned about something or someone.
What do these definitions reveal to us about data and information, and how do they apply to monitoring tools? Let’s explore that together. I’ll provide specific examples along the way to illustrate my points.
The Problems with Data
Data is fundamental to problem solving, but I don’t want to have to dig through a bunch of data while my business-critical, mission-critical, revenue-generating applications are down. To me, data is just like this picture…
Problem #1: That giant steaming pile is just like a bunch of data collected by your monitoring tools. Somewhere in that pile lies a golden ring which is the answer to your application problem. The last thing I want to do is get my hip waders on and start manually digging through that stinky mess. Instead I’d prefer to hit the pile with a fire hose and wash away all the crap so I can be shown the exact location of my shiny golden ring.
Problem #2: I don’t want to store every steaming pile I come across just in case there is a golden ring buried within. I’d much rather have a way of detecting something of interest contained within the pile before I transport and take up valuable storage space with it.
Problem #3: Nobody wants to look at or even get a whiff of that ugly, stinky, steaming pile! They all want to see the ring after you have found it, but you probably want to clean it up a bit before anyone else sees it.
The cure for all of these problems is analytics.
Analytics Create Information
Analytics (whether human or machine based) transform data into information. Personally I prefer to let the machines do the heavy lifting when it comes to analyzing millions of data points in real time, but if you can take on that burden as a human then you have my utmost respect. Analytics can, and should, occur at varying levels of your monitoring architecture.
Agents: If you have agents (and you need them to get detailed call traces from within your applications), they should be intelligent enough to know when to grab the full detail they are capable of and when to leave well enough alone. The ability to “do no harm” when your application is working great is often overlooked during the buying process and can mean the difference between a rapid return on investment or an expensive mistake. Intelligent agents create less application overhead, send less data across your network, and generate less data that needs to be stored.
Monitoring Server: I prefer that the agent have just enough intelligence to do the job described above. When an agent tries to do too much it can create excess overhead that might impact application performance or stability. The heavy lifting can be done on the server side of your monitoring architecture. This is where data gets converted into information via analytics. If the agent did its job properly (relaying the relevant data) you can perform very complex analytics on your monitoring server for a really large number of nodes without having to spend a ton of money on monitoring hardware.
Analytics Platform: I’m also a fan of external analytics platforms that can correlate data across many disparate nodes and make sense of the data chaos. You need to collect the right data before this type of platform provides its true value but mature organizations should be performing centralized analytics and behavioral learning of some sort.
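To make the "do no harm" agent idea above a bit more concrete, here is a minimal sketch of an agent that always keeps a cheap counter but only holds on to the full call trace when a request looks slow. Everything in it is my own illustration: the SketchAgent class, the slow_threshold_ms cutoff, and the sample numbers are hypothetical, not how any particular product works.

```python
from dataclasses import dataclass, field


@dataclass
class SketchAgent:
    """Hypothetical agent: cheap summary always, full detail only when slow."""
    slow_threshold_ms: float                        # assumed cutoff for "slow"
    snapshots: list = field(default_factory=list)   # full traces worth shipping
    summary_count: int = 0                          # lightweight per-request counter

    def record(self, name, elapsed_ms, call_trace):
        # Always update the cheap summary.
        self.summary_count += 1
        # Keep the expensive call trace only when the request was slow.
        if elapsed_ms > self.slow_threshold_ms:
            self.snapshots.append((name, elapsed_ms, call_trace))
            return True
        return False


agent = SketchAgent(slow_threshold_ms=500)
fast_kept = agent.record("checkout", 120, ["controller", "db.query"])
slow_kept = agent.record("checkout", 900, ["controller", "db.query", "payment.call"])
```

In this sketch the fast request leaves nothing behind but a counter bump, while the slow one is stored with its trace. A fixed cutoff is the simplest possible trigger; a real agent would more likely compare against learned normal behavior.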
Now that I have explained my views on data versus information, what is the real point of this blog? I’m a performance geek. I have many years of suffering through root cause analysis under my belt. I learned over time and through experience what information was helpful to problem resolution and what was just noise. I remember one day of my career very clearly. It is the day I graduated from “just another systems administrator” to “I will stop at nothing to solve these performance issues”. Cue dreamy flashback sequence…
The phone rang. It was the operations center gathering the support personnel for an application that had components on a few of my servers. The app had been unacceptably slow (seriously, there was an accepted level of slow response time, though only the end users could tell us what was unacceptably slow) and nobody knew what was wrong. So like a good soldier I jumped onto the bridge line and fired up my command line tools to see what was happening from a server OS perspective (AIX in this case). As I’m digging through my metrics I come to a stark realization. In the absence of blatantly obvious problems (over 90% continuous CPU utilization, memory thrashing, high I/O wait, etc.) I have no concept of what is normal for each metric I’m currently inspecting. Then I realize that even though I hear other folks on the call saying their components all look fine, I have no idea what they are checking. I find myself wishing I had a tool that could track each transaction through this distributed application and tell me where it slowed down. That way we could all get off of this agonizing bridge line and let the people with the problem focus on fixing it.
Unfortunately it would be a few years until a product existed that could do what I was dreaming of that day. As it turns out, the problem was not related to my servers, but I got to waste about 6 hours of my day trying to figure out what the problem was. So as a result of this futile exercise (and many other similar situations) I realized a core tenet of troubleshooting application problems. For any metric I am currently examining I must know its value under normal operating conditions as well as the degree of deviation from normal. This is the difference between data (my base metric) and information (the typical value and degree of deviation). Actionable information reduces MTTR while large amounts of data contribute to longer MTTR.
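Here is a rough sketch of that idea: take a raw reading (data) and attach the typical value and the degree of deviation (information). I’m assuming a simple mean and standard deviation over recent history as the definition of "normal", which is only one of many ways to build a baseline. The describe_metric function and the CPU numbers are made up for illustration.

```python
from statistics import mean, stdev


def describe_metric(history, current):
    """Attach a baseline and a deviation (in standard deviations) to a reading."""
    baseline = mean(history)   # typical value under normal conditions
    spread = stdev(history)    # sample standard deviation of the history
    if spread == 0:
        # Flat history: any difference at all is an unbounded deviation.
        if current == baseline:
            sigmas = 0.0
        else:
            sigmas = float("inf") if current > baseline else float("-inf")
    else:
        sigmas = (current - baseline) / spread
    return {"current": current, "baseline": baseline, "sigmas": sigmas}


cpu_history = [40, 42, 38, 41, 39]   # made-up CPU utilization percentages
info = describe_metric(cpu_history, 55)
```

A reading of 55% CPU means very little on its own. Knowing that the same server normally sits around 40% and that 55% is many standard deviations away from that is something I can act on.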
There is a ton of other information I need to troubleshoot most application problems but that is a topic for another blog post. Do yourself and your business a favor and make sure that your monitoring tools are providing actionable information instead of massive amounts of data that you will probably never use. If your tools fall into the latter category it’s probably time to rethink your strategy. Try an intelligent Application Performance Management (APM) solution and see all of the information you’ve been missing. Get started today with a free 30-day trial.