Have you ever tried to troubleshoot a production bottleneck or outage? Let me guess: the first place you looked was the log files, right? And in those log files lay all the answers to your problems? Err, not exactly. Log files are like haystacks; they take up lots of space, and it takes hours to find the precious needles that are causing you pain. Even with tools like “kerplunk,” your troubleshooting success is only as good as the data you can collect, manage and report. You can’t log everything because disk I/O and debug logging are expensive operations, which is why most log files today only contain basic information about what applications are doing in their various infrastructure silos.
For example, if you need to troubleshoot how a slow distributed business transaction executed across your infrastructure, it could take you hours to piece the jigsaw together, if you manage it at all. And if one piece of the jigsaw is missing, the trail goes cold and you’re left scratching your head. Bottom line: managing application performance with log files is still a long, manual and tedious task with no guarantee of success. You can’t manage with facts if you don’t have all the facts in the first place.
So even with tools that help you parse and index log files, you’re still dependent on the right data being captured and available. Pointing the finger with weak or incomplete evidence just fuels the fire when it comes to figuring out who and what is causing the issue. For example, have you ever tried telling a DBA that their database is the issue just because it’s slow?