Applications typically log additional data such as exceptions to different data sources. Windows event logs, local files, and SQL databases are most commonly used in production. New applications can take advantage of leveraging big data instead of individual files or SQL.
One of the most surprising experiences when we start monitoring applications is noticing the logging is not configured properly in production environments. There have been two types of misconfiguration errors we’ve seen often in the field:
logging configuration is copied from staging settings
while deploying the application to production environment, logging wasn’t fully configured and the logging failed to log any data
To take a closer look, I have a couple of sample applications to show how the problems could manifest themselves. These sample applications were implemented using MVC5 and are running in Windows Azure and using Microsoft Enterprise Library Exception Handling and Logging blocks to log exceptions to the SQL database. There is no specific preference regarding logging framework or storage, just wanted to demonstrate problems similar to what we’ve seen with different customers.
Situation #1 Logging configuration was copied from staging to production and points to the staging SQL database
When we installed AppDynamics and it automatically detected the application flowmap, I noticed the application talks to the production UserData database and… a staging database for logging.
The other issue was the extremely slow response time while calling the logging database. The following snapshot can explain the slow performance, as you see there’s an exception happening while trying to run an ADO.NET query:
Exception details confirm the application was not able to connect to a database, which is expected — the production environment in located in DMZ and usually can’t reach a staging network.
To restate what we see above — this is a failure while trying to log the original exception which could be anything from a user not being able to log into the website to failing to checkout.
At the same time the impact is even higher because the application spends 15 seconds trying to connect to the logging database and timeout, all while the user is waiting.
Situation #2 During deployment the service account wasn’t granted permissions to write to the logging database
This looks similar to the example above but when we drill inside the error we can see the request has an internal exception happened during the processing:
The exception says the service account didn’t have permissions to run the stored procedure “WriteLog” which logs entries to the logging database. From the performance perspective, the overhead of security failure is less from timeouts in the example above but the result is the same — we won’t be able to see the originating exception.
Not fully documenting or automating the application deployment/configuration process usually causes such problems.
These are one-time issues that once you fix it will work on the machine. However, next time you deploy the application to a new server or VM this will happen again until you fix the deployment.
Let’s check the EntLigLogging database — it has no rows
Here’s some analysis to explain why this happened:
We found exceptions when the application was logging the error
This means there was an original error and the application was trying to report it using logging
Logging failed which means the original error was never reported!
And… logging doesn’t log anywhere about its failures, which means from a logging perspective the application has no problems!!
This is logically correct — if you can’t log data to the storage database you can’t log anything. Typically, loggers are implemented similar to the following example:
Logging is the last option in this case and when it fails nothing else happens as you see in the code above.
Just to clarify, AppDynamics was able to report these exceptions because the agent instruments common methods like ADO.NET calls, HTTP calls, and other exit calls as well as error handlers, which helped in identifying the problem.
Going back to our examples, what if the deployment and configuration process is now fixed and fully automated so there can’t be a manual mistake? Do you still need to worry? Unfortunately, these issues happen more often than you’d expect, here is another real example.
Situation #3 What happens when the logging database fills up?
Everything is configured correctly but at some point the logging database fills up. In the screenshot above you can this this happened around 10:15pm. As a result, the response time and error rates have spiked.
Here is one of the snapshots collected at that time:
You can see that in this situation it took over 32 seconds trying to log data. Here are the exception details:
The worst part is at 10:15pm the application was not able to report about its own problems due to the database being completely full, which may incorrectly be translated that the application is healthy since it is “not failing” because there are no new log entries.
We’ve seen enough times that the logging database isn’t seen as a critical piece of the application therefore it gets pushed down the priority list and often overlooked. Logging is part of your application logic and it should fall into the same category as the application. It’s essential to document, test, properly deploy and monitor the logging.
This problem could be avoided entirely unless your application receives an unexpected surge of traffic due to a sales event, new release, marketing campaign, etc. Other than the rare Slashdotting effect, your database should never get to full capacity and result in a lack of logging. Without sufficient room in your database, your application’s performance is in jeopardy and you won’t know since your monitoring framework isn’t notifying you. Because these issues are still possible, albeit during a large load surge, it’s important to continuously monitor your loggingn as you wouldn’t want an issue to occur during an important event.
Logging adds a new dependency to the application
Logging can fail to log the data – there could be several reasons why
When this happens you won’t be notified about the original problem or a logging failure and the performance issues will compound
This would never happen to your application, would it?