Unfortunately, a lack of visibility into the application layer made this a difficult task. The development team deployed new code to production every two weeks to support changing promotions, and these releases often had a severe impact on application performance. Without insight into what was causing the bottlenecks, Narain and the development team could not effectively triage outages and slowdowns.
When an outage occurred, it was up to Narain's team to find the problem in the logs. With about 50 Tomcat instances, this exercise was extremely time-consuming: it often took 20-30 employee hours just to locate the point of failure, and even then the problem was far from resolved. “We usually attempted to solve problems just by adding more resources - bringing in more Tomcat instances, or raising the number of connections to the databases,” Narain said.
These performance problems hurt both the customers who used the website and the call center agents who placed orders over the phone; the slower the request, the longer it took to place an order and the longer the phone call. “The average talk time on a call should be about two minutes and 20 seconds. Ours was 2:25-2:35, which may not sound like that much more, but ten seconds in this industry is huge,” Narain said. “Ten seconds can mean up to 20-30 less agents helping customers during our peak.”
Call length was further impacted when developers released new code into production. “Almost every Friday evening we would either see a slowdown - talk times would usually be 2:47-2:50 - or something would crash,” Narain added.
While the crashes usually affected only a couple of agents, they could sometimes be much more damaging. “We had a couple of incidents that were pretty costly,” Narain said. “One outage took down our entire call center for one hour on a Friday night. That outage probably cost us $100,000-150,000. That's what pushed us to really understand what was happening.”