Roku was originally created as a device to stream Netflix to your TV, but it has since expanded to stream 600+ channels from various content providers. With several million users in the US and around the world, Roku’s software must work seamlessly around the clock. Nils Pommerien, Manager of Network Engineering at Roku, is responsible for application performance and making sure Roku’s end user experience is second to none.
Before purchasing AppDynamics, Pommerien had very little visibility into the performance of his applications running in production. Pommerien and his team used tools like Nagios and the log4net framework to monitor the health of their applications, but this was not a very efficient way to solve problems.
“About a year ago, we had a performance problem that would occasionally lock up one of our production nodes,” Pommerien said. “There was some convoluted set of behaviors in our application that basically caused an infinite loop in our code.”
Every so often a user would set off this process, and Pommerien’s team would eventually have to restart the affected server. “We’d recycle the app pool, but we were getting no data about the problem, so solving it was tricky,” he said. “We had five engineers in a room and ten theories about what was going on.”
Ultimately Pommerien and his team decided to attach profilers to each production node. When they saw one of the cores begin to lock up, they began collecting data with the profiler, and soon they found the line of code that was being executed over and over.
“It was a one-line problem,” Pommerien said. “It took us 10 days between noticing the problem and getting out a fix. This would have taken five or ten minutes to resolve with AppDynamics.”
Pommerien and his team eventually decided that they needed more visibility into their application, over and above what log files and Nagios could offer. Pommerien was responsible for the performance of one application that relied heavily on third-party service providers for content, and the performance of these providers’ websites was variable. Visibility into these web service calls could not only help Pommerien ensure that these providers were meeting their SLAs, but also help him protect Roku’s brand from their performance problems.
Furthermore, because Roku is primarily a hardware manufacturer, its release schedule is a little different from that of a typical web application. The version of software that is flashed onto a device to be shipped must perform perfectly. If there is a problem brought about by an obscure use case, Pommerien must be able to quickly identify and fix it.
To solve these problems, Pommerien and his team decided to buy AppDynamics.
One of Roku’s third-party providers recently experienced a 24-hour outage. A week later, the provider told partners such as Roku that its systems were fully functional again. However, Pommerien began to notice in AppDynamics that this wasn’t entirely true. “It seemed that every once in a while an API call would simply fail,” he said. “So we put together some graphs from AppDynamics to show all transactions with this particular provider over the past week. Out of 10,000 API calls, 200 failed completely.”
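To put the quoted figures in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `failure_rate` is hypothetical, and the only inputs taken from the article are the 10,000 calls and 200 failures that Pommerien cites.

```python
def failure_rate(total_calls: int, failed_calls: int) -> float:
    """Return the fraction of calls that failed (hypothetical helper)."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    if not 0 <= failed_calls <= total_calls:
        raise ValueError("failed_calls must be between 0 and total_calls")
    return failed_calls / total_calls


# Figures quoted in the article: 200 failures out of 10,000 API calls.
rate = failure_rate(total_calls=10_000, failed_calls=200)
print(f"failure rate: {rate:.1%}")      # failure rate: 2.0%
print(f"success rate: {1 - rate:.1%}")  # success rate: 98.0%
```

In other words, roughly one call in fifty was still failing after the provider had reported that its systems were fully functional.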
AppDynamics has also benefited Roku’s developers, who can now see, for the first time, how their code performs in production. “Our developers love it,” Pommerien said. “Now they can see what parts of their code execute the most, or the slowest.” His developers have also used AppDynamics to test how effective their implementation of memcached is, by comparing total requests, cache hits and connections to the database.
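The kind of comparison described here can be sketched as a small calculation. The example below is a minimal illustration in Python and is not Roku's or AppDynamics' actual method: the `CacheStats` class, its field names, and the sample numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Hypothetical counters for one measurement window."""
    total_requests: int   # requests that needed the data
    cache_hits: int       # requests answered from memcached
    db_connections: int   # requests that still reached the database

    def hit_ratio(self) -> float:
        """Fraction of requests served from the cache."""
        if self.total_requests == 0:
            return 0.0
        return self.cache_hits / self.total_requests

    def db_ratio(self) -> float:
        """Fraction of requests that still reached the database."""
        if self.total_requests == 0:
            return 0.0
        return self.db_connections / self.total_requests


# Made-up sample numbers, for illustration only.
stats = CacheStats(total_requests=1_000, cache_hits=900, db_connections=100)
print(f"cache hit ratio: {stats.hit_ratio():.0%}")   # cache hit ratio: 90%
print(f"database ratio:  {stats.db_ratio():.0%}")    # database ratio:  10%
```

A high hit ratio combined with a low database ratio would suggest the cache is absorbing most of the load. If the two ratios do not add up to roughly 100%, some requests are probably bypassing the cache or failing before they reach the database.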
Excluding the occasional issues with third-party service providers, Pommerien has not had any performance problems since deploying AppDynamics. Even so, he said he spends more time monitoring today than he did before. “Whenever I have a spare moment, in a boring meeting or wherever, I check in on AppDynamics,” he said. “I’m definitely spending more time monitoring performance than before, but I don’t see that as a bad thing. I know exactly what’s happening in my application all the time. That’s awesome.”