Engineering A Better Way To Manage Performance

For competitive and confidentiality reasons, our customer has asked us not to identify them by name.

When your business is to execute major engineering and construction projects around the world that add up to $3 billion in annual revenue, software plays a critical role in managing projects through their lifecycle — the revenue of the company depends on it. And like the projects themselves, those software systems tend to be large-scale and complex, in this case built largely but not exclusively on Oracle and IBM products. Unfortunately, these systems don’t provide adequate tools or visibility to quickly get to the root causes of issues — just “a lovely, horrific log file to dig through” in the words of the Systems Integrator, who is responsible for performance. But AppDynamics had the products that offered the visibility and functionality to keep the software running and keep the company focused on what it does best: engineering and building big, exciting projects.

Challenge 1

Monday mornings are bad enough. But when the contract management application was crashing just about every Monday morning, it was a serious problem. To add insult to injury, performance complaints were also piling up, typically on Fridays. Nothing on the server, network, or database indicated the cause of the slowness; logs indicated the Java heap was running out of memory, but there was no clue as to why. Even more perplexing, an identical system with an almost identical load wasn’t showing any of the same symptoms. And the problem had been going on for nearly six months.

Solution 1

The application’s memory usage was a “black box,” as described by the customer, and they needed a way to get a deeper look into it. They shopped around for an application or platform that would give them the visibility they needed, and landed on AppDynamics.

“After using AppDynamics to analyze our applications’ memory usage and performance side-by-side with the identical system that was running stably, it became very apparent that the problem lay with the garbage collection (GC) process of the JVM,” the customer explained. “Specifically, its inability to keep up with the creation and destruction of the objects in memory. In our case, there were no minor GCs being run, and major GCs were occurring on almost a per-minute basis, until ultimately the memory heap would run out and the application would crash.”

This diagnosis also explained the performance complaints: a major garbage collection is typically a stop-the-world event, meaning application processing is suspended while the GC takes place.
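For readers who want to see the kind of numbers behind such a diagnosis, here is a minimal sketch using the standard `java.lang.management` API that ships with the JDK. It is not the AppDynamics product and not the customer's method; the class name `GcSnapshot` and the helper `overheadPercent` are hypothetical names chosen for illustration. It prints how many collections each collector has run, how long they took, and what share of JVM uptime went to GC. Collector names and the split between minor and major collectors vary with the JVM and its GC settings.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper class for illustration; not part of any vendor product.
class GcSnapshot {

    // Share of elapsed time spent in GC, as a percentage (0 if elapsed is not positive).
    static double overheadPercent(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / elapsedMillis;
    }

    // Sum of the approximate accumulated collection time over all collectors in this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector: a young-generation collector whose count never
        // moves while an old-generation collector's count climbs steadily would
        // resemble the pattern the customer describes.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }

        // Current heap usage; getMax() returns -1 when no maximum is defined.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d MB, max=%d MB%n",
                heap.getUsed() >> 20, heap.getMax() >> 20);

        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC overhead since start: %.2f%%%n",
                overheadPercent(totalGcMillis(), uptimeMillis));
    }
}
```

Sampling these counters at intervals on two supposedly identical servers and comparing the deltas is one simple way to approximate the side-by-side view described above, though without the historical data or correlation analysis a monitoring platform provides.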

But this still didn’t explain why this system was bogging down and crashing while an identical one wasn’t. As it turns out, the two were not 100% identical. The AppDynamics correlation analysis tools pointed the finger at a particular set of (proprietary) integrations, which actually were not being run on the “identical” system.

“In the end, it just came down to telling that vendor, ‘you’re wrong, you’re causing the problem,’” the customer said, since they were able to show the AppDynamics data that precisely pinpointed the issue. “Fortunately, we could schedule that integration when our end-user typical daily load is not high, so it didn’t cause the GC to fall behind and ultimately run out the heap and crash the application.”

Challenge 2

In a separate incident, intermittent reports of slow performance were coming in from users of multiple applications on multiple servers. Like trying to recreate a noise for your auto mechanic, these intermittent issues can be some of the most difficult problems to diagnose. The one common denominator was they were all in the DMZ, the “demilitarized zone,” that buffer area between the internal network and the world. Everything was checked — server performance metrics including CPU and RAM utilization, routers, switches, firewalls, SQL calls. Everything appeared normal; of course, the noise won’t happen when the mechanic drives your car. That’s pretty much what the Java application vendor said: You’ll have to isolate the problem to an end-user experience, and then maybe we can help.

Solution 2

Now that AppDynamics was in place, it didn’t take long to find the problem.

“With AppDynamics, we were able to monitor all aspects of our application servers on a real-time basis for all tiers of the platform,” the customer said. “Add to this the ability to monitor multiple servers and applications simultaneously, along with the built-in correlation analysis tools, and finally the root cause of the slow end-user experience was easily determined. The firewall throughput was too small for the load coming from the DMZ into the internal network.”

AppDynamics diagnosis. New, higher-capacity firewall installed. Problem solved.

Benefits

Getting to the root cause of performance problems in sprawling, heterogeneous networks is a huge challenge, but one that software-driven businesses absolutely must master. AppDynamics provided this global engineering-construction firm with the visibility and analytical tools to solve perplexing, persistent issues even when the application vendors couldn’t or wouldn’t.

Ease of use and historical data are two AppDynamics benefits this customer cites in particular.

“I fell in love with AppDynamics just because it was so easy to use out of the box,” the customer said. “I didn’t have to spend a lot of time tweaking it.”

“The best thing for me about AppDynamics is the ability to capture an issue historically,” the customer continued, “to give us the chance to use the time range window and then go and trace it all the way to the SQL call. Then we can look at other logs. You can’t touch that with the other tools that we have. I love how easy it is to get to that without having to do a lot of programming, coding, and filtering, to quickly resolve what the problem was.”