Vital Signs

The incident had started about twenty minutes before Daniel called me. The operations center had escalated to the on-site team. David, the operations manager, had made the choice to bring me in as well. Too much was on the line for our client to worry about interrupting a vacation day. Besides, I had told them not to hesitate to call me if I was needed.

We knew a few things at this point, twenty minutes into the incident:

  • Session counts were very high, higher than the day before.

  • Network bandwidth usage was high but not hitting a limit.

  • Application server page latency (response time) was high.

  • Web, application, and database CPU usage were low—really low.

  • Search servers, our usual culprit, were responding well. System stats looked healthy.

  • Request-handling threads were almost all busy. Many of them had been working on their requests for more than five seconds.

In fact, the page latency wasn’t just high. Because requests were timing out, it was effectively infinite. The statistics showed us only the average of requests that completed. Response time is always a lagging indicator. You can only measure the response time on requests that are done. So whatever your worst response time may be, you can’t measure it until the slowest requests finish.

Requests that didn’t complete never got averaged in. Other than the long response time, which we already knew about since SiteScope was failing to complete its synthetic transactions, none of our usual suspects looked guilty.
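To make the lagging-indicator point concrete, here is a minimal sketch in Python. It is not taken from the monitoring tools in this story. The sample numbers, the `now` timestamp, and the function names are all hypothetical, chosen only to show how an average over completed requests can look healthy while stuck requests go uncounted.

```python
from statistics import mean

# Hypothetical sample data (seconds). These values are assumptions for
# illustration, not measurements from the incident.
completed = [0.2, 0.3, 0.25, 0.4]        # durations of requests that finished
in_flight_started_at = [0.0, 1.0, 2.0]   # start times of requests still running
now = 30.0                               # the moment we look at the dashboard


def completed_average(durations):
    """Average over finished requests only, as a typical latency graph reports."""
    return mean(durations)


def in_flight_ages(started_at, current_time):
    """Age of each unfinished request: a lower bound on its eventual duration."""
    return [current_time - t for t in started_at]


avg = completed_average(completed)
ages = in_flight_ages(in_flight_started_at, now)
oldest = max(ages)

print(f"average of completed requests: {avg:.4f}s")
print(f"oldest in-flight request so far: {oldest:.1f}s")
```

With these made-up numbers, the completed-request average stays well under half a second while the oldest in-flight request has already been waiting thirty seconds. Tracking the age of unfinished requests alongside the average is one way to see trouble before the slow requests finish or time out.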

To get more information, I started taking thread dumps of the application servers that were misbehaving. While I was doing that, I asked Ashok, one of our rock-star engineers who was on-site in the conference room, to check the back-end order management system. He saw similar patterns on the back end as on the front end: low CPU usage and most threads busy for a long time.

It was now almost an hour since I got the call, or ninety minutes since the site went down. That meant not only lost orders for my client but also that we were coming close to missing our SLA for resolving a high-severity incident. I hate missing an SLA. I take it personally, as do all of my colleagues.
