Call In a Specialist

It felt like half of forever (but was probably only half an hour) before the support engineer dialed in to the bridge. He explained that of the four servers that normally handle scheduling, two were down for maintenance over the holiday weekend and one of the others was malfunctioning for reasons unknown. To this day, I have no idea why they would schedule maintenance for that weekend of all weekends!

That left us with a huge imbalance in the sizes of the systems, as shown in the following figure. The sole scheduling server that remained could handle up to twenty-five concurrent requests before it started to slow down and hang. We estimated that right then the order management system was probably sending it ninety requests. Sure enough, when the on-call engineer checked the lone scheduling server, it was stuck at 100 percent CPU. He had gotten paged a few times about the high CPU condition but had not responded, since that group routinely gets paged for transient spikes in CPU usage that turn out to be false alarms. All the false positives had quite effectively trained them to ignore high CPU conditions.

images/case_study_living_space/scheduling.png

On the conference call, our business sponsor gravely informed us that marketing had prepared a new insert that hit newspapers Friday morning. The ad offered free home delivery for all online orders placed before Monday. The entire line, with fifteen people in a conference room on speakerphone and a dozen more dialed in from their desks, went silent for the first time in four hours.

So, to recap, we have the front-end system, the online store, with 3,000 threads on 100 servers and a radically changed traffic pattern. It’s swamping the order management system, which has 450 threads that are shared between handling requests from the front end and processing orders. The order management system is swamping the scheduling system, which can barely handle twenty-five requests at a time.
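To put rough numbers on that imbalance, here is a back-of-the-envelope sketch that uses only the figures quoted in this section. The variable names and the `overload_ratio` helper are made up for illustration, and the ratios are worst-case ceilings (every upstream thread calling downstream at once), not measurements taken during the incident.

```python
# Figures quoted in the text. The names are illustrative, not from any real system.
STORE_THREADS = 3000            # online store threads, across all servers
STORE_SERVERS = 100             # online store servers
ORDER_MGMT_THREADS = 450        # shared between front-end requests and order processing
SCHEDULING_LIMIT = 25           # concurrent requests the lone scheduling server could take
ESTIMATED_SCHEDULING_LOAD = 90  # our estimate of what order management was sending


def overload_ratio(upstream: int, downstream: int) -> float:
    """Upstream callers per unit of downstream capacity (worst case)."""
    return upstream / downstream


threads_per_store_server = STORE_THREADS // STORE_SERVERS
store_to_order_mgmt = overload_ratio(STORE_THREADS, ORDER_MGMT_THREADS)
order_mgmt_to_scheduling = overload_ratio(ORDER_MGMT_THREADS, SCHEDULING_LIMIT)
estimated_overload = overload_ratio(ESTIMATED_SCHEDULING_LOAD, SCHEDULING_LIMIT)

if __name__ == "__main__":
    print(f"store threads per server:          {threads_per_store_server}")
    print(f"store -> order management ceiling: {store_to_order_mgmt:.1f}x")
    print(f"order management -> scheduling:    {order_mgmt_to_scheduling:.1f}x")
    print(f"estimated scheduling overload:     {estimated_overload:.1f}x")
```

Each layer can offer far more concurrent work than the layer behind it can absorb, and the estimated ninety requests were already several times what the scheduling server could handle.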

And it’s going to continue until Monday. It’s the nightmare scenario. The site is down, and there’s no playbook for this situation. We’re in the middle of an incident, and we have to improvise a solution.
