Chapter 9. Managing Crises and Escalations


There is no instance of a country having benefitted from prolonged warfare.

—Sun Tzu

You’ve probably never seen the 404 (page not found) error on Etsy’s site (www.etsy.com). The error page is a cartoon that depicts a woman knitting a three-armed sweater (Figure 9.1). Etsy is the de facto marketplace for handmade or vintage items, and as such it seems only fitting that the company commissioned an actual three-armed sweater that it uses to reward the engineer who most spectacularly brings down the site. While this fun fact is indicative of the amazing culture that thrives in Etsy’s technology teams, what’s more interesting about this 404 page is the site incident that it caused a few years ago. John Allspaw, Senior VP of Technical Operations at Etsy and author of The Art of Capacity Planning (O’Reilly Media, 2008) and Web Operations: Keeping the Data on Time (O’Reilly, 2010), tells the story. It all started with the removal of a CSS file. This particular CSS file supported outdated versions of the Internet Explorer browser (IE versions earlier than 9.0). Everything appeared normal for a few minutes after the removal of the file, but then CPU utilization on all 80 Web servers spiked to 100%.

Figure 9.1 Etsy’s 404 Error Page

Allspaw and his team have been incredibly focused on monitoring; they monitor more than 1 million time-based metrics such as registrations and payment types. All of this data gives the operations staff and software developers incredible insight into the systems. In Allspaw’s story, the monitors almost immediately began alerting that something was amiss. A combination of tech ops staff and software developers gathered in an IRC chat room as well as a physical conference room in Etsy’s Brooklyn office. The team formed hypotheses and quickly began vetting them. Anthony, the engineer who would eventually be awarded the three-armed sweater for this incident, determined that during the space of time between the rsync of the static assets and the execution of the code (i.e., the Smarty tag that checked asset version), an error was being thrown. The code in question was executed on every request, not just on requests from IE browsers, and it consumed all the available CPU cycles on all the servers.

Now that Allspaw’s team had identified the problem, the fix should have been easy: put back the missing CSS file. But not so fast: With all the CPU being consumed by the infinite loop, there wasn’t even enough CPU to allow an SSH connection to the servers. The administrators connected to the iLO (Integrated Lights-Out) cards to reboot the servers, but as soon as the servers came back online, the CPU spiked again. The team started pulling servers out of the load balancer and then rebooting them, which gave them time to SSH into the servers and replace the CSS file. The administrators and network engineers were able to do this so quickly because Allspaw’s team had practiced synchronization and sequencing exercises many times before this incident. Because they already had a communication plan in place, they were able to communicate with their customers during the entirety of this incident through both the site EtsyStatus (http://etsystatus.com/) and the Twitter account @etsystatus. These communications reassured customers that Etsy was working on the problem.

Etsy is one of the best-prepared and best-executing teams in the industry. But even this execution prowess and preparedness doesn’t keep the company from having issues and needing to award the three-armed sweater to an engineer. Etsy understands how critical it is to respond quickly and appropriately to incidents when they happen—and take our word for it, they will happen. In this chapter, we discuss how to handle major crises both during the incident and afterward.

What Is a Crisis?

The Merriam-Webster Dictionary defines a crisis as “the turning point for better or worse in an acute disease or fever” and “a paroxysmal attack of pain, distress or disordered function.” A crisis can be both cathartic and galvanizing. It can be your Nietzsche event, allowing you to rise from the ashes like the mythical Phoenix. It can in one fell swoop fix many of the things we described in Chapter 6, Relationships, Mindset, and the Business Case, and force the company to focus on scale. If you’ve arrived here and take the right actions, your organization can become significantly better.

How do you know if an incident has risen to the level sufficient to qualify it as a crisis? To answer this question, you need to know the business impact of an incident. Perhaps a 30- to 60-minute failure beginning at 1 a.m. is not really a crisis situation for your company, whereas a 3-minute failure at noon is a major crisis. Your business may be such that you make 30% of your annual revenue during the three weeks surrounding Christmas. As such, downtime during this three-week period may be an order of magnitude more costly than downtime during the remainder of the year. In this case, a crisis situation for you may be any downtime between the first and third weeks of December, whereas at any other point during the year you are willing to tolerate 30-minute outages. Your business may rely upon a data warehouse that supports hundreds of analysts between the hours of 8 a.m. and 7 p.m., but has nearly no usage after 7 p.m. or anytime during weekends. A crisis for you in this case may be any outage during working hours that would idle your analysts.

Of course, not all crises are equal, and obviously not everything should be treated as a crisis. Certainly, a brownout of activity on your Web site for 3 minutes Monday through Friday during “prime time” (peak utilization) is more of a crisis than a single 30-minute event during relatively low user activity levels. Our point here is that you must determine the unique crisis threshold for your company. Everything that exceeds this threshold should be treated as a crisis. Losing a leg is absolutely worse than losing a finger, but both require immediate medical attention. The same is true with crises; after the predefined crisis threshold is passed, each crisis should be approached the same way.
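A crisis threshold of this kind can be written down as explicit rules so that nobody has to debate it during an incident. The following is a minimal sketch in Python, assuming hypothetical rules borrowed from the examples above (no tolerated downtime during a December peak season, and a 30-minute tolerance at other times). The names Incident, in_peak_season, and is_crisis, and the specific dates, are illustrative assumptions rather than a prescribed policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    start: datetime       # when customer impact began
    duration: timedelta   # how long the impact has lasted so far


def in_peak_season(moment: datetime) -> bool:
    """Hypothetical peak window: the first three weeks of December."""
    return moment.month == 12 and moment.day <= 21


def is_crisis(incident: Incident,
              off_peak_tolerance: timedelta = timedelta(minutes=30)) -> bool:
    """Return True when the incident exceeds the (assumed) crisis threshold."""
    if in_peak_season(incident.start):
        # Assumption: any downtime during the peak season is a crisis.
        return incident.duration > timedelta(0)
    # Outside the peak season, outages up to the tolerance stay ordinary incidents.
    return incident.duration > off_peak_tolerance
```

Under these assumed rules, a 3-minute outage on December 10 is a crisis, while a 30-minute outage in June is not. Your own thresholds will differ; the point is that they are decided in advance.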

Recall from Chapter 8, Managing Incidents and Problems, that recurring problems (those problems that occur more than once) rob you of time and destroy your ability to scale your services and scale your organization. Crises also ruin scale, because they steal inordinately large amounts of resources. Allowing the root cause of a crisis to surface more than once will not only waste vast resources and keep you from scaling your organization and services, but also carries the risk of destroying your business.

Why Differentiate a Crisis from Any Other Incident?

You can’t treat a crisis like any normal incident, because it won’t treat you the way any normal incident would treat you. This is the time to restore service faster than ever before, and then continue working to get to the real root causes (problems) of said incident. With the clock ticking, customer satisfaction, future revenue, and even business viability are on the line.

Although we believe that there is a point at which adding resources to a project has diminishing and even negative returns, in a crisis you are looking for the shortest possible time to resolution rather than the efficiency or return on those resources. A crisis is not the time to think about future product delivery, as such thoughts and their resulting actions will merely increase the duration of the crisis. Instead, you need to lead by example and be at the scene of the crisis for as long as it is humanly possible and eliminate all other distractions from your schedule. Every minute that the crisis continues is another minute that destroys shareholder value.

Your job is to stop the crisis from causing a negative trend within your business. If you can’t fix it quickly by getting enough people on the problem to ensure that you have appropriate coverage, three things will happen. First, the crisis will perpetuate itself. Events will happen again and again, and you will lose customers, revenue, and maybe your business. Second, in allowing the crisis to siphon precious time out of your organization over a prolonged period, you will eventually lose traction on other projects. The very thing you were trying to avoid by not putting “all hands on deck” will happen anyway and you will have allowed the problem to go on longer than necessary. Third, you will lose credibility.

How Crises Can Change a Company

While crises are bad, if properly handled they can have significant long-term benefits to a company. The benefits are realized if, as a result of a crisis or series of crises, the company enacts long-term changes in its processes, culture, organizational structure, and architecture. In these cases, crisis serves as a catalyst for change.

Our job, of course, is to avert crises whenever possible given their cost to shareholders and employees. But when crises happen (and they will), we must adhere to the adage that “A crisis is a terrible thing to waste.” This adage indicates a strong desire to maximize the learning from a crisis through diligent postmortem assessments, and as such strengthen our people, processes, technologies, and architectures.

The eBay Scalability Crisis

As proof that a crisis can change a company, consider eBay in 1999. In its early days, eBay was the darling of the Internet. Until the summer of 1999, few, if any, companies had experienced such exponential growth in users, revenue, and profits. Throughout the summer of 1999, however, eBay experienced many outages, including a 20-plus hour outage in June. These outages were at least partially responsible for the reduction in stock price from a high in the mid-$20s in the week of April 26, 1999, to a low of $9.47 in the week of August 2, 1999.¹

1. August 4, 1999. http://finance.yahoo.com/q/hp?s=EBAY&a=07&b=2&c=1999&d=08&e=6&f=1999&g=d.

The cause of the outages isn’t really as important as what happened within eBay after the outages occurred. Additional executives were brought in to ensure that the engineering organization, the engineering processes, and the technology they produced could scale to the demand placed on them by the eBay community. Initially, additional capital was deployed to purchase systems and equipment (although eBay succeeded in lowering both its technology expense and capital on an absolute basis well into 2001). Processes were put in place to help the company design systems that were more scalable, and the engineering team was augmented with engineers experienced in high availability and scalable designs and architectures. Most importantly, the company created a culture of scalability. The lessons from the summer of pain are still discussed at eBay, and scalability has since become part of eBay’s DNA.

eBay continued to experience crises from time to time, but these crises were smaller in terms of their impact and shorter in terms of their duration as compared to the summer of 1999. Its implementation of a culture of scalability netted architectural changes, people changes, and process changes. One such change was eBay’s focus on managing each and every crisis in the fashion described in this chapter.

Order Out of Chaos

Bringing together and managing people from several different organizations during a crisis situation is difficult. Most organizations have their own unique subculture, and oftentimes those subcultures don’t speak the same technical language. It is entirely possible that an application developer will use terms with which a systems engineer is not familiar, and vice versa.

If left unmanaged, the collision of a large number of people from multiple organizations within a crisis situation will create chaos. This chaos will feed on itself, creating a vicious cycle that can actually prolong the crisis or—worse yet—aggravate the damage done in the crisis through someone taking an ill-advised action. Indeed, if you cannot effectively manage the force you throw at a crisis, you are better off using fewer people.

Your company may have a crisis management process that consists of both phone and chat (instant messaging or IRC) communications. If you listen on the phone or follow the chat session, you are very likely to see an unguided set of discussions and statements as different people and organizations go about troubleshooting. You may have questions asked that go unanswered or requests to try something that go without authorization. You might as well be witnessing a grade school recess, with different groups of children running around doing different things with absolutely no coordination of effort. But a crisis situation isn’t a recess—it’s a war, and in war such a lack of coordination results in an increased casualty rate through “friendly fire.” In a technology crisis, these friendly casualties are manifested as prolonged outages, lost data, and increased customer impact.

You must control the chaos. Rather than a grade school recess, you hope to see a high school football game. Ideally, you will witness a group of professionals being led with confidence to identify a path to restoration and subsequently a path to identification of root causes.

Different groups should have specific objectives and guidelines unique to their expertise. For all groups, there should be an expectation that they will report their progress clearly and succinctly in regular time intervals. Hypotheses should be generated, quickly debated, and either prioritized for analysis or eliminated as good initial candidates. These hypotheses should then be quickly restated as the tasks necessary to determine validity and handed out to the appropriate groups to work them, with times for results clearly communicated.

Someone on the call or in the crisis resolution meeting should be in charge, and that someone should be able to paint an accurate picture of the impact, the steps that have been tried, the best hypotheses being considered, and the timeline for completion of the current set of actions. Other members should be the managers of the technical teams assembled to help solve the crisis and one of the experienced (often titled senior, principal, or lead) technical people from each manager’s team. Other engineers should be gathered in organizational or cross-functional groups to deeply investigate domain areas or services within the platform experiencing a crisis. We will now describe these roles and positions in greater detail.

The Role of the Problem Manager

The preceding paragraphs have been leading up to a position definition. We can think of lots of names for such a position: outage commander, problem manager, incident manager, crisis commando, crisis manager, issue manager, and (from the military) battle captain. Whatever you call this person, you need someone who is capable of taking charge on the phone. Unfortunately, not everyone can fill this kind of a role. We aren’t arguing that you need to hire someone just to manage your major production incidents to resolution (although if you have enough incidents, you might consider doing just that); rather, you should ensure that at least one person on your staff has the skills to manage such a chaotic environment.

The characteristics of the person who is capable of successfully managing chaotic environments are rather unique. As with leadership, some people are born with the ability to make order out of chaos, whereas others build this skill over time. An even larger group of folks have neither the desire nor the skills to lead in a time of crisis. Putting these people into crisis leadership positions can be disastrous. The person leading the crisis team absolutely needs to be technically literate, but does not necessarily need to be the most technical person in the room. This individual should be able to use his technical base to formulate questions and evaluate answers relevant to the crisis at hand. The problem manager does not need to be the chief problem solver, but rather needs to effectively manage the process followed by the chief problem solvers gathered within the crisis. This person also needs to be incredibly calm “inside,” yet persuasive “outside.” This might mean that the problem manager has the type of presence to which people are naturally attracted, or it might mean that he isn’t afraid to yell to get people’s attention within the room or on the conference call.

The problem manager needs to be able to speak and think in business terms. He needs to be sufficiently conversant with the business model to make decisions in the absence of higher guidance on when to force incident resolution over attempting to collect data that might be destroyed and would be useful in problem resolution (remember the differences in definitions from Chapter 8). The problem manager also needs to be able to create succinct, business-relevant summaries from the technical chaos that’s occurring to keep the remainder of the business informed about the crisis team’s progress.

In the absence of administrative help to document the event, the problem manager is responsible for ensuring that the actions and discussions are represented in a written form for future analysis. Thus, the problem manager will need to keep a history of the crisis and ensure that others are keeping their own histories, to be merged later. A shared chat room with timestamps enabled is an excellent choice for this type of documentation.
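Merging the separately kept histories is mechanical as long as every entry carries a timestamp. The sketch below shows one possible way to do it in Python; the LogEntry structure and the merge_histories function are hypothetical names introduced for illustration, and the sketch assumes that all participants record timestamps against the same clock or time zone.

```python
import heapq
from datetime import datetime
from typing import Iterable, List, NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime   # when the action or statement occurred
    author: str           # who recorded it
    note: str             # what was done or decided


def merge_histories(*histories: Iterable[LogEntry]) -> List[LogEntry]:
    """Merge several personal histories into one timeline ordered by timestamp.

    Each history is sorted first, so entries may be supplied in any order.
    """
    sorted_histories = [sorted(h, key=lambda e: e.timestamp) for h in histories]
    return list(heapq.merge(*sorted_histories, key=lambda e: e.timestamp))
```

The merged timeline can then serve as the starting point for the postmortem.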

In terms of Star Trek characters and financial gurus, the ideal problem manager is one-third Scotty, one-third Captain Kirk, and one-third Warren Buffett. He is one-third engineer, one-third manager, and one-third business manager. He has a combat arms military background, an MBA, and an MS in some engineering discipline. Clearly, it will be difficult to find someone with the right blend of experience, charisma, and business acumen to perform such a function. To make the task even harder, when you find the person, he probably will not want the job, because it is a bottomless pool of stress. However, just as some people enjoy the thrill of being an emergency room physician, some operators enjoy leading teams through technical crises.

Although we flippantly suggested the MBA, MS, and military combat arms background, we were only half kidding. Such people actually do exist! As we mentioned earlier, the military has a role for people who manage its battles or what most of us would view as crises. The military combat arms branches attract many leaders and managers who thrive on chaos and are trained and have the personalities to handle such environments. Although not all former military officers have the right personalities, the percentage who do is significantly higher than in the rest of the general population. As a group, these leaders also tend to be highly educated, with many of them having at least one and sometimes multiple graduate degrees. Ideally, you would want someone who has proven himself both as an engineering leader and as a combat arms leader.

The Role of Team Managers

Within a crisis situation, a team manager is responsible for passing along action items to her teams and reporting progress, ideas, hypotheses, and summaries back to the crisis manager. Depending on the type of organization, the team manager may also be the “senior” or “lead” engineer on the call for her discipline or domain.

A team manager functioning solely in a management capacity is expected to manage her team through the crisis resolution process. The majority of this team will be located somewhere other than the crisis resolution (or “war”) room or on a call other than the crisis resolution call if a phone is being used. This means that the team manager must communicate and monitor the progress of the team as well as interact with the crisis manager. Although this may sound odd, the hierarchical structure with multiple communication channels is exactly what gives this process so much scale. This structured hierarchy affects scale in the following way: If every manager can communicate and control 10 or more subordinate managers or individual contributors, the capability in terms of human resources grows by orders of magnitude. The alternative is to have everyone communicate in a single room or through a single channel, which obviously doesn’t scale well. In such a scenario, communication becomes difficult and coordination of people becomes near impossible. People and teams quickly drown each other out in their debates, discussions, and chatter. Very little gets done in such a crowded environment.
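The arithmetic behind this claim is simple to check. The following Python sketch counts how many people sit beneath one crisis manager for a given span of control and number of layers; the function name people_coordinated is a hypothetical one, and the uniform span at every layer is a simplifying assumption.

```python
def people_coordinated(span: int, levels: int) -> int:
    """People reachable beneath one crisis manager.

    span   -- number of people each manager can communicate with and control
    levels -- number of layers beneath the crisis manager
    """
    if span < 1 or levels < 0:
        raise ValueError("span must be >= 1 and levels must be >= 0")
    # Layer 1 holds span people, layer 2 holds span**2, and so on.
    return sum(span ** depth for depth in range(1, levels + 1))
```

With a span of 10, one layer reaches 10 people, two layers reach 110, and three layers reach 1,110, while no individual ever has to follow more than two channels.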

Furthermore, this approach of having managers listen and communicate on two channels has proved very effective for many years in the military. Company commanders listen to and interact with their battalion commanders on one channel, and issue orders and respond to multiple platoon leaders on another channel (the company commander is in the upper-left portion of Figure 9.2). The platoon leaders then do the same with their platoons; each platoon leader speaks to multiple squads on a frequency dedicated to the platoon in question (see the center of Figure 9.2). Thus, although it might seem a bit awkward to have someone listen to two different calls or be in a room while issuing directions over the phone or in a chat room, this concept has worked well in the military since the advent of radio. It is not uncommon for military pilots to listen to four different radios at one time while flying the aircraft: two tactical channels and two air traffic control channels. In our consulting work, we have employed this approach successfully in several companies.

Figure 9.2 Military Communication

The Role of Engineering Leads

Each engineering discipline or engineering team necessary to resolve the crisis should have someone capable of both managing that team and answering technical questions placed within the higher-level crisis management team. This person serves as the lead individual investigator for her domain experience on the crisis management call and is responsible for helping the higher-level team vet information, clear ideas, and prioritize hypotheses. This person can also be on both the calls of the organization she represents and the crisis management call or conference, but her primary responsibility is to interact with the other senior engineers and the crisis manager to help formulate appropriate actions to end the crisis.

The Role of Individual Contributors

Individual contributors within the teams assigned to the crisis management call or conference communicate on separate chat and phone conferences or reside in separate conference rooms. They are responsible for generating and running down leads within their teams and work with the lead or senior engineer and their manager on the crisis management team. The individual contributor and his teams are additionally responsible for brainstorming potential problems that might be causing the incident, communicating them, generating hypotheses, and quickly proving or disproving those hypotheses. The teams should be able to communicate with the other domains’ teams either through the crisis management team or directly. All statuses should be communicated to the team manager, who is responsible for communicating this information to the crisis management team.

Communications and Control

Shared communication channels are a must for effective and rapid crisis resolution. Ideally, the teams will be relocated to be near one another at the beginning of a crisis. That means that the members of the lead crisis management team are in the same room and that all of the individual teams supporting the crisis resolution effort are located with one another to facilitate rapid brainstorming, hypothesis resolution, distribution of work, and status reporting. Too often, however, crises happen when people are away from work; because of this, both synchronous voice communication conferences (such as conference bridges on a phone) and asynchronous chat rooms should be employed.

The voice channel should be used to issue commands, stop harmful activity, and gain the attention of the appropriate team. It is absolutely essential that someone from each of the teams listens on the crisis resolution voice channel and simultaneously controls his or her team. In many cases, two representatives—the manager and the senior (or lead) engineer—should be present from each team on such a call. They serve as the command and control channel in the absence of everyone being in the same room. All shots are called from here, and this channel serves as the temporary change control authority and system for the company. The authority to do anything other than perform nondestructive “read” activities like investigating logs is first approved within this voice channel or conference room to ensure that two activities do not compete with each other and either cause system damage or result in an inability to determine which action “fixed” the system.
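The rule that only one approved change is in flight at a time can be illustrated with a small sketch. The Python class below is a hypothetical model, not a tool described in this chapter: it assumes that read-only work never needs approval and that the crisis manager approves a mutating action only when no other approved action is still in progress.

```python
from typing import Optional


class ChangeControl:
    """Tracks the single mutating action currently approved on the voice channel."""

    def __init__(self) -> None:
        self.active_change: Optional[str] = None

    def request(self, team: str, action: str, read_only: bool) -> bool:
        """Return True if the team may proceed with the action."""
        if read_only:
            # Nondestructive work such as reading logs is always allowed.
            return True
        if self.active_change is not None:
            # Another change is in flight; a second one would make it
            # impossible to tell which action fixed (or damaged) the system.
            return False
        self.active_change = f"{team}: {action}"
        return True

    def complete(self) -> None:
        """Record that the approved change has finished and been evaluated."""
        self.active_change = None
```

In practice the approval is a spoken decision by the crisis manager; the sketch only makes the underlying rule explicit.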

The chat or IRC channel is used to document all conversations and easily pass around commands to be executed so that time isn’t wasted in communication. Commands that are passed around can be cut and pasted for accuracy. Additionally, the timestamps within the IRC or chat can be used in follow-up postmortem sessions. The crisis manager is responsible for ensuring not only that he puts his own notes in the chat room and writes his decisions in the chat room for clarification, but also that status updates, summaries, hypotheses, and associated actions are put into the chat room.

In our experience, it is absolutely essential that both the synchronous voice and asynchronous chat channels remain open and available during any crisis. The asynchronous nature of chat allows activities to go on without interruption and allows individuals to monitor overall group activities between the tasks within their own assigned duties. Through this asynchronous method, scale is achieved while the voice channel allows for immediate command and control of different groups for immediate activities. Should everyone be in one room, there is no need for a phone call or conference call other than to facilitate experts who might not be on site and to give updates to the business managers. But even when everyone is present in the same room, a chat room should be opened and shared by all parties. In the case where a command is misunderstood, it can be buddy checked by all other crisis participants and even “cut and pasted” into the shared chat room for validation. The chat room allows actual system or application results to be shared in real time with the remainder of the group, and an immediate log with timestamps is generated when such results are cut and pasted into the chat.

The War Room

Phone conferences are a poor but sometimes necessary substitute for the “war room” or crisis conference room. So much more can be communicated when people are in a room together, as body language and facial expressions can be highly meaningful in a discussion. How many times have you heard someone say something, but when you read or looked at the person’s face you realized he was not convinced of the validity of his statement? Perhaps the person was not actually lying, but rather was passing along some information that he did not wholly believe. For instance, someone says, “The team believes that the problem could be with the login code,” but the scowl on her face shows that something is wrong. A phone conversation would not pick up on that discrepancy, but you have the presence of mind in person to question the team member further. She might answer that she doesn’t believe this scenario is possible given that the login code hasn’t changed in months, which might lower the priority for this hypothesis’s investigation. Alternatively, she might respond, “We just changed that damn thing yesterday,” which would increase the prioritization for investigation.

In the ideal case, the war room is equipped with phones, a large table space, terminals capable of accessing systems that might be involved in the crisis, plenty of work space, projectors capable of displaying key operating metrics or any person’s terminal, and lots of whiteboard space. Although the inclusion of a whiteboard might initially appear to be at odds with the need to log everything in a chat room, it actually supports chat activities by allowing graphics, symbols, and ideas best expressed in pictures to be drawn quickly and shared. These concepts can then be reduced to words and placed in chat, or a picture of the whiteboard can be taken and sent to the chat members. Many new whiteboards even have systems capable of reducing their contents to pictures immediately. Should you have an operations center, the war room should be close to that to allow easy access from one area to the next.

You might think that creating such a war room would be a very expensive proposition. “We can’t possibly afford to dedicate space to a crisis,” you might say. Our answer is that the war room need not be expensive or dedicated to crisis situations. It simply needs to be given priority in any crisis. As such, any conference room equipped with at least one and preferably two or more phone lines will do. Moreover, the war room is useful for the “ride along” situation described in Chapter 6. If you want to make a good case for why you should invest in creating a scalable organization, scalable processes, and a scalable technology platform, invite some business executives into a well-run war room to witness the work necessary to fix scale problems that result in a crisis. One word of caution here: If you can’t run a crisis well and make order out of its chaos, do not invite people into the fray. Instead, focus your time on finding a leader and manager who can run such a crisis and then invite other executives when you feel more confident about your crisis management system.

Tips for a Successful War Room

A good war room has the following:

• Plenty of whiteboard space

• Computers and monitors with access to the production systems and real-time data

• A projector for sharing information

• Phones for communication to teams outside the war room

• Access to IRC or chat

• Workspace for the number of people who will occupy the room

War rooms tend to get loud, and the crisis manager must maintain control within the room to ensure that communication is concise and effective. Brainstorming can and should be used, but limit communication during discussion to one individual at a time.

Escalations

Escalations during crisis events are critical for several reasons. The first and most obvious is that the company’s job is to maximize shareholder value, which means ensuring that value isn’t destroyed in these events. As such, the CTO, the CEO, and other executives need to hear quickly of issues that are likely to take significant time to resolve or have a significant negative customer impact. In a public company, it’s even more important that the senior executives know what is going on, because shareholders demand that they know about such things; indeed, public-facing statements may need to be made. Moreover, well-informed executives are more likely to be able to marshal all of the resources necessary to bring a crisis to resolution, including customer communications, vendors, partner relationships, and so on.

The natural tendency for engineering teams is to believe that they can solve the problem without outside help or without help from their management teams. That may be true, but solving the problem isn’t enough; it needs to be resolved in the quickest and most cost-effective way possible. Often, that will require more than the engineering team can muster on its own, especially if third-party providers are to blame for the incident. Moreover, communication throughout the company is important, because your systems are either supporting critical portions of the company or, in the case of Web companies, they are the company. Someone needs to communicate with shareholders, partners, customers, and maybe even the press. People who aren’t involved in fighting the crisis are the best options to handle that communication.

Think through your escalation policies and get buy-in from senior executives before you have a major crisis. It is the crisis manager’s job to adhere to those escalation policies and get the right people involved at the time defined in the policies, regardless of how quickly the problem is likely to be solved after the escalation.
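One way to make such a policy concrete is to write it down as data that the crisis manager can check against the clock. The following is a minimal sketch in Python; the roles, the time thresholds, and the names `EscalationStep`, `POLICY`, and `due_escalations` are all hypothetical and stand in for whatever your executives actually agree to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # elapsed crisis time at which this step becomes due
    notify: str         # role to bring in (hypothetical role names)


# Hypothetical thresholds; real values come from the agreed escalation policy.
POLICY = [
    EscalationStep(0, "crisis manager"),
    EscalationStep(15, "VP of engineering"),
    EscalationStep(30, "CTO"),
    EscalationStep(60, "CEO"),
]


def due_escalations(elapsed_minutes, already_notified):
    """Return roles whose threshold has passed and who have not been notified.

    The expected time to resolution is deliberately not an input: the policy
    fires on elapsed time alone, however close a fix may seem.
    """
    return [
        step.notify
        for step in POLICY
        if step.after_minutes <= elapsed_minutes
        and step.notify not in already_notified
    ]
```

Thirty-five minutes into a crisis in which only the crisis manager has been notified, `due_escalations(35, {"crisis manager"})` returns the VP of engineering and the CTO under these assumed thresholds.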

Status Communications

Status communications should happen at predefined intervals throughout the crisis and should be posted or communicated in a somewhat secure fashion, so that the organizations needing information on resolution time can get the information they need to take the appropriate actions. Status communications differ from escalation. Escalation is undertaken to bring in additional help as time drags on during a crisis, whereas status communications are made to keep people informed. Using the RASCI framework, you escalate to R, A, S, and C personnel, and you post status communication to I personnel.
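The RASCI split described above can be sketched as a simple lookup. This is an illustrative Python example only; the people listed in `RASCI` and the helper name `split_audiences` are assumptions, not part of any prescribed process.

```python
# Hypothetical RASCI assignments for one crisis; names are illustrative only.
RASCI = {
    "crisis manager": "R",
    "CTO": "A",
    "database vendor": "S",
    "security lead": "C",
    "customer support director": "I",
    "sales director": "I",
}

# Escalations go to R, A, S, and C personnel; status goes to I personnel.
ESCALATE_TO = {"R", "A", "S", "C"}


def split_audiences(rasci):
    """Return (escalation audience, status audience) as two sets of names."""
    escalation = {person for person, role in rasci.items() if role in ESCALATE_TO}
    status = {person for person, role in rasci.items() if role == "I"}
    return escalation, status
```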

A status message should include the start time, a general update of actions since the start time, and the expected resolution time if known. This resolution time is important for several reasons. Perhaps you are supporting a manufacturing center, and the manufacturing manager needs to know if she should send her hourly employees home. Or perhaps you provide sales or customer support software in SaaS fashion, and those companies need to figure out what to do with their sales and customer support staff.
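The three elements of a status message can be captured in a small template so that no update goes out missing one of them. The sketch below is a minimal Python illustration; the function name `format_status` and the field layout are assumptions rather than a required format.

```python
from datetime import datetime


def format_status(start, actions, expected_resolution=None):
    """Build a plain-text status message.

    start: datetime at which the crisis began
    actions: list of short strings describing actions taken since the start
    expected_resolution: datetime, or None when no estimate exists yet
    """
    if expected_resolution is None:
        eta = "unknown"  # say so explicitly rather than omitting the field
    else:
        eta = expected_resolution.strftime("%Y-%m-%d %H:%M")
    lines = [
        f"Start time: {start:%Y-%m-%d %H:%M}",
        "Actions since start:",
        *[f"  - {action}" for action in actions],
        f"Expected resolution: {eta}",
    ]
    return "\n".join(lines)
```

Printing "unknown" when there is no estimate keeps the field visible, so that readers such as the manufacturing manager know that no resolution time has been promised yet.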

Your crisis process should clearly define who is responsible for communicating to whom, but it is the crisis manager’s job to ensure that the timeline for communications is followed and that the appropriate communicators are properly informed. A sample status email is shown in Figure 9.3.

Figure 9.3 Status Communication

Crisis Postmortem and Communication

Just as a crisis is an incident on steroids, so a crisis postmortem is a juiced-up postmortem. Treat this postmortem with extra-special care. The systems that you helped create and manage have just caused a huge problem for a lot of people. This isn’t the time to get defensive; it is the time to be reborn. Recognize that this meeting will fulfill or destroy the process of turning around your team, setting up the right culture, and fixing your processes.

Absolutely everything should be evaluated. The very first crisis postmortem is referred to as the “master postmortem,” and its primary task is to identify subordinate postmortems. It is not intended to resolve or identify all of the issues leading to the incident, but rather to identify the areas that subordinate postmortems should address. For example, you might have postmortems focused on technology, process, and organization failures. You might have several postmortems on technology covering different aspects, one on your communication process, one on your crisis management process, and one on why certain organizations didn’t contribute appropriately early on in the crisis.

Just as you had a communication plan during your crisis, so you must have a communication plan that remains in effect until all postmortems are complete and all problems are identified and solved. Keep all members of the RASCI chart updated, and allow them to update their organizations and constituents. This is a time to be completely transparent. Explain, in business terms, everything that went wrong and provide aggressive but achievable dates in your action plan to resolve all problems. Follow up with communication in your staff meeting, your boss’s staff meeting, and/or the company board meeting. Communicate with everyone else via email or whatever communication channel is appropriate for your company. For very large events where morale might be impacted, consider conducting a company all-hands meeting, to be followed by weekly updates via email or on a blog.

A Note on Customer Apologies

When you communicate with your customers, buck the recent trend of apologizing without actually apologizing and try sincerity instead. Actually mean that you are sorry for disrupting their businesses, their work, and their lives! Too many companies use the passive voice, point fingers in other directions, or otherwise misdirect customers as to the true root cause. If you find yourself writing something like “Our company experienced a brief 6-hour downtime last week and we apologize for any inconvenience that this may have caused you,” stop right there and try again. Try the first-person singular “I” instead of the collective “we,” drop the “may” and “brief,” acknowledge that you messed up what your customers were planning on doing with your application, and get your message posted immediately.

It is very likely that your crisis will have significantly affected your customers. Moreover, this negative customer impact is not likely to have been the fault of the customer. Acknowledge your mistakes and be clear about what you plan to do to ensure that they do not happen again. Your customers will appreciate your forthrightness, and assuming that you can make good on your promises, you are more likely to have happy and satisfied customers over the long term.

Conclusion

Not every incident is created equal; some incidents require significantly more time to truly identify and solve all of the underlying problems. You should have a plan to handle such crises from inception to end. The end of the crisis management process is the point at which all problems identified through postmortems have been resolved.

The technology team is charged with responding to, resolving, and handling the problem management aspects of a crisis. The roles on this team include the problem manager/crisis manager, engineering managers, senior engineers/lead engineers, and individual contributor engineers from each of the technology organizations.

Four types of communication are necessary in crisis resolution and closure: internal communications, escalations, status reports during the crisis, and status reports after it. Handy tools for crisis resolution may also be employed, including conference bridges, chat rooms, and the war room concept.

Teams that either rarely see crises or are new to the process should consider drilling in crisis management. Practice what each person would do so that when an actual crisis occurs, as it inevitably will, the team is amply prepared to manage it effectively.

Key Points

• Crises are incidents on steroids that can either make your company stronger or kill your business. Crises, if not managed aggressively, will destroy your company’s ability to scale its customers, its organizational structure, and its technology platform and services.

• To resolve crises as quickly and cost-effectively as possible, you must contain the chaos with some measure of order.

• The leaders who are most effective in crises are calm on the inside but capable of forcing and maintaining order throughout the crisis management process. They must also have business acumen and technical experience.

• The crisis resolution team consists of the crisis manager, engineering managers, and senior engineers. In addition, teams of engineers reporting to the engineering managers are employed.

• The role of the crisis manager is to maintain order and follow the crisis resolution, escalation, and communication processes.

• The role of the engineering manager is to manage her team and provide status to the crisis resolution team.

• The role of the senior engineer from each engineering team is to help the crisis resolution team create and vet hypotheses regarding cause and to help determine the most appropriate rapid resolution approaches.

• The role of the individual contributor engineer is to participate in his team and identify rapid resolution approaches, create and evaluate hypotheses about the cause of the crisis, and provide status information to his manager on the crisis resolution team.

• Communication between crisis resolution team members should happen face to face in a crisis resolution or war room; when face-to-face communication isn’t possible, the team should use a conference bridge on a phone. A chat room should also be employed.

• War rooms, which are ideally sited adjacent to operations centers, should be developed to help resolve crisis situations.

• Escalations and status communications should be defined before a crisis and followed during it. After a crisis, the crisis process should provide status updates at periodic intervals until all root causes are identified and fixed.

• Crisis postmortems should be vigorous examinations that identify and manage a series of follow-ups in the form of subordinate postmortems that thematically attack all issues identified in the master postmortem.
