9 Wasting a perfectly good incident

This chapter covers

  • Conducting blameless postmortems
  • Addressing the mental models people have during an incident
  • Generating action items that further the improvement of the system

When something unexpected or unplanned occurs and creates an adverse effect on the system, I define that occurrence as an incident. Some companies reserve the term for large catastrophic events, but with this broader definition, you get to increase the learning opportunities on your team when an incident occurs.

As mentioned previously, at the center of DevOps is the idea of continuous improvement. Incremental change is a win in a DevOps organization. But the fuel that powers that continuous improvement is continual learning--learning about new technologies, existing technologies, how teams operate, how teams communicate, and how all these things interrelate to form the human-technical systems that are engineering departments.

One of the best sources for learning isn’t when things go right, but when they go wrong. When things are working, what you think you know about the system and what is actually true in the system aren’t necessarily in conflict. Imagine you have a car with a 15-gallon gas tank. For some reason, you think the gas tank is 30 gallons, but you have this habit of filling your gas tank after you burn through about 10 gallons. If you do this religiously, your understanding of the size of the gas tank never comes into conflict with the reality of the gas tank being only 15 gallons. You might make hundreds of trips in your car without ever learning a thing. But the minute you decide to take that long drive, you run into problems at 16 gallons. Before long, you realize the folly of your ways and start taking the appropriate precautions now that you have this newfound information.

Now there are a few things you can do with this information. You can dig deep to understand why your car ran out of gas at gallon 15, or you can just say, “Welp, I better start filling up every five gallons now, just to be safe.” You’d be amazed how many organizations opt to do the latter.

Many organizations don’t go through the mental exercise of understanding why the system performed the way it did and how it can improve. Incidents are a definitive way to prove whether your understanding of the system matches the reality. By not doing this exercise, you’re wasting the best parts of the incident. The failure to learn from such an event can be a disservice to future efforts.

The lessons from system failures don’t always come naturally. They often need to be coaxed out of the system and team members in an organized, structured fashion. This process is called by many names; after-action reports, incident reports, and retrospectives are just a few terms. But I use the term postmortem.

DEFINITION A postmortem is the process by which a team assesses the events that led up to an incident. The postmortem typically takes the format of a meeting with all of the relevant stakeholders and participants in the incident handling.

In this chapter, I discuss the process and structure of the postmortem, as well as how to gain a deeper understanding of your systems by asking probing questions about why engineers decided to take the actions they did.

9.1 The components of a good postmortem

Whenever there’s an incident of enough size, people begin to play the blame game. People try to distance themselves from the problem, erect barriers to information, and generally become helpful only to the point of absolving themselves of fault. If you see this happening in your organization, you likely live in a culture of blame and retribution: the response to an incident is to find those responsible for “the mistake” and to make sure that they’re punished, shamed, and sidelined appropriately.

Afterward, you’ll heap on a little extra process to make sure that someone must approve the type of work that created the incident. With a feeling of satisfaction, everyone will walk away from the incident knowing that this particular problem won’t happen again. But it always does.

The reason the blame game doesn’t work is that it attacks the people as the problem. If people had just been better trained. If more people had been aware of the change. If someone had followed the protocol. If someone hadn’t mistyped that command. And to be clear, these are all valid reasons things might go wrong, but they don’t get to the heart of why that activity (or lack thereof) created such a catastrophic failure.

Let’s take the training failure as an example. If the engineer wasn’t trained appropriately and made a mistake, you should ask yourself, “Why weren’t they trained?” Where would the engineer have gotten that training? Was the lack of training due to the engineer not having enough time? If they weren’t trained, why were they given access to the system to perform something they weren’t ready to perform?

The pattern with this other line of thinking is that you’re discussing problems in the system versus problems in the individual. If your training program is poorly constructed, blaming this engineer doesn’t solve the problem, because the next wave of hires might experience the same problem. Allowing someone who might not be qualified to perform a dangerous action might highlight a lack of systems and security controls in your organization. Left unchecked, your system will continue to produce employees who are in a position to make this same mistake.

To move away from the blame game, you must begin thinking about how your systems, processes, documentation, and understanding of the system all contribute to the incident state. If your postmortems turn into exercises of retribution, no one will participate, and you’ll lose an opportunity for continued learning and growth.

Another side effect of a blameful culture is a lack of transparency. Nobody wants to volunteer to get punished for a mistake they made. Chances are, they’re already beating themselves up about it, but now you combine that with the public shaming that so often accompanies blameful postmortems and you’ve now built in incentives for people to hide information about incidents or specific details about an incident.

Imagine an incident that was created by an operator making a mistake entering a command. The operator knows that if he admits to this error, some type of punishment will be waiting for him. If he has the ability to sit silently on this information, knowing that there’s punishment for the mistake, he’s much more likely to sit silently while the group spends a large amount of time attempting to troubleshoot what happened.

A culture of retribution and blamefulness creates incentives for employees to be less truthful. The lack of candidness hinders your ability to learn from the incident while also obfuscating the facts of the incident. A blameless culture, whereby employees are free from retribution, creates an environment much more conducive to collaboration and learning. With a blameless culture, the attention shifts from everyone attempting to deflect blame, to solving the problems and gaps in knowledge that led to the incident.

Blameless cultures don’t happen overnight. It takes quite a bit of energy from coworkers and leaders in the organization to create an environment in which people feel safe from reprisal and can begin having open and honest discussions about mistakes that were made and the environment in which they were made. You, the reader, can facilitate this transformation by being vulnerable and being the first to share your own mistakes with the team and the organization. Someone must always go first, and since you’re reading this book, that person is probably going to be you.

9.1.1 Creating mental models

Understanding how people look at systems and processes is key to understanding how failure happens. When you work with or are a part of a system, you create a mental model of it. The model reflects how you think the system behaves and operates.

DEFINITION A mental model is an explanation of someone’s thought process about how something works. The mental model might detail someone’s perception of the relationship and interaction between components, as well as how the behavior of one component might influence other components. A person’s mental models can often be incorrect or incomplete.

Unless you’re a total expert on that system, however, it’s reasonable to assume that your model has gaps in it. An example is that of a software engineer and their assumptions of what the production environment might look like. The engineer is aware that there’s a farm of web servers and a database server and a caching server. They’re aware of these things because those are the components that they touch and interact with on a regular basis, both in code and in their local development environments.

What they’re probably unaware of is all the infrastructure components that go into making this application capable of handling production-grade traffic. Database servers might have read replicas, and web servers probably have a load balancer in front of them and a firewall in front of that. Figure 9.1 shows an engineer’s model versus the reality of the system.

It’s important to acknowledge this discrepancy not just in computer systems, but in processes as well. The gap between expectations and reality is a nesting ground for incidents and failures. Use the postmortem as an opportunity to update everyone’s mental model of the systems involved in the failure.

Figure 9.1 The engineer’s mental model (top) versus reality (bottom)

9.1.2 Following the 24-hour rule

The 24-hour rule is simple: if you have an incident in your environment, you should have a postmortem about that incident within 24 hours. The reasons for this are twofold.

For starters, the details of the situation begin to evaporate as more time passes between when the incident occurs and when the incident is documented. Memories fade, and nuance gets lost. When it comes to incidents, nuance makes all the difference. Did you restart that service before this error occurred or after? Did Sandra implement her fix first or was it Brian’s fix first? Did you forget that the service crashed the first time Frank started it back up? What could that mean? All of these little details may not mean much when you’re attempting to figure out what solved the issue, but they definitely matter in terms of understanding how the incident unfolded and what you can learn from it.

Another reason to do the postmortem within 24 hours is to be sure that you’re leveraging the emotion and the energy behind the failure. If you’ve ever had a near-miss car accident, you become super alert following it. And that level of alertness and intensity will stick around for a certain amount of time. But sooner or later, you begin to fall back into your old habits. Before long, the sense of urgency has faded, and you’re back to driving without a hands-free unit and responding to text messages while you’re at stoplights.

Now imagine if you could instead use that short period of heightened awareness to put real controls in your car that prevent you from doing those poor or destructive actions in the first place. That’s what you’re trying to do with the 24-hour rule: seize the momentum of the incident and use it for something good.

When you have an incident, a lot of pent-up energy arises because it’s an event that’s out of the ordinary, and typically someone is facing pressure or repercussions from the failure. But the more time that passes, the more the sense of urgency begins to fade. Get the ball rolling on follow-up items to the incident within the first 24 hours while it’s still something of note to your team members.

Lastly, having the postmortem within 24 hours helps to ensure that a postmortem document gets created. Once the documents are created, they can be widely circulated for others to learn about the failure and can serve as a teaching tool for engineers of the future. Again, incidents have mounds of information in them, so documenting a failure in enough detail can go a long way toward training future engineers (or even serve as an interesting interview question for future candidates).

9.1.3 Setting the rules of the postmortem

As for any meeting, certain guidelines need to be set forth in order to have a successful postmortem. It’s important that you walk through these rules, in detail, prior to any postmortem meeting. The rules are designed to create an atmosphere of collaboration and openness.

Participants need to feel at ease admitting to gaps in their knowledge or their understanding of the system. There are plenty of reasons that team members might not feel comfortable sharing this lack of expertise. The company culture might shun those who display even the slightest hint of lacking complete expertise. The company culture might demand a level of perfection that’s unrealistic, leading to team members who make mistakes or who aren’t complete experts on a topic feeling inadequate.

It’s also not uncommon for team members, for reasons of their own, to have this feeling of inadequacy. These negative emotions and experiences are blockers to total learning and understanding. You need to do your best to try to put those emotions to rest. That’s the goal of these rules and guidelines:

  • Never criticize a person directly. Focus on actions and behaviors.

  • Assume that everyone did the best job they could with the information that was available to them at the time.

  • Be aware that facts may look obvious now but could have been obfuscated in the moment.

  • Blame systems, not people.

  • Remember that the ultimate goal is understanding all the pieces that went into the incident.

These rules will help focus the conversation on where it belongs--improving the system--and will hopefully keep you out of the finger-pointing blame game. It’ll be up to the meeting facilitator to ensure that these rules are always followed. If someone breaks the rule, even once, it can serve as a signal to the other participants that this is just like any other meeting where management is looking for a “throat to choke.”

9.2 The incident

It’s 1:29 a.m. The monitoring system has detected that one of the background-work queues has exceeded its configured threshold. Shawn, the operations on-call engineer, is sound asleep when he receives a page around 1:30 a.m. The alert reads, “Worker processing queues are unusually high.” When Shawn reads the page, it sounds more like a status than an actual problem. Based on what the alert says, he doesn’t feel there’s any risk in waiting for the alert to clear. He acknowledges the alert and snoozes it for 30 minutes, hoping that will be the end of it.

After 30 minutes, the alert pages again, only now the queue has grown even larger. Shawn isn’t sure what the alerting queue is used for. He’s aware that several background processing jobs operate off these worker queues, but each job consumes from a different queue. He opts to restart the report queue jobs he is aware of to see if that clears the problem. It doesn’t. Two queues are still reporting an extraordinarily large queue size. He confirms that these numbers are excessive by comparing the queue size to historical graphs of the same queue.

At this point, Shawn decides he needs to page an on-call development engineer. He navigates to the Confluence page where the on-call information is saved. To his dismay, the on-call engineer doesn’t have a phone number listed, just an email address. The on-call page doesn’t list who should be contacted in the event the primary on-call person isn’t available. Rather than start randomly dialing the phone numbers on the list, Shawn opts to escalate to his own manager for guidance. His manager logs in and assists with troubleshooting, but quickly exhausts his understanding of the system as well. The manager decides to escalate to the principal engineer.

The principal engineer receives the call and hops online to begin investigating. The consumer_daemon is a background processor responsible for processing one of the two queues identified earlier by Shawn. The principal engineer discovers that the consumer_daemon has not been running for a couple of hours. The queues continued to grow as other workers added to the queue, but with the consumer_daemon not running, nothing was taking messages off the queue and processing them. The engineer restarts the consumer_daemon, and processing begins to pick up. Within 45 minutes, the system is back to normal.

9.3 Running the postmortem

Running the postmortem meeting can be a bit of a grueling experience. It usually takes a mix of skills to get the most out of a postmortem. You don’t need full representation for it to be useful, but you’ll find there’s far more value in expanding the pool of participants to bring in diverse perspectives.

9.3.1 Choosing whom to invite to the postmortem

I’m going to start this section with some of the technical roles that should be in attendance. But whatever you do, please don’t think of the postmortem as a purely technical affair. A lot of context goes into some of the decision-making problems that occur during an incident. Even if they didn’t contribute directly to the incident, stakeholders often have an interest in understanding what happened, along with the age-old, inappropriate question of “How do we prevent this from ever happening again?”

The start of your invite list should be all of the people immediately involved with the incident recovery process. If they were involved in the recovery effort at all, I recommend they be in attendance. But in addition, other people could benefit from the postmortem who are typically overlooked.

Project managers

It’s not uncommon for project managers to have a vested interest in the incidents that are occurring in the environment. For starters, they’re almost always sharing technical resources with the day-to-day responsibilities of running a production environment. Understanding the underlying technical issues while also having a firsthand account of the impact on other projects can be beneficial.

Project managers can also communicate the impact that the incident had on existing projects and resources. Understanding the impact to other work helps you understand the ripple effects of an incident. It’s also not uncommon for a project manager’s timeline to have created some urgency around solving a problem. That sense of urgency could have led to a feature, product, or task being performed in extreme haste, opening the door to the failure conditions that led to the incident.

Business stakeholders

Business stakeholders may not fully understand all the technical jargon that gets thrown around during a postmortem, but they will home in on specific details that could shed some light on how incidents should be run in the future. Business stakeholders can translate what the technical details mean for the business and help frame the incident in terms of business outcomes.

An incident at 9 p.m. on a Tuesday, when user activity is relatively low, might seem like a low-impact incident. But the business can tell you that this particular Tuesday was the month-end closing process. Because of the outage, analysts couldn’t complete their work for the closing process, so bills will go out late, which means accounts receivable will be delayed, which can lead to a cash flow problem. This is a bit of an exaggeration, but it’s not too far from the realities that ill-timed incidents can create. Having the business stakeholder in the room can help to give context as well as transparency to the incident management process.

Human resources

This category is an interesting one, based on my own experience. I don’t recommend inviting HR to all your postmortems, but I’ve definitely done this when I knew one of the contributing factors was resources and staffing.

Having an HR representative listen to an incident unfold and all of the pain points that happen because you simply don’t have enough staff can be eye-opening. Choose your battles wisely, but I have definitely received additional headcount in the past after having an HR staff member listen to our inability to resolve an issue because the on-call rotation was too small and a key staff member was sleeping because of a long-running migration project the night before.

9.3.2 Running through the timeline

The timeline of the incident is a documented series of events that occurred during the incident. The postmortem will run much smoother if everyone can agree on the series of events that occurred, as well as the order and the time at which they occurred.

Whoever is running the postmortem should attempt to assemble a rough timeline prior to the meeting as a starting point. If you must construct the timeline from scratch in the meeting, you’ll spend an inordinate amount of time as everyone attempts to rack their brains remembering what happened. If the postmortem organizer creates a starting point, it serves as a prompt for all the participants in the incident. With those prompts being called out, smaller, more intricate details tend to rise to people’s memory a bit easier.

Detailing each event in the timeline

As the postmortem organizer, make sure each event on your timeline has a few bits of information:

  • What action or event was performed?

  • Who performed it?

  • What time was it performed?

The description of the action or event that was performed should be a clear, concise statement about what transpired. The details of the action or event should be devoid of any color, commentary, or motivations at this point. It should be purely factual--for example, “The payment service was restarted via the services control panel.” This is a clear statement of fact.

A poor example would be “The payment service was restarted incorrectly and unintentionally.” This adds arguable aspects to the action. Incorrect by whose standards? Where were those standards communicated? Was the person who performed the restart trained incorrectly? By removing these words of judgment, you can keep the conversation on track instead of derailing into a pedantic discussion on how the action has been categorized. That’s not to say that this color isn’t important; it is, and I’ll get to it momentarily. It’s just not beneficial for this part of the process.

Who performed the action is another fact to document. In some cases, the event might have been performed by the system itself and not by a user. For example, if the action or event is “The web server ran out of memory and crashed,” the person who performed the action would just be the name of the server.

How granular you get with that is up to you. I normally describe the who as the application and the component of the application it belongs to. The who might be “payments web server.” In some cases, you might want to be even more specific, such as “payments web server host 10.0.2.55,” in order to call out the node in detail, especially if multiple nodes of the same type are behaving in different ways that are contributing to the problem at hand. The detail will ultimately depend on the nature of the issue you’re dealing with.

If the who in this case is a person, you can note either the person’s name or that person’s role. For example, you might say “Norman Chan,” or you might simply say “systems engineer 1.” The value of using a role is that it prevents someone from feeling blamed in the postmortem document. If someone made an honest mistake, it feels a little punitive to repeat that person’s name over and over again in a document detailing that mistake and the problems it created.

Another reason for using a role or title is that the entry will maintain its usefulness over time. These documents become a matter of record that future engineers will hopefully use. An engineer three years from now may have no idea who Norman is or what his job role or function was. But knowing that a system engineer performed the actions specified sets the context around the change more clearly. Whereas I might ask if Norman was authorized to perform the action he took that’s detailed in the postmortem, I’m well aware that “systems engineer 1” was authorized to do it, because I’m aware of that role, its permissions, and its scope and responsibilities.

Lastly, detailing the time of the event is necessary for purposes of establishing when the events occurred and ensuring that the order of actions is well understood across the team.

With each event detailed in this manner, walk through the timeline, soliciting confirmation from the team on the specifics, as well as confirming that no other activity occurred in between that might have been missed. Circulating the timeline ahead of the meeting, giving people a chance to review it, helps to speed up this process a bit, allowing the timeline to be updated offline, but if you have to do it in the meeting, so be it. Now that the timeline has been established, you can begin to walk through specific items to gain clarity.

Here’s an example of how you would document the example incident from section 9.3:

  • At 1:29 a.m., the monitoring system detected that the background work queue was above configured thresholds.

  • At 1:30 a.m., the system paged the on-call operations engineer with an alert that read “Worker processing queues are unusually high.”

  • At 1:30 a.m., the on-call engineer acknowledged the alert and snoozed it for 30 minutes.
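Bullets like these work fine in a wiki. If you also want timeline entries in a machine-readable form for later reporting, a small structure like the following sketch captures the same three facts. Treat it as illustrative only; the field names and sample events are hypothetical, not a tool you’re required to build:

from dataclasses import dataclass
from datetime import datetime

# Each timeline entry records the three facts discussed above:
# what happened, who (person, role, or system component) did it, and when.
@dataclass
class TimelineEvent:
    when: datetime
    who: str
    what: str

timeline = [
    TimelineEvent(datetime(2021, 4, 1, 1, 29), "monitoring system",
                  "Detected background work queue above configured threshold"),
    TimelineEvent(datetime(2021, 4, 1, 1, 30), "on-call operations engineer",
                  "Acknowledged the alert and snoozed it for 30 minutes"),
]

# Sorting by time keeps the walk-through in order as new events are recalled.
for event in sorted(timeline, key=lambda e: e.when):
    print(f"{event.when:%I:%M %p} | {event.who} | {event.what}")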

Adding context to the events

With the timeline established, you can begin adding a bit of context to the events. Context provides additional details and motivations behind each of the events that occurred. Typically, you’ll need to add context only to events or actions that were performed by a human, but sometimes the system makes a choice, and that choice needs to be given clarity. For example, if a system has an emergency shutdown mode, explaining the context for why the system went into an emergency shutdown might be beneficial, especially when it’s not clear why such a drastic action had to be taken.

Context around events should be given with the motivation of understanding a person’s mental model of the situation at hand. Understanding the mental model will help explain why a decision was made and the underlying assumptions that went into that decision. Remember to respect the rules of the postmortem and avoid passing judgment on someone’s understanding or interpretation of how the system operates. Your goal is to learn, because chances are, if one person has a misunderstanding about it, then many people have a misunderstanding.

Take a look at each event and ask probing questions about the motivation behind that decision. Some ideas around probing questions are as follows:

  • Why did that feel like the right course of action?

  • What gave you that interpretation of what was happening in the system?

  • Did you consider any other actions, and if so, why did you rule them out?

  • If someone else were to perform the action, how would they have had the same knowledge you had at the moment?

When forming questions like this, you might notice something interesting: they don’t presume that the person’s action was right or wrong.

Sometimes in a postmortem you can learn just as much from someone performing the right action as the wrong action. For example, if someone decided to restart a service and that is the action that resolved the incident, it’s worthwhile to understand what the engineer knew that led them to that course of action. Maybe they knew that the failing tasks were controlled by this one service. Maybe they also knew that this service is prone to flakiness and sometimes just needs to be restarted.

That’s great, but then the question becomes, how do other engineers get that knowledge? If they did have that suspicion, how could they confirm it? Is it purely experience, or is there a way to expose that experience in a metric or dashboard that might allow someone to verify their suspicion? Or better, maybe there’s a way to create an alerting mechanism to detect that failed state and inform the engineer? Even when someone takes a correct action, understanding how they came to that correct action can be valuable.

Another example from a real-world incident occurred when a database statement as part of a deployment was running long. The engineer performing the deployment recognized that it was running long and started the troubleshooting process. But how did he know the command was running long? What was “long” in this context? When asked about that, it was because he had run the same statements as part of the staging environment deployment and had a rough idea of how long that took in the previous iteration. But what if he wasn’t the engineer to perform the production deployment? That context would be lost, and the troubleshooting effort might not have started for a great deal longer.

Getting to these sorts of assumptions is the heart of the postmortem process. How do you make improvements on sharing this expertise that people collect over the years and rely on heavily in their troubleshooting process? In the preceding example, the team decided that whenever a database statement for a deployment was run in staging, it would be timed and recorded in a database. Then when the same deployment ran in production, prior to executing the statement, the system would inform the deployment engineer of the previous timed runs in staging, giving this engineer some context for how long things should take. This is an example of taking something that was successful during an incident, understanding why it was successful, and making sure that the success can be repeated in the future.
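Here’s a minimal sketch of that idea, assuming the timings live in a small SQLite table; the table name and helper functions are hypothetical, not the actual tooling that team built:

import sqlite3
import time

# Hypothetical store for deployment statement timings. In staging, each
# statement is timed and recorded; in production, prior staging runs are
# surfaced before the statement executes.
conn = sqlite3.connect("deploy_timings.db")
conn.execute("""CREATE TABLE IF NOT EXISTS statement_timings
                (statement TEXT, environment TEXT, seconds REAL)""")

def record_timing(statement, environment, run):
    """Run a deployment step and record how long it took."""
    start = time.monotonic()
    run()
    conn.execute("INSERT INTO statement_timings VALUES (?, ?, ?)",
                 (statement, environment, time.monotonic() - start))
    conn.commit()

def report_staging_history(statement):
    """Before a production run, show how long the same statement took in staging."""
    rows = conn.execute(
        "SELECT seconds FROM statement_timings "
        "WHERE statement = ? AND environment = 'staging'",
        (statement,)).fetchall()
    if rows:
        avg = sum(r[0] for r in rows) / len(rows)
        print(f"This statement averaged {avg:.0f} seconds in staging; "
              "start investigating if production runs much longer.")
    else:
        print("No staging timings recorded for this statement.")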

An example from our incident might be asking the principal engineer why he opted to restart the consumer_daemon. Here’s what a sample conversation might look like in this scenario:

Facilitator: “Why did you decide to restart the consumer_daemon?”

Principal engineer (PE): “Well, when I logged on to the system, I recognized that one of the queues in question had a naming convention that corresponded to the consumer_daemon.”

Facilitator: “So all of the queues follow a naming convention?”

PE: “Yes, the queues are named in a structured format so you have an idea of who the expected consumer of that queue is. I noticed that the format suggested the consumer_daemon. I then looked for logs from the consumer_daemon and noticed there were none, which was another hint.”

Facilitator: “Oh. So, if the Ops engineer had known to look at the consumer_daemon logs, that would have been a sign when it came up empty?”

PE: “Well, not quite. The consumer_daemon was logging things, but there’s a particular log message I’d expect to see if it was performing work. The problem is, the log message is kind of cryptic. Whenever it processes a message, it reports about the update to an internal structure called a MappableEntityUpdateConsumer. I don’t think anyone but a developer would have made that connection.”

You can see from this conversation that specific knowledge exists inside the developer’s mind that was crucial to solving this problem. This information wasn’t generally available or well-known to the engineer or to the facilitator. This sort of back-and-forth about an action that the developer took correctly goes to the value of conducting these postmortems.

In the same light, understanding why someone made the wrong decision can also be valuable. The goal is to understand how they viewed the problem from their perspective. Understanding the perspective gives necessary context around an issue.

One year, I went to a Halloween party dressed as The Beast, from the popular story Beauty and the Beast. But when my wife wasn’t standing next to me in her outfit as Belle, people didn’t have the context, so they assumed I was a werewolf. But the moment my wife stood next to me, they instantly were given the missing context, and their perspective on my costume changed completely. Let’s look at what the conversation between the operations on-call engineer and the facilitator of the postmortem says about the context of his decision to acknowledge the alarm and not take action:

Facilitator: “You decided to acknowledge and snooze the alert when you first received it. What went into that decision?”

Ops engineer: “Well, the alert didn’t indicate an actual problem. It just said that the queues were backed up. But that can happen for any number of reasons. Plus, it was late at night, and I know that we do a lot of background processing at night. That work gets dumped into various queues and processed. I figured it might have just been a heavier night than usual.”

This conversation sheds some context on the Ops engineer’s perspective. Without that context, we might assume that the engineer was too tired to deal with the problem or was just generally trying to avoid it. But in the conversation, it’s clear that the engineer had a perfectly viable reason for snoozing the alert. Maybe the alert message should have been better crafted, indicating the potential impact in business terms instead of communicating just the general state of the system. This miscue led to an additional 30 minutes of wasted troubleshooting time, as well as a further buildup of items that needed to be processed, potentially increasing the recovery time.
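If the alert definition is something you control, the fix can be as simple as changing what the alert says. The following sketch is hypothetical (the queue name, threshold, and paging hook are all assumptions), but it shows an alert that leads with business impact and a suggested first step rather than just the system state:

# A hypothetical queue-depth alert that leads with impact and a first step,
# instead of just describing system state.
QUEUE_THRESHOLD = 50_000

def check_queue(queue_name: str, depth: int, page) -> None:
    if depth <= QUEUE_THRESHOLD:
        return
    page(
        f"{queue_name} depth is {depth} (threshold {QUEUE_THRESHOLD}). "
        "Messages are not being consumed, so nightly background jobs will be "
        "delayed and keep backing up. First step: confirm the consumer process "
        "for this queue is running."
    )

# Stand-in pager that just prints; a real one would page the on-call engineer.
check_queue("cd_queue", 72_431, page=print)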

It’s not uncommon for people to have a flawed view of the way they think a system behaves as compared to the way it actually behaves. This goes back to the concept of mental models introduced earlier. Let’s add a bit more context from the conversation:

Facilitator: “Did restarting consumer_daemon ever occur to you?”

Ops engineer: “Yes and no. Restarting consumer_daemon specifically didn’t occur to me, but I thought I had restarted everything.”

Facilitator: “Can you explain that in a bit more detail?”

Ops engineer: “The command I used to restart the services gives you a list of Sidekiq services that you can restart. It lists them by queue. consumer_daemon is one of the queues. What I didn’t know was that consumer_daemon is not specifically a Sidekiq process. So, when I restarted all the Sidekiq processes, the consumer_daemon was omitted from that, because it doesn’t run in Sidekiq with all the other background processing. Additionally, I didn’t realize that consumer_daemon was not just a queue, but also the name of the process responsible for processing that queue.”

This context highlights how the Ops engineer had a flawed mental model of the system. It also highlights how the wording on the command he uses to restart the service was also at fault for extending the outage.

Figure 9.2 highlights his expectations of the system versus the reality. You’ll notice that in the engineer’s mental model, the consumer_daemon processes from the p2_queue, when in reality it processes from the cd_queue. Another flaw in the mental model is the engineer presumes that the generic restart command will also restart the consumer_daemon, but in the actual model, you can see that there is a specific consumer_daemon restart command.

Figure 9.2 The engineer’s mental model of consumer_daemon

Because of the way the commands were grouped, the engineer inferred something that wasn’t true--but had no way of knowing that those assumptions were wrong. This might lead us to an action item to fix the wording of the restart service help documentation.
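As a sketch of what that fix might look like (the queue names, process names, and service units here are hypothetical), the help output could spell out which consumer actually handles each queue, and a blanket restart could cover the non-Sidekiq consumers too:

import subprocess

# Hypothetical queue-to-consumer mapping. The original help output implied that
# every queue was handled by a Sidekiq worker; spelling out the real consumer
# keeps the operator's mental model in line with reality.
QUEUE_CONSUMERS = {
    "p2_queue": "sidekiq",
    "reports_queue": "sidekiq",
    "cd_queue": "consumer_daemon",  # not a Sidekiq process
}

def print_help():
    print("restart-workers --all restarts the following consumers:")
    for queue, service in sorted(QUEUE_CONSUMERS.items()):
        print(f"  {queue:<14} handled by {service}")

def restart_all():
    """Restart every distinct consumer process, not just the Sidekiq workers."""
    for service in sorted(set(QUEUE_CONSUMERS.values())):
        subprocess.run(["systemctl", "restart", service], check=True)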

The way you name things will inform people’s mental models. If you have a light switch and above the light switch the words “Fire Alarm” are written in bold red letters, that might change your understanding of what the light switch does. When someone asks you to “turn off all the lights,” you would most likely skip this light switch because its label has altered your understanding of what it does. Mental models of systems are affected much the same way. Identifying these mistakes and correcting them is a key area of focus during the postmortem process.

9.3.3 Defining action items and following up

A postmortem is great for creating additional context and knowledge around an incident. But most postmortems should result in a series of action items that need to be performed. If an engineer made a poor decision because they lacked visibility into a part of the system, then creating that visibility would be a useful action item coming out of the postmortem.

Another bonus of action items in a postmortem is that they can demonstrate an attempt at getting better and improving the reliability of the system. A postmortem that is just a readout of what occurred doesn’t show any proactive measures on getting better at handling issues in the future.

Action items should be clearly defined and structured in the format of “who will do what by when.” An action item is incomplete if it doesn’t have these three crucial components. I’ve seen many organizations fail at making concrete progress on follow-up items because the task gets defined way too loosely. For example: “We will implement additional metrics for order processing.” That doesn’t sound like anything actionable to me. “We” is not really defined, and while additional metrics for order processing sounds like a noble goal, the fact that the task has no date means that it has no priority. And when it comes to action items coming out of a postmortem, the urgency of the task fades as time goes on.

Ownership of action items

Getting commitment from the team on action items can be a difficult task. Most people aren’t sitting around looking for additional work to do. When an incident comes about, it can be a challenge to get individuals to commit to having a piece of work done, especially if the work is nontrivial. Asking someone to create a new dashboard is one thing, but asking someone to rearchitect how a work queue system functions is a heavy ask.

Your best path to making progress on these items is to treat them as the different asks that they are. Action items should be separated into short-term and long-term objectives. Short-term objectives should be those tasks that can be performed in a reasonable amount of time and should be prioritized first. Reasonable is obviously going to be a moving target based on the workloads of different teams, but the team representatives should be able to give some idea on what’s realistic and what’s not. Long-term objectives are those items that will take a significant effort and will require some form of prioritization by leadership. Long-term objectives should be detailed enough that their scope and time commitment can be discussed with leadership. You’ll want to be sure to capture the following in your notes:

  • Detailed description of the work that needs to be performed

  • Rough time estimate of how long the team thinks the work will take

  • The decision-maker responsible for prioritizing the work

Once your list has been separated into short-term and long-term objectives, it’s time to get commitment on the short-term items first. As discussed previously, each item should have a concrete owner, the person who is going to perform the task and negotiate a due date for that action item. Remember to take into account that this is new, unplanned work being added to someone’s already existing workload. Having a date a bit into the future is better than not having a date at all. Offer some flexibility and understanding when team members commit to dates. Everyone’s assumption will be that these things need to be done immediately, and although that’s the preferred outcome, having the work done in five weeks is better than leaving the meeting in a stalemate, with no dates committed.

After your short-term list of action items has been filled out, move to the long-term objectives. Whereas short-term objectives are translated directly into action items, long-term objectives have more of an intermediate step. It isn’t possible to assign ownership and due dates directly to a long-term action item because of its scope. But if you leave it as is, the item won’t go anywhere.

Instead of having an action item for completing the work, the action item owner becomes responsible for advocating for the work to be completed: who will submit the request through the prioritization process, and by when? That action item owner handles the follow-up of the request through the prioritization process until it’s scheduled and in progress with the team that will ultimately resolve it.

Now you should have a complete list of action items along with detailed information about long-term objectives. Table 9.1 shows an example readout of the information.

Table 9.1 A list of action items from the postmortem

Action item | Owner | Due date
Update the restart script to include consumer_daemon. | Jeff Smith | EOD--Friday, April 3, 2021
Submit request for detailed logging in consumer_daemon. | Jeff Smith | EOD--Wednesday, April 1, 2021

Long-term objective | Estimated time commitment | Decision-maker
The logging for consumer_daemon is inadequate. It requires a rewrite of the logging module. | 2-3 weeks | Blue team’s management

Following up on action items

Putting someone’s name to an action item doesn’t guarantee that the owner of the item will get it done in a timely fashion. There are a ton of competing forces on a person’s time, and items get dropped all the time. As the organizer of the postmortem, however, it’s up to you to keep the momentum going toward completion.

During the postmortem, the team should agree to a cadence of updates from the group at large. Assigning each action item a ticket in whatever work-tracking system you use can work well. This makes the action items visible to everyone and gives people a way to check the status on their own.

The postmortem facilitator should then send out updates at the agreed-upon frequency to the postmortem team. The facilitator should also reach out to team members who have missed agreed-upon deliverable dates to negotiate new dates.

You should never let an item continue to remain on the list of incomplete action items if it has an expired due date. If an action item is not complete by the agreed-upon date, the facilitator and the action item owner should negotiate a new due date for the task. Keeping task due dates current helps give them a sense of importance, at least relative to other items in a person’s to-do list.

I can’t overstate the importance of following up on postmortem action items. These action items always get caught up in the whirlwind of day-to-day activity and quickly slide down the priority list. The follow-ups help keep these action items afloat, sometimes just for the sake of not receiving those nagging emails!

If you fail to make progress on an item, it might be worthwhile to document the risk that not completing the item creates. For example, if an action item to fix a poorly performing query never seems to get prioritized or make progress, you can document that as an accepted risk as part of the incident. This way, you can at least propose to the group that the issue isn’t deemed important enough (or has a low likelihood of being repeated) and isn’t worth the effort of completing when compared to other demands on the team’s time. But it’s important that this be documented in the incident so that everyone agrees and acknowledges that the risk is one the team is willing to accept.

This isn’t always a failure! Sometimes accepting a risk is the correct business decision. If a failure has a 1% chance of occurring but is going to require an outsized effort by the team, accepting that risk is a perfectly reasonable alternative. But where teams often get hung up is on who gets to decide that a risk should be accepted. Proper communication and a group consensus must be part of that risk acceptance.

9.3.4 Documenting your postmortem

Writing down your findings from a postmortem carries tremendous value. The written document serves as a record for communicating with people outside the immediate postmortem team, as well as a historical reference for future engineers who encounter similar problems. You should expect the audience of your postmortem to have a mix of skill sets across the organization. Write it with other engineers in mind, so be prepared to provide low-level details. But if you structure the document appropriately, you can also provide high-level overviews for people who aren’t as technically savvy as engineers.

Keeping the structure of postmortem documentation consistent helps to maintain the quality of the postmortems. If documentation follows a template, the template can serve as a prompt for the information you need to provide. The information in the document should get more detailed as you progress through it.

Incident details

The first section of the postmortem document should contain the incident details. Here you should outline these key items:

  • Date and time the incident started

  • Date and time the incident was resolved

  • Total duration of the incident

  • Systems that were impacted

This list doesn’t need to be written in any sort of prose format. It can be presented as a bullet list at the top of the page. Having this information at the very top enables you to easily locate it when you’re looking for help regarding incidents generally. Once you’re no longer thinking of the specific details of an incident, you may want to search this documentation as you’re looking for information in the aggregate.

An even better solution would be to also add this information to some sort of tool or database that is reportable. Something as simple as an Excel document could make summarizing a lot of this data easier. A database would offer the most flexibility, but for this section, I focus on basic paper-based documentation.
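If you do go the database route, even a small table can answer aggregate questions that a pile of documents can’t. Here’s a minimal sketch using SQLite; the schema and the sample row are illustrative only:

import sqlite3

# A minimal, reportable store for the header facts of each incident.
conn = sqlite3.connect("incidents.db")
conn.execute("""CREATE TABLE IF NOT EXISTS incidents (
    started_at   TEXT,     -- ISO 8601 start time
    resolved_at  TEXT,     -- ISO 8601 resolution time
    duration_min INTEGER,  -- total duration in minutes
    systems      TEXT      -- impacted systems, comma separated
)""")

conn.execute(
    "INSERT INTO incidents VALUES (?, ?, ?, ?)",
    ("2021-04-01T01:29", "2021-04-01T04:15", 166, "worker queues, consumer_daemon"),
)
conn.commit()

# Aggregate questions become a single query, e.g., incident minutes per month.
for month, minutes in conn.execute(
        "SELECT substr(started_at, 1, 7), SUM(duration_min) "
        "FROM incidents GROUP BY 1"):
    print(month, minutes)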

Incident summary

The incident summary is the section where a formal, structured writing of the incident should live. This section should provide high-level details of the event with context that doesn’t get too deep into the specifics. Think of this as an executive summary: people who are not technical can still follow the reading and understand the overall impact of the incident, what the user experience was like during the incident (if applicable), and how the incident was ultimately resolved. The goal is to keep the incident summary under two or three paragraphs if possible.

Incident walk-through

The incident walk-through is the most detailed section of the postmortem report. Your target audience in this section is specifically other engineers. It should provide a detailed walk-through of the incident timeline that was created during the postmortem meeting. The detailed report should not only walk through the decision process behind taking a particular action, but also provide any supporting documentation such as graphs, charts, alerts, or screenshots. This helps give the reader the context of what the engineers participating in the incident resolution experienced and saw. Providing annotations in screenshots is extremely helpful too. In figure 9.3, you can see with the big red arrow how pointing out problems in the graph gives context to the data as it relates to the incident.

Figure 9.3 Share and annotate graphics in your postmortem.

Even though the walk-through section is intended for other engineers, it shouldn’t presume the experience level or knowledge of the engineer reading it. When explaining key technical aspects of the incident, it’s worthwhile to give a primer on the underlying technology that led to or contributed to the problem. This doesn’t have to be exhaustive, but it should be enough for the engineer to understand at a high level what’s occurring or, at the very least, provide enough context so that they can begin to research it on their own.

As an example, if the incident being reported on was related to an excessive number of database locks, a brief explanation of how database locking works might be in order for the postmortem document to be a bit clearer to readers. The following is an example from an actual postmortem report of how you might segue into the details of an issue:

In production, a number of queries were executing that are not typically running around the time of the deployment. One query in particular was a long-running transaction that had been around for a while. This long-running transaction was holding a READ LOCK on the auth_users table. (Unfortunately, we were not able to capture the query during the troubleshooting phase of the incident. We did confirm that it was coming from a Sidekiq node.)

Transactions and Locking

When a transaction is running, the database will acquire locks on tables as those tables are accessed. All queries generate some kind of lock. A simple SELECT query will generate a READ lock during the lifetime of that query. This changes, however, when a query is executed inside a transaction: the lock it acquires is maintained for the lifetime of the transaction. Take this nonsensical transaction as an example:

BEGIN TRANSACTION;
SELECT COUNT(*) FROM auth_users;
SELECT COUNT(*) FROM direct_delivery_items;
COMMIT;

The auth_users table would have a READ LOCK for the entire time that the direct_delivery_items query was executing. Considering the size of direct_delivery_items, this could be a lock for more than 10 minutes, even though the lock is not needed from an application perspective. This is essentially what was transpiring on the day of the outage. A long-running query had a READ LOCK on auth_users, which prevented the ALTER TABLE statement from acquiring a lock.

Cognitive and process issues

This section for cognitive and process issues should highlight the things that the group has identified as areas for improvement. Consider all of the areas where people’s mental models were not correct. Maybe the documented process for handling the incident missed a step. Or maybe the process didn’t take into account this particular failure scenario. It could be something as simple as how the incident was managed or something more specifically technical, like how the database failover was performed.

This section isn’t about creating and assigning blame, but about identifying the key areas that contributed to the failure. A bulleted list of these items, each with a bit of supporting detail, is enough.

Action items

The final section should be nothing more than a bulleted list of the open action items that have come out of the postmortem meeting. The bulleted list should detail the components of all the action items: who will do what by when.

9.3.5 Sharing the postmortem

Once the postmortem has been completed, the last thing to do is to share it with the rest of the engineering organization. You must have a single location where all postmortems are stored. You can categorize those postmortems into groupings that make sense for your organization, but there should be a single location where all postmortems and the subsequent categories can be found. It can be difficult to categorize postmortems to some degree, because system failures are seldom isolated. They can have rippling effects across a platform, so a failure in one subsystem could cause failures in additional subsystems.

Many documentation systems use metadata or labels to help aid in categorizing information. The labels serve as pieces of additional information that are typically used in search engines. But with labels, you’re able to find different document types that might relate to the same subject, regardless of the document’s name or title. If the documentation system you’re using allows for labels or other forms of metadata to add to the document, you might be better off not creating a category or hierarchy at all and instead using the metadata options to detail documents. This allows you to have a document labeled with many keywords, so that if it does relate to multiple systems or departments, you can just add a label for each of the areas that’s impacted.

It’s also preferable that your documents follow a naming convention. The convention can vary based on your organization, but I strongly recommend that the first component of the document name be the date that the incident occurred. So, for example you’d name a postmortem “01-01-2019 - Excessive DB locking during deployment.” This gives a brief summary of the event, but at the same time, the date allows people looking for a particular incident to home in on the correct document with relative ease.

Lastly, when it comes to sharing postmortems, try to avoid restricting access to the documents if possible. You want that information to be widely read for communication purposes, and so that everyone understands the expectations for conducting these postmortems. Making the documents available to only a select group of people might send the wrong signal that only those select people are responsible for writing postmortems.

Summary

  • Blameful postmortems are not effective.

  • Understand the engineer’s mental model of the system to better understand decision-making.

  • Action items should be defined as who will do what by when.

  • Document the postmortem with different audiences intended for different sections.

  • Have a central location for sharing all postmortems with the team.
