Holding a postmortem meeting

A postmortem meeting is the follow-up to the postmortem document. It typically finalizes action items, reviews the root-cause findings, and offers a safe setting for discussion. The best set of people to invite is usually those involved in the incident, the tech leads and product managers for the affected services, and any interested engineers. Finding the right balance is hard: if the meeting is too large, you will not get much done, but if it is too small, knowledge won't be disseminated well.

It's a good idea to have those involved in the incident present because they know what happened and can fill in any data missing from the document. Tech leads from affected services should be there to correct any false assumptions made about their services and to accept responsibility for getting the action items implemented. Product managers for affected services are important because they can describe the business needs related to the service and work with the tech leads to make sure that the fixes get prioritized. Allowing interested people to attend promotes knowledge transfer.

The incident commander from the incident should lead the meeting. While the incident commander doesn't always write the entire postmortem, they are often the person most knowledgeable about the event and the people involved, so they can direct the conversation. Open the meeting by asking anyone who has not read the document to either excuse themselves or agree not to participate. The reasoning is that people who have not read the document will ask questions or make statements that the document already answers. The meeting exists for wider learning and should not cater to those who couldn't take the time to be informed about the incident. Following that, try sticking to the following agenda:

  • A brief timeline and incident summary: This should take no more than five minutes. Since everyone has read the document, the goal is to refresh people's memory and point out any particularly noteworthy facts. You do not need to go minute by minute, so just give broad strokes.
  • Discussion of the root cause: This is usually the most variable section, as sometimes people do not understand how the root cause could have been possible. It is also where people can get defensive if they feel they are being attacked or blamed. Keep the discussion civil and redirect any accusations of blame toward working together, as we are all responsible for each other's success. Also, be careful not to go too deep into how to engineer a solution to the root cause. If there is no obvious fix, make it an action item to schedule a follow-up meeting for engineering a solution.
  • Questions: This section often overlaps with the previous one. Use it to go over any remaining concerns people have. If the group is large, you can have people submit questions in advance.
  • Discussion of action items: The previous discussions may have generated extra action items, so make sure that all tasks are properly documented. If you have an issue tracker, make sure issues are logged and linked from the document. Also, make sure that each action item has someone responsible for it. We don't want to blame people, but we do want to make sure that fixes get implemented. One of the most annoying experiences is participating in multiple postmortems that all have the same action items because no one made sure that the fixes got done.
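The bookkeeping in the last agenda item (every action item has an owner and a tracker link) is easy to check mechanically. The following is a minimal sketch, not a reference to any particular issue tracker: the `ActionItem` fields, the `missing_bookkeeping` helper, and the example URL are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ActionItem:
    """One follow-up task from a postmortem (hypothetical structure)."""
    title: str
    owner: Optional[str] = None        # person accountable for the fix
    tracker_url: Optional[str] = None  # link to the logged issue


def missing_bookkeeping(items: List[ActionItem]) -> List[Tuple[str, str]]:
    """Return (title, problem) pairs for items lacking an owner or a tracker link."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append((item.title, "no owner"))
        if not item.tracker_url:
            problems.append((item.title, "no tracker link"))
    return problems


# Example data; the names and URL are placeholders.
items = [
    ActionItem("Add alert on queue depth", owner="alice",
               tracker_url="https://tracker.example/1"),
    ActionItem("Schedule follow-up design meeting"),
]

for title, problem in missing_bookkeeping(items):
    print(f"{title}: {problem}")
```

Running a check like this at the end of the meeting gives a short list of items to resolve before anyone leaves the room.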

This agenda works well for larger organizations. I have worked on teams where we ran a postmortem meeting once a week and went through every outage from the past week in this format. In smaller organizations, it is often useful to be less formal and go faster. In these cases, I like a quick round-robin discussion about a postmortem or incident:

  • What worked? Go around the room and ask each person to list up to three things that worked during the incident.
  • What didn't work? Have each person state what they felt didn't work well.
  • What should we start doing? This time, have each person list one thing they think the team should start doing. This is often a good way to figure out the prioritization of action items or to generate new ones.
  • What should we stop doing? Next, have each person name one thing the team should stop doing.
  • What should we improve? Finally, go around once more and have each person name something the team already does but could do better.

For this more informal rotation, try to spend only about a minute on each person for each question. With five questions, the whole process then takes less than half an hour for a team of five. At two minutes per turn, the same team would need closer to 50 minutes.
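The time budget for this rotation is simple arithmetic: people times questions times minutes per turn. As a rough sketch (the function name is made up for illustration, and it assumes every person takes exactly one turn per question):

```python
def round_robin_minutes(people: int, questions: int, minutes_per_turn: float) -> float:
    """Total meeting time if every person gets one turn per question."""
    return people * questions * minutes_per_turn


QUESTIONS = 5  # worked, didn't work, start, stop, improve

print(round_robin_minutes(5, QUESTIONS, 1))  # 25
print(round_robin_minutes(5, QUESTIONS, 2))  # 50
```

For a team of five, one minute per turn fits in under half an hour, while two minutes per turn does not, so it is worth keeping turns short or trimming the number of questions as the team grows.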
