When to write a postmortem document

One question I often get from people who are starting to document postmortems in their organization is, "Should I write a postmortem?" Organizations decide this in many different ways, but I have a simple rule I like to follow: if someone asks for a postmortem, you should write one. In a large organization, this works well, because someone will often ask, "What was that incident last night all about?" or "Will there be a postmortem for last week's outage?" These are cues that a document should be created.

In a smaller organization, it is more difficult because, often, you will be the one deciding whether to document the incident. To help you decide, you could come up with an internal metric. For example, you might say, "If it took us more than 10 minutes to recover, we should write a postmortem." You can also keep it casual and only write postmortems for things that feel abnormal.

Keeping it casual can be dangerous, though. Over the years, I have developed a bit of a gut feeling for when something is postmortem-worthy. On reflection, I have found that relying on this feeling means I do not document some of the smaller outages I am involved in, but I do document similar-sized outages that other people are involved in. This realization tells me that I should come up with a more specific metric for when I should write a postmortem. The metric I have been using recently to solve this is two-fold:

  1. If the incident happens twice within a few days, I should write a postmortem
  2. If we are down for more than 30 minutes, I should write a postmortem

An example of the first one might be a bad code push, followed by a rollback, followed by another release that didn't fix the initial bad code push correctly. Another example is a series of bad configuration changes that cause similar and related outages.

An example of the second one might be a period of more than 30 minutes where an application's SLI is outside of normal bounds. See Chapter 2, Monitoring, for a description of SLIs. Another example could be customers being unable to access the app for more than 30 minutes.
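To make the two-fold metric concrete, here is a minimal sketch of it as a check in Python. It is only an illustration, not a tool described in this chapter. The Incident class, its kind label for grouping related incidents, and the reading of "a few days" as three days are all assumptions made for the example. Only the 30-minute threshold comes from the rule itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: "a few days" is read as three days; the rule gives no exact figure.
REPEAT_WINDOW = timedelta(days=3)
# From the rule: being down for more than 30 minutes.
DOWNTIME_THRESHOLD = timedelta(minutes=30)


@dataclass
class Incident:
    kind: str            # hypothetical label used to group similar incidents
    started: datetime    # when the incident began
    downtime: timedelta  # how long the service was down or out of bounds


def should_write_postmortem(incident, history):
    """Return True if either rule applies to `incident`.

    `history` is a list of earlier incidents to compare against.
    """
    # Rule 2: down for more than 30 minutes.
    if incident.downtime > DOWNTIME_THRESHOLD:
        return True
    # Rule 1: a similar incident happened within the repeat window.
    for earlier in history:
        if earlier is incident:
            continue  # do not compare an incident with itself
        gap = abs(incident.started - earlier.started)
        if earlier.kind == incident.kind and gap <= REPEAT_WINDOW:
            return True
    return False
```

For instance, a five-minute outage caused by a bad deploy would not trigger a postmortem on its own, but a second bad-deploy outage the next day would. How you decide that two incidents are "the same" is a judgment call; the kind label here simply stands in for that judgment.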

I use these metrics because they match the level of severity that First Look Media (my employer at the time of writing) is currently operating with. We have smaller incidents due to bugs, carelessness, or other things. At the speed we are operating, we cannot stop and document everything, but incidents that hit these metrics seem to be the same sort of incidents that concern a broader group of people. When multiple people (engineers, product folks, customers, and so on) are concerned, we have to manage their expectations and produce documentation to calm them, and also reach out to them and make sure our service fills them with joy, not dread.
