Analyzing past postmortems

Once everything is said and done, it is good to go back and review past postmortems. Once a quarter, or once a year, collect all of the postmortems and try to pull together some metrics. These metrics can give you insight into how your team is responding to incidents:

  • Time to recovery
  • Time between failures
  • Number of alerts fired versus postmortems generated
  • Number of alerts fired per on-call rotation

MTTR and MTBF

Looking across incidents rather than at any single one, two metrics that are often discussed are mean time to recovery (MTTR) and mean time between failures (MTBF). Tracking these numbers across a year can show how your ability to respond to incidents is improving or changing. Note that the goal is to minimize the time until recovery, not necessarily the time until the cause of the outage is fixed. If MTBF is low, it might mean that your team is not investing enough in testing, and the constant failures are probably also draining the team. If MTTR is high, it probably means that your team does not understand how to respond to incidents, or that the tooling for rollbacks and other recovery actions is poor or slow.
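To make this concrete, here is a minimal sketch, assuming hypothetical incident records pulled from past postmortems, of how MTTR and MTBF could be computed for a reporting period:

    from datetime import datetime, timedelta

    # Hypothetical incident records from past postmortems, sorted
    # chronologically: the time the outage started and the time the
    # service recovered.
    incidents = [
        (datetime(2023, 1, 4, 2, 15), datetime(2023, 1, 4, 3, 5)),
        (datetime(2023, 2, 10, 14, 0), datetime(2023, 2, 10, 14, 40)),
        (datetime(2023, 3, 22, 9, 30), datetime(2023, 3, 22, 11, 0)),
    ]

    # MTTR: average time from the start of an outage until recovery.
    recovery_times = [end - start for start, end in incidents]
    mttr = sum(recovery_times, timedelta()) / len(recovery_times)

    # MTBF: average time from the end of one outage to the start of the next.
    gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
    mtbf = sum(gaps, timedelta()) / len(gaps)

    print(f"MTTR: {mttr}")
    print(f"MTBF: {mtbf}")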

Alert fatigue

Figuring out what percentage of alerts were actionable is important for the health of those responding. As we mentioned in Chapter 3, Incident Response, responders need to be able to trust that when they get alerted, a response is actually needed. Having a historical record of alerts that turned into postmortems, alerts per rotation, and actionable alerts can help you determine whether responders are overwhelmed, and whether they are being interrupted only by things that matter. If you have engineers who know they will get woken up at three in the morning every time they are on call, they may not stick around long, or they may simply stop responding.
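As a rough sketch, assuming a hypothetical alert history where each alert is tagged with the rotation it fired in, whether it was actionable, and whether it resulted in a postmortem, these fatigue-related metrics could be summarized like this:

    from collections import Counter

    # Hypothetical alert history: (on-call rotation, actionable?, postmortem?).
    # In practice this would be exported from your alerting and
    # postmortem tooling.
    alerts = [
        ("2023-w01", True, True),
        ("2023-w01", False, False),
        ("2023-w02", True, False),
        ("2023-w02", False, False),
        ("2023-w02", False, False),
    ]

    actionable = sum(1 for _, was_actionable, _ in alerts if was_actionable)
    postmortems = sum(1 for _, _, had_postmortem in alerts if had_postmortem)
    alerts_per_rotation = Counter(rotation for rotation, _, _ in alerts)

    print(f"Actionable alerts: {actionable / len(alerts):.0%}")
    print(f"Alerts fired vs. postmortems generated: {len(alerts)} vs. {postmortems}")
    for rotation, count in alerts_per_rotation.items():
        print(f"{rotation}: {count} alerts")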

Discussing past outages

Once you have collected data about your past postmortems and outages, you should share it with the team. After circulating the data, it can be useful to have one-on-ones or a team meeting where people share their concerns about the current health of incident response. Sometimes the team may feel fine; other times they may feel something is wrong: perhaps there are too many alerts, monitoring is poor, or fixes from outages aren't being prioritized enough.

Other times, a single member of the team may be having issues. I remember very clearly that while I was at Google, I was completely burned out because we were a four-person team responding to the alert load a team of six should have been handling. I worked with my manager to take two months off from being on call. Others weren't as overwhelmed as I was, and I appreciated that they were able to cover for me. That time off allowed me to recover, and if my manager hadn't checked in on me, I am not sure I would have taken it. Making sure your team is healthy and that incident response is not a burden is important for the long-term survival of your developers. This is often not visible at the level of a single postmortem, but it can be seen when you look at things in aggregate and talk to your team.
