Chapter 3. Alerts, On-Call, and Incident Management

Alerting is one of the most crucial parts of monitoring that you will want to get right. For whatever reason, infrastructure likes to go sideways in the middle of the night. Why is it always 3 a.m.? Can’t I have an outage at 2 p.m. on a Tuesday? Without alerts, we’d all have to be staring at graphs all day long, every day. With the multitude of things that could possibly go wrong, and the ever-increasing complexity of our systems, this simply isn’t tenable.

So, alerts. We can all agree that alerting is an important function of a monitoring system. However, sometimes we forget that the purpose of monitoring isn’t solely to send us alerts. Remember our definition:

Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.

Alerts are just one way we accomplish this goal.

Great alerting is harder than it seems. System metrics tend to be spiky, so alerting on raw datapoints tends to produce lots of false alarms. To get around that problem, a rolling average is often applied to the data to smooth it out (for example, five minutes’ worth of datapoints averaged into one datapoint), which unfortunately costs us granularity and means we occasionally miss important events. There’s just no winning, is there?

One of the other reasons alerting is so difficult to do well is that alerts usually go to a human, and we humans have limited attention. You’d rather spend it on problems of your choosing and not on the monitoring system sending you a text that something is on fire. Every time you get an alert, a little bit more of your attention is claimed by the monitoring system.

In this chapter, we’ll cover a few tips on creating better alerts, the trials and tribulations of on-call, and close out with a bit about incident management and postmortems.

What Makes a Good Alert?

With your multitude of alerts that sometimes are helpful, sometimes aren’t, and sometimes simply make no sense, how should you reconstruct them to be good? What does a good alert even look like?

Before we can answer that question, let’s make a distinction. I’ve found that when people talk about alerts, they really mean two different things, depending on the context:

Alerts meant to wake someone up

These require action to be taken immediately or else the system will go down (or continue to be down). This might mean phone calls, text messages, or alarms. Example: all your web servers are unavailable, and your company’s main site is no longer reachable.

Alerts meant as an FYI

These require no immediate action, but someone ought to be informed that they occurred. Example: an overnight backup job failed.

The latter may lead to the former. For example, if your systems are capable of auto-healing, then an auto-healing action might just be a message dropped in a log file. If the auto-healing fails, then you might send a message to the on-call person, expecting immediate action.

For our purposes, the second type of alert isn’t actually an alert: it’s a message. We’re going to be talking mainly about the former here. An alert should evoke a sense of urgency and require action from the person receiving that alert. Everything else can essentially be a log entry, a message dropped in your internal chat room, or an auto-generated ticket.

So with that understanding, we’re back to the original question: what makes a good alert? I’ve rounded up six practices I think are key to building great alerts:

  • Stop using email for alerts.

  • Write runbooks.

  • Arbitrary static thresholds aren’t the only way.

  • Delete and tune alerts.

  • Use maintenance periods.

  • Attempt self-healing first.

Let’s dig deeper into how these impact your alerting strategy and how you can leverage them for improvement.

Stop Using Email for Alerts

An email isn’t going to wake someone up, nor should you expect that it would. Sending alerts to email is also a great way to overwhelm everyone with noise, which will lead to alert fatigue.

What should you do instead? Think about what sorts of use cases each alert will have. I’ve found they fall into one of three categories:

Response/action required immediately

Send this to your pager, whether it’s an SMS, PagerDuty, or what-have-you. This is an actual alert, per our definition.

Awareness needed, but immediate action not required

I like to send these to internal chat rooms. Some teams have built small webapps to receive and store these for review with great success. You could send these to email, but be careful—it’s easy to overwhelm an inbox. The other options are usually better.

Record for historical/diagnostic purposes

Send the information to a log file.

With proper attention given to each alert’s purpose and the response it requires, you can easily lower the noise level of your alerting.
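To make those three categories concrete, here’s a minimal routing sketch in Python. The category names, the alert format, and the send_page and post_to_chat helpers are hypothetical stand-ins; in practice this logic usually lives in your monitoring tool’s notification configuration rather than in your own code.

    import logging

    logging.basicConfig(filename="alerts.log", level=logging.INFO)

    # Hypothetical notification helpers; real ones would call your paging
    # provider's API and your chat tool's webhook.
    def send_page(alert):
        print(f"PAGE on-call: {alert['summary']}")

    def post_to_chat(alert):
        print(f"CHAT #ops: {alert['summary']}")

    def route_alert(alert):
        """Route an alert based on the response it requires."""
        if alert["response"] == "action-required":
            # Someone needs to act right now: wake up the on-call.
            send_page(alert)
        elif alert["response"] == "awareness":
            # A human should see this eventually, but nobody needs to wake up.
            post_to_chat(alert)
        else:
            # Purely historical/diagnostic: keep it out of humans' way.
            logging.info("%s: %s", alert["name"], alert["summary"])

    route_alert({"name": "web-fleet-down", "response": "action-required",
                 "summary": "All web servers unreachable"})
    route_alert({"name": "nightly-backup-failed", "response": "awareness",
                 "summary": "Overnight backup job failed"})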

Write Runbooks

A runbook is a great way to quickly orient yourself when an alert fires. In more complex environments, not everyone on the team is going to have knowledge about every system, and runbooks are a great way to spread that knowledge around.

A good runbook is written for a particular service and answers several questions:

  • What is this service, and what does it do?

  • Who is responsible for it?

  • What dependencies does it have?

  • What does the infrastructure for it look like?

  • What metrics and logs does it emit, and what do they mean?

  • What alerts are set up for it, and why?

For every alert, include a link to your runbook for that service. When someone responds to the alert, they will open the runbook and understand what’s going on, what the alert means, and potential remediation steps.

Warning

As with many good things, runbooks can be easy to abuse. If your remediation steps for an alert are as simple as copy-pasting commands, then you’ve started to abuse runbooks. You should automate that fix or resolve the underlying issue, then delete the alert entirely. A runbook is for when human judgment and diagnosis are necessary to resolve something.

I’ve included an example runbook in Appendix A.

Arbitrary Static Thresholds Aren’t the Only Way

Nagios got all of us used to the idea of using arbitrary static thresholds for alert criteria, and it’s to our detriment. Not every situation has a warning and critical state that makes sense (I’d argue that most don’t). Furthermore, there are a lot of situations where alerting on such things as “datapoint has crossed X” isn’t useful at all. The quintessential case is disk usage: if I have a static threshold set at “free space under 10%,” then I’m going to miss a disk quickly growing from 11% used to 80% used overnight. That kind of thing is something I’d really want to know about, but my static threshold wouldn’t tell me.

There are plenty of other options available here. For example, using a percent change/derivative would handle our disk usage problem nicely by telling us “disk usage has grown by 50% overnight.”
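Here’s a minimal sketch of that idea; the samples and the 50-point growth threshold are purely illustrative, and most metrics tools can express the same check natively, so treat this as a sketch of the logic rather than a recommended implementation.

    def disk_grew_too_fast(used_pct_samples, max_growth_points=50):
        """Alert on the rate of growth rather than on a static threshold by
        comparing the first and last samples of the window (e.g., overnight)."""
        first, last = used_pct_samples[0], used_pct_samples[-1]
        return (last - first) > max_growth_points

    # A disk going from 11% used to 80% used overnight never trips a
    # "free space under 10%" threshold, but it does trip this check.
    overnight = [11, 15, 32, 55, 71, 80]
    print(disk_grew_too_fast(overnight))  # True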

With a bit more capable metrics infrastructure (e.g., Graphite), we could even apply some statistics to the problem, using various approaches such as moving averages, confidence bands, and standard deviation. We’ll go into more about statistics and how they could be applied in Chapter 4.
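As a small preview, here’s what a simple standard deviation band check might look like. The datapoints and the three-sigma band are arbitrary choices for illustration; a real deployment would lean on the statistical functions built into the metrics tool.

    from statistics import mean, stdev

    def outside_band(recent_datapoints, latest, num_stdev=3):
        """Flag the latest value if it falls outside a band of num_stdev
        standard deviations around the recent mean."""
        avg = mean(recent_datapoints)
        band = num_stdev * stdev(recent_datapoints)
        return abs(latest - avg) > band

    recent = [203, 198, 210, 205, 199, 207, 201]  # e.g., requests/sec
    print(outside_band(recent, 204))  # False: within normal variation
    print(outside_band(recent, 340))  # True: worth a closer look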

Delete and Tune Alerts

Noisy alerts suck. Noisy alerts cause people to stop trusting the monitoring system, which leads people to ignore it entirely. How many times have you looked at an alert and thought, “I’ve seen that alert before. It’ll clear itself up in a few minutes, so I don’t need to do anything”?

The middle ground between high-signal monitoring and low-signal monitoring is treacherous. This is the area where you’re getting lots of alerts, some actionable and some not, but not yet so many that you’ve stopped trusting the monitoring altogether. Over time, this leads to alert fatigue.

Alert fatigue occurs when you are so exposed to alerts that you become desensitized to them. Alerts should (and do!) cause a small adrenaline rush. You think to yourself, “Oh crap! A problem!” Having such a response 10 times a week, for months on end, results in long-term alert fatigue and staff burnout. The human response time slows down, alerts may start getting ignored, sleep is impacted—sound familiar yet?

The solution to alert fatigue is simple on its face: fewer alerts. In practice, that isn’t so easy. There are a number of ways to reduce the number of alerts you’re getting:

  1. Go back to the first tip: do all your alerts require someone to act?

  2. Look at a month’s worth of history for your alerts. What are they? What were the actions? What was the impact of each one? Are there alerts that can simply be deleted? What about modifying the thresholds? Could you redesign the underlying check to be more accurate?

  3. What automation can you build to make the alert obsolete entirely?

With just a little bit of work, you’ll find that your alert noise will be cut back significantly.
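The second item is much easier if you can export alert history from your monitoring or paging tool (most can, via an API or a CSV export). Even a rough tally, like this sketch over a hypothetical list of alert records, is often enough to show where the noise is coming from:

    from collections import Counter

    # Hypothetical export of a month of alert history; a real one would come
    # from your paging tool's API or a CSV export.
    history = [
        {"name": "disk-space-low", "acted_on": False},
        {"name": "disk-space-low", "acted_on": False},
        {"name": "web-fleet-down", "acted_on": True},
        {"name": "disk-space-low", "acted_on": False},
    ]

    fired = Counter(alert["name"] for alert in history)
    acted = Counter(alert["name"] for alert in history if alert["acted_on"])

    for name, total in fired.most_common():
        pct_actionable = 100 * acted[name] / total
        print(f"{name}: fired {total}x, actionable {pct_actionable:.0f}% of the time")

    # Anything that fired often but was rarely acted on is a candidate for
    # deletion, a better threshold, or automation.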

Use Maintenance Periods

If you need to do work on a service, and you expect it to trigger an alert (e.g., because the service will be down), set that alert into a maintenance period. Most monitoring tools support the concept, which is a simple one: if you’re working on the thing the alert is watching, and you know your work is going to cause an interruption, there’s no sense in having the alert go off. A firing alert is just a distraction, especially for your teammates who may not immediately know that you’re working on it.
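Most tools also expose maintenance periods (sometimes called downtimes or silences) through an API, which makes them easy to script into your deploy or maintenance tooling. Here’s a minimal sketch; set_downtime and clear_downtime are hypothetical stand-ins for whatever calls your tool actually provides.

    from contextlib import contextmanager

    # Hypothetical stand-ins for your monitoring tool's downtime/silence API.
    def set_downtime(check_name, minutes):
        print(f"silencing {check_name} for {minutes} minutes")
        return "downtime-123"

    def clear_downtime(downtime_id):
        print(f"clearing {downtime_id}")

    @contextmanager
    def maintenance_period(check_name, minutes=60):
        """Silence one specific check while maintenance work runs, then lift
        the silence even if the work blows up."""
        downtime_id = set_downtime(check_name, minutes)
        try:
            yield
        finally:
            clear_downtime(downtime_id)

    # Silence only the check you expect to trip, not the whole host or service.
    with maintenance_period("web01-http-check", minutes=30):
        print("restarting web service...")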

Warning

Be careful not to silence too many alerts. I can’t even begin to count the number of times that I’ve been working on something and found a previously unknown dependency that caused some other service to start to have problems. Such a scenario is actually desirable, as it reveals things about your infrastructure that you may not have known before, or it can warn you that something might be going sideways with the maintenance work you’re doing. Issuing a wide, blanket silence can cause more problems than it solves.

Attempt Automated Self-Healing First

If the most common action needed on an alert is to perform a known and documented series of steps which usually fixes the problem, why not let a computer do the work? Auto-healing is a great approach to avoiding alert fatigue, and when you’re managing a large environment, it’s not really optional (hiring more staff gets expensive!).

There are several ways you can implement auto-healing, but the most common and straightforward approach is to codify the standardized fix in a script and have your monitoring system execute that script instead of notifying a human. If the auto-healing attempt doesn’t resolve the problem, then send an alert.
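A sketch of that pattern is below. The service name, the systemctl commands standing in for the documented fix, and page_on_call are all assumptions for illustration; the shape of the logic (try the known fix, re-check, and only then page a human) is the point.

    import subprocess
    import time

    def service_is_healthy():
        """Hypothetical health check; a real one might hit an HTTP endpoint instead."""
        result = subprocess.run(["systemctl", "is-active", "--quiet", "myapp"])
        return result.returncode == 0

    def restart_service():
        """The known, documented fix, encoded once instead of living in a runbook."""
        subprocess.run(["systemctl", "restart", "myapp"])

    def page_on_call(message):
        """Hypothetical paging hook (e.g., a call to your paging provider's API)."""
        print(f"PAGE: {message}")

    def handle_unhealthy_service():
        """Run by the monitoring system in place of an immediate notification."""
        restart_service()
        time.sleep(30)  # give the service a moment to come back
        if service_is_healthy():
            # Success: a log line is enough, and nobody gets woken up.
            print("auto-heal succeeded for myapp")
        else:
            # The standard fix didn't work, so now it's worth a human's attention.
            page_on_call("myapp is down and an automatic restart did not fix it")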

On-Call

Ah, good old on-call. Many of you reading this book have probably been on-call at some point in your career, even if it was unofficial. For those that haven’t, on-call is where you are expected to be available to respond to pages about things going wrong. If that’s you at all times, you’re always on-call (this is bad, but we’ll talk about that later).

For those with on-call experience, you know how terrible on-call can be.1 You’re plagued by false alarms, unclear alerts, and constant firefighting. After a few months, you start experiencing the effects of burnout: irritability, sleep deprivation, anxiety, and more.

It doesn’t have to be this way, though, and I want to show you how to fix it. While we can’t stop computers from doing silly things in the middle of the night, we can avoid being needlessly woken up about them.

Fixing False Alarms

False alarms, for many of us, are just an everyday fact of life when it comes to monitoring. One hundred percent accuracy in alerting is a really hard problem, and one that is still unsolved. While tuning alerts isn’t always easy, many of you will be able to cut false alarms by an appreciable amount. At any rate, even if you never achieve 100% accuracy, you should still strive for it.

Here’s a simple way to keep your alerts tuned: as part of the duties of the person on-call, compile a list of every alert that fired during the previous day. Go through them and ask how each alert’s signal could be improved, or whether the alert can be deleted entirely. Do this every day you are on-call, and soon you’ll be in much better shape than when you started.

Cutting Down on Needless Firefighting

Sometimes it’s not a signal problem, and the alerts are legit. Except there are dozens of them a day, and they’re all legit. You have an excessive firefighting problem. We talked about this back in Chapter 1.

To quote a colleague who made an apt observation with monitoring, “You gotta fix your shit.”

Monitoring doesn’t fix anything. You need to fix things after they break. To get out of firefighting mode, you must spend time and effort on building better underlying systems. More resilient systems have fewer show-stopping failures, but you’ll only get there by putting in the effort on the underlying problems.

There are two effective strategies to get into this habit:

  1. Make it the duty of on-call to work on systems resiliency and stability during their on-call shift when they aren’t fighting fires.

  2. Explicitly plan for systems resiliency and stability work during the following week’s sprint planning/team meeting (you are doing those, right?), based on the information collected from the previous on-call week.

I’ve seen both methods work successfully, so I’d recommend trying both and seeing which one works better for your team.

Building a Better On-Call Rotation

You’ve no doubt experienced the unofficial on-call: instead of a formal designation of when you are on-call and when you are not, you were simply always on call. Always being on-call is a great way to burn people out (as I’m sure you already know!), and this is why on-call rotations are a great idea. They’re a tried-and-true method of managing on-call response.

Here’s how a simple rotation might work. Let’s say you have Sarah, Kelly, Jack, and Rich on your team. You set up a four-week rotation, whereby each of them is on-call for one week, starting on Wednesday at 10 a.m. and ending one week later. This rotates through in a specified order until everyone has been on-call for one week and off on-call for three, then repeats itself.

A schedule like this works pretty well and is a great start if you don’t have a rotation schedule already.
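If you want to sanity-check a rotation like this, it’s only a few lines of code. The sketch below uses the example team and an arbitrary Wednesday 10 a.m. start date; a real schedule is better kept in an on-call tool, but the arithmetic is the same.

    from datetime import datetime, timedelta

    ROTATION = ["Sarah", "Kelly", "Jack", "Rich"]
    ROTATION_START = datetime(2017, 1, 4, 10, 0)  # an arbitrary Wednesday at 10 a.m.
    SHIFT_LENGTH = timedelta(weeks=1)

    def on_call_for(moment):
        """Return who is on-call at a given moment in a weekly rotation."""
        shifts_elapsed = (moment - ROTATION_START) // SHIFT_LENGTH
        return ROTATION[shifts_elapsed % len(ROTATION)]

    print(on_call_for(datetime(2017, 1, 10, 3, 0)))   # Sarah, during the first shift
    print(on_call_for(datetime(2017, 1, 12, 14, 0)))  # Kelly, after the Wednesday handoff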

It’s important to start the on-call rotation during the workweek instead of tying it directly to a calendar week. This allows your team to do an on-call handoff: the person coming off on-call discusses with the person going on-call what’s in-flight that needs attention, any patterns noticed during the week, etc. I’ve been on teams where we did handoff at 9 a.m. on Monday morning and others where we did it in the afternoon on Wednesdays—I’d pick whatever day and time works best for your team. If you’re not sure, go with Wednesday at 10 a.m.

One question that comes up often about on-call schedules is whether you should have a backup on-call person or not, in addition to the primary on-call person. For most teams, I advise against this unless you have a suitably large team. Having a primary and a backup puts a person on-call for two rotations during the cycle. If you only have a team of four people and use one-week rotations, that means everyone is on-call for two weeks of the month—brutal!

Even if the backup isn’t called, they’re still required to do all the normal primary on-call things: be near a computer with internet, be sober, and so forth, and that just isn’t fair to them (it will also lead to quicker burnout).

This isn’t to say that the on-call person is all alone: you do absolutely need escalation paths available for issues beyond the knowledge and capability of the on-call person to solve. My caution has more to do with expecting everyone to always be available, regardless of whether they are officially on-call or not.

Now, I know what you’re thinking: what if primary on-call doesn’t respond to the alert? That’s fine. It’s the job of the on-call person to respond to alerts, and people are more responsible than they’re often given credit for. If you have a consistent problem with your on-call not responding to alerts, you’ve got a different problem. Otherwise, I wouldn’t worry about it.

That brings us to another point: how many people do you need for an effective on-call rotation schedule? That depends on two factors: how busy your on-call tends to be and how much time you want to give people between on-call shifts.

On-call shifts with only two or three incidents a week can be considered light—what you should be aiming for. The more incidents you have on a regular basis, the more time off you should give between rotations. As for how much time off between shifts, for a normal shift (such as the example I gave), I recommend three weeks between on-call shifts for each person. Assuming you have only a single primary on-call, that means you’re looking at a team of four people. If you want a backup rotation as well, then you need eight people on the schedule.

I strongly encourage you to put software engineers into the on-call rotation as well. The idea behind this is to avoid the “throw-it-over-the-wall” version of software engineering. If software engineers are aware of the struggles that come up during on-call, and they themselves are part of that rotation, then they are incentivized to build better software. There’s also a more subtle reason here: empathy. Putting software engineers and operations engineers together in some way increases empathy for each other, and it’s awfully hard to be upset at someone you genuinely understand and like.

Lastly, augment your on-call with tools such as PagerDuty, VictorOps, OpsGenie, etc. These tools help you build and maintain escalation paths and schedules, and can automatically record your incidents for you for later review. I try to avoid recommending specific tools in this book, but when it comes to tools that help on-call, I really cannot recommend these enough.

With some work, you can significantly improve your on-call experience for everyone involved.

Incident Management

Incident management is a formal way of handling issues that arise. There are several frameworks for it in the tech world, one of the most popular being ITIL’s, which defines an incident as:

An unplanned interruption to an IT service or reduction in the quality of an IT service.

ITIL 2011

ITIL’s process for incident management looks something like this:

  1. Incident identification

  2. Incident logging

  3. Incident categorization

  4. Incident prioritization

  5. Initial diagnosis

  6. Escalation, as necessary, to level 2 support

  7. Incident resolution

  8. Incident closure

  9. Communication with the user community throughout the life of the incident

Despite the stilted presentation of it, a formal, consistent method for detecting and responding to incidents provides a certain rigor and discipline to a team. For most teams, formal methods such as this are overkill. However, what if we took the preceding ITIL process and simplified it, to not be so heavyweight?

  1. Incident identification (monitoring identifies a problem).

  2. Incident logging (monitoring automatically opens a ticket for the incident).

  3. Incident diagnosis, categorization, resolution, and closure (on-call troubleshoots, fixes the problem, resolves the ticket with comments and additional data).

  4. Communications throughout the event as necessary.

  5. After the incident is resolved, come up with remediation plans for building in more resiliency.

Hey, that’s not so bad. In fact, I’d bet that a lot of you are doing something very similar to this already, and that’s great. There is real value in establishing a standard, formal procedure for handling incidents: incidents are logged and followed up on consistently; your users, management, and customers get more transparency and insight into what’s going on; and your team can start to spot patterns and hot spots in the app and infrastructure.
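Step 2 of the simplified process is usually the easiest piece to automate: most monitoring tools can call a webhook when an alert fires, and the handler only needs to open a ticket. In the sketch below, create_ticket is a hypothetical stand-in for your ticketing system’s API, and the alert payload is made up for illustration.

    def create_ticket(title, body):
        """Hypothetical stand-in for your ticketing system's API call."""
        print(f"opened ticket: {title}")
        return "INC-1234"

    def handle_monitoring_webhook(alert):
        """Called by the monitoring system when an alert fires (step 2: incident
        logging). The on-call still does the diagnosis and resolution (step 3)."""
        title = f"[{alert['severity']}] {alert['name']}"
        body = (f"Fired at {alert['fired_at']}\n"
                f"Details: {alert['summary']}\n"
                "Resolution notes go here when the incident is closed.")
        return create_ticket(title, body)

    handle_monitoring_webhook({"name": "checkout-latency-high",
                               "severity": "critical",
                               "fired_at": "2017-03-02T03:12:00Z",
                               "summary": "p99 latency above 2s for 10 minutes"})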

For most incidents that are resolved quickly, this process works well. What about for incidents that are actual outages and last longer than a few minutes? In that case, a well-defined set of roles becomes crucial. Each of these roles has a singular function and they should not be doing double-duty:

Incident commander (IC)

This person’s job is to make decisions. Notably, they are not performing any remediation, customer or internal communication, or investigation. Their job is to oversee the outage investigation and that’s it. Often, the on-call person adopts the IC role at the start of the incident. Sometimes the IC role is handed off to someone else, especially if the person on-call is better suited for another role.

Scribe

The scribe’s job is to write down what’s going on. Who’s saying what, and when? What decisions are being made? What follow-up items are being identified? Again, this role should not be performing any investigation or remediation.

Communication liaison

This role communicates status updates to stakeholders, whether they are internal or external. In a sense, they are the sole communication point between people working on the incident and people demanding to know what’s going on. One facet of this role is to prevent stakeholders (e.g., managers) from interfering with the incident by directly asking those working on resolving the incident for status updates.

Subject matter experts (SMEs)

These are the people actually working on the incident.

Warning

One common anti-pattern I’ve seen with incident management roles is for them to follow the day-to-day hierarchical structure of the team or company. For example, the manager of the team is always the IC. The incident management roles do not need to resemble the day-to-day team roles. In fact, I encourage you to have the team’s manager act as communication liaison rather than IC, and allow an engineer on the team to act as IC. These are often a much better fit, as it allows the manager to protect the team from interruption and it puts decision-making power in a person who is best suited to assess risk and trade-offs.

This is only a brief overview of incident management, but if you’re interested in learning more about the topic, I recommend reading PagerDuty’s Incident Response documentation.

Postmortems

I want to devote some special attention to step five from the simplified incident response process above. After an incident has occurred, it’s always advisable to have a discussion about the incident (what happened, why, how to fix it, etc.). For some incidents, especially those concerning outages, a proper postmortem is a great idea.

You’ve likely participated in, or even perhaps led, a postmortem. Essentially, you get all interested parties together and discuss what went wrong, why, and how the team is going to make sure it doesn’t happen again.

There’s a nasty habit in postmortems that I’ve noticed: a blame culture. If you’ve ever been in a team where people were punished for mistakes or people felt compelled to cover up problem areas, you were probably in a blame culture.

If people fear retribution or shaming for mistakes, they will hide or downplay them. You can never fix deep, underlying issues if your actions after an incident are to blame a person.2

Wrap-Up

This jam-packed chapter covered a lot of material about alerting, on-call, and incident management. To recap:

  • Alerting is hard, but a few key tips will keep you on the right path:

    • Don’t send alerts to email.

    • Write runbooks.

    • Not every alert can be boiled down to a simple threshold.

    • Always be reevaluating your alerts.

    • Use maintenance periods.

    • Attempt automated self-healing before alerting someone.

  • Improving the on-call experience isn’t too difficult with a few tweaks.

  • Building a simplified and usable incident management process for your company should be prioritized.

Now that we’ve gotten alerting and on-call out of the way, let’s move on to everyone’s least favorite class from school: statistics!

1 It’s not a universal truth that on-call sucks. Many companies have amazing and effective on-call experiences. It takes a lot of work to get there.

2 A great book on this topic is Beyond Blame by Dave Zwieback (O’Reilly, 2015).
