6 Alert fatigue

This chapter covers

  • Using on-call best practices
  • Staffing for on-call rotations
  • Tracking on-call happiness
  • Providing ways to improve the on-call experience

When you launch a system into production, you’re often paranoid and ill-equipped to understand all the ways your system might break. You spend a lot of time creating alarms for every nightmare scenario you can think of. The problem is that this generates a lot of noise in your alerting system, noise that quickly gets ignored and treated as the normal rhythm of the business. This pattern, called alert fatigue, can lead your team to serious burnout.

This chapter focuses on the aspects of on-call life for teams and how best to set them up for success. I detail what a good on-call alert looks like, how to manage documentation on resolving issues, and how to structure daytime duties for team members who are on call for the week. Later in the chapter, I focus on tasks that are more management focused, specifically around tracking on-call load, staffing appropriately for on-call work, and structuring compensation.

Unfortunately, some of the tips here are geared toward leadership. Notice I didn’t say “management.” Any team member can be the voice and advocate for these practices. And if you’re reading this book, you’re probably going to be that voice to raise some of these points. Although I prefer to focus on empowering tips that anyone can use, the on-call experience is important enough to break away from that restriction just a bit. Since not all readers may have experience with being on call, a brief introduction to the frustrations might be in order.

6.1 War story

It’s 4 a.m., and Raymond receives a phone call from the automated alerting system at his job. Today is the first time Raymond is participating in the on-call shift, so he’s sleeping lightly, worried about this very moment. Looking at the alert message, he sees that the database is at 95% CPU utilization. Frantic, he jumps up and logs on to his computer.

He begins looking at database metrics to see if he can see the source of the problem. All the activity on the system appears normal. Some queries seem to be running for a long time, but those queries run every night, so there’s nothing new there. He checks the website, and it returns in a reasonable amount of time. He looks for errors in the database logs but finds none.

Raymond is at a bit of a loss. Not reacting seems like the wrong move; he was alerted, after all. That must mean that this issue is abnormal, right? Maybe something is wrong and he’s just not seeing it or he’s not looking in the right place. After staying awake for an hour or so, watching the CPU graphs, he notices that they’re slowly beginning to come back to normal levels.

Breathing a sigh of relief, Raymond crawls back into bed and tries to enjoy his last 30 minutes or so of sleep. He’ll need it, because tomorrow he’s going to get the same page. And the night after that, and the night after that for his entire on-call shift.

As I discussed in chapter 3, sometimes system metrics alone don’t tell the whole story. But they’re typically the first thing that comes to mind to build alerting around. That’s because whenever you have a system problem, it’s not unusual for that problem to be accompanied by high resource utilization. Alerting on high resource utilization is usually the next step.

But the problem with alerting on resource utilization alone is that it’s often not actionable. The alert is simply describing a state of the system, without any context around why or how it got into that state. In Raymond’s case, he might be experiencing the regular processing flows of the system, with some heavy reporting happening at night as scheduled.

After a while, the alarms and pages for high CPU utilization on the database become more of a nuisance than a call to action. The alarms become background noise and lose their potency as early warning systems. As I said previously, this desensitization to the alarms is known as alert fatigue.

DEFINITION Alert fatigue occurs when an operator is exposed to many frequent alarms, causing the operator to become desensitized to them. This desensitization slows responses as operators become accustomed to false alarms, reducing the overall effectiveness of the alerting system.

Alert fatigue can be dangerous from both a system response perspective and from an employee’s mental health and job satisfaction perspective. I’m sure Raymond didn’t enjoy being awakened every morning at 4:00 for no discernible reason. I’m sure his partner wasn’t pleased with having their sleep disturbed either.

6.2 The purpose of on-call rotation

Before I get too deep in the woods, a basic definition of the on-call process is in order. An on-call rotation is a schedule of individuals who are designated as the initial point of contact for a system or process. Many organizations have different definitions of what an on-call person’s responsibilities might be, but at a high level, all on-call rotations match this basic definition.

DEFINITION On-call rotations designate a specific individual to be the primary point of contact for a system or process for a period of time.

Notice that the definition doesn’t go into any specifics about the responsibilities of the on-call person. That varies by organization. On some teams, the on-call person may serve as just a triage point to determine whether the issue needs to be escalated or can wait until a later time. In other organizations, the on-call person may be responsible for resolving the issue or coordinating the response to whatever triggered the escalation. In this book specifically, I discuss on-call rotation primarily as a means for supporting and troubleshooting systems after hours, when they enter a state the team has deemed abnormal.

With this definition in mind, a typical on-call rotation will last a full week, with each staff member in the rotation taking turns in this one-week assignment. Defining an on-call rotation eliminates a lot of the guesswork for staff members who need assistance after hours, while at the same time helping to set expectations for the staff member who is on call for the week.

If you’ve detected a problem in the middle of the night and need assistance, the last thing you want to do is wake up four different families to find the person who can help you. The on-call staff member knows that they might be called at odd hours and can take the actions necessary in their personal lives to ensure availability. Without an on-call schedule, I guarantee you that when something goes wrong with the database, anyone who can even spell SQL will be in the woods on a camping trip with absolutely no cell phone coverage. The on-call rotation helps to establish that Raymond needs to sit this trip out in case he’s needed.

At the heart of the on-call process from a technology perspective lie your metrics, monitoring, and alerting tools. (They may be one and the same, depending on your stack.) Though your monitoring tool may be able to highlight an abnormal condition in your system, actually getting in touch with the right people to handle that situation is the job of your alerting system. Depending on the criticality of your operation, you’ll want to ensure that your alerting system can reach engineers via phone calls and email notifications. Some commercial offerings also include mobile apps that can receive push notifications.

It’s important that this notification is automated to improve your response time to outages. Without automated notification, an on-call rotation doesn’t deliver its full value of getting problems investigated proactively. Unless you have a full 24/7 network operations center, you’re probably relying on a customer noticing an issue with the website and then reaching out through your support channels, which probably are also not staffed in the wee hours of the night.

There’s no better way to start the morning than sitting in a meeting explaining to your bosses why the site was down for three hours and nobody noticed. Despite sleep being a biological necessity, it still won’t work as an excuse.

This is where your automated notification systems come in. Several commercial players are on the market, such as PagerDuty, VictorOps, and Opsgenie (the tools I reference later in this chapter), as well as at least one open-source solution.

These tools not only maintain the on-call schedule for your team, but also integrate with most monitoring and metric systems, allowing you to trigger the notification process when a metric exceeds one of your defined thresholds. Defining these criteria is the focus of the next section.

6.3 Defining on-call rotations

On-call rotations can be difficult to reason about. You have a lot of factors to consider, such as the size of the team, the frequency and length of the rotation, and how your company will deal with compensation. It’s an unenviable task that, of course, you’re going to have to tackle if you’re responsible for mission-critical systems.

On-call rotations are touchy because they typically happen organically within an organization, skipping the formality of other employment structures. Without a lot of forethought, your on-call rotation can easily be set up in an inequitable way, putting a heavy burden on staff members.

I’ve yet to have an interview for an operations engineering role where on-call wasn’t discussed, usually with the question, “So, what’s the on-call rotation like?” If the rotation hasn’t been designed with care, the hiring manager has only two choices: lie, or tell the truth and watch the candidate’s interest in the role wane in real time. Hopefully, this chapter will help you clear a few of those hurdles.

On-call rotations should consist of the following:

  • A primary on-call person

  • A secondary on-call person

  • A manager

The primary on-call person is the designated point of contact for the rotation and is the first person to be alerted in the event of an on-call event. The secondary on-call person is a fallback if the primary isn’t available for some reason.

The escalation to the secondary on-call person might be coordinated via scheduled time, for example, when the primary on-call person knows they’ll be unavailable for a brief period. But life tends not to adhere to on-call schedules. Bad cell phone service, personal emergencies, and sleeping through alert notifications are all dangers to the on-call process, and the secondary on-call role is also designed to protect against them.

Finally, the last line of defense is the manager. Once you’ve gone through the primary and secondary on-call person, chances are the team manager should be engaged not only for notification purposes, but also to expedite the on-call response.

Moving through these various on-call tiers is called escalation. Knowing when to escalate is going to depend on the team in question, but service-level objectives (SLOs) should be defined for the response time to alert notifications. The SLO should typically be broken into three categories:

  • Time to acknowledge

  • Time to begin

  • Time to resolve

6.3.1 Time to acknowledge

Time to acknowledge is defined as the amount of time an engineer has to confirm receipt of an alert notification. This ensures that everyone is aware that the engineer has received the notification and that something needs to be investigated.

If an alert notification isn’t acknowledged within the predefined SLO, the alert can be escalated to the secondary person in the on-call rotation, and the timer for the SLO starts again for the new engineer. If the SLO is again violated, another escalation occurs to the manager (or the next person in your escalation path if you’ve defined it differently).

This continues until someone acknowledges the alert. It’s important to point out that once an engineer has acknowledged an alert, they own that alert through the resolution process, regardless of their on-call status. If you acknowledge an alert and cannot work on it for whatever reason, it’s your responsibility to hand that notification off to another engineer who can. This rule exists to prevent the situation where a notification has been acknowledged but nobody is clear on who owns the resolution.
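
To make the escalation mechanics concrete, here’s a rough sketch in Python of what an alerting tool does on your behalf. The tier names, contacts, and timeout values are hypothetical, and the notify and wait_for_ack functions are placeholders for whatever your tool actually does behind the scenes.

# A hypothetical escalation path: each tier gets a time-to-acknowledge window
# before the alert moves on to the next tier.
ESCALATION_PATH = [
    {"role": "primary",   "contact": "primary@example.com",   "ack_timeout_min": 5},
    {"role": "secondary", "contact": "secondary@example.com", "ack_timeout_min": 5},
    {"role": "manager",   "contact": "manager@example.com",   "ack_timeout_min": 10},
]

def notify(contact, alert):
    # Placeholder: a real tool would call, text, or push a notification here.
    print(f"Notifying {contact}: {alert}")

def wait_for_ack(contact, timeout_min):
    # Placeholder: a real tool would poll for an acknowledgment until the
    # timeout expires. Returning False here simulates a missed page.
    return False

def escalate(alert):
    for tier in ESCALATION_PATH:
        notify(tier["contact"], alert)
        if wait_for_ack(tier["contact"], tier["ack_timeout_min"]):
            # Whoever acknowledges owns the alert until they hand it off.
            return tier["contact"]
    return None  # Nobody acknowledged; time to open an incident another way.

escalate("Database CPU at 95% for 30 minutes")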

6.3.2 Time to begin

Time to begin is the SLO for how long the on-call engineer has before starting work on a resolution to the issue. Acknowledging the notification signals that the on-call engineer is aware of the issue but, due to circumstances, might not be able to begin working on it immediately. In some cases, that could be fine. In others, it could be problematic.

Defining a time-to-begin SLO also helps the on-call engineer schedule their personal life. If the expectation is that an alert gets worked on within 5 minutes of notification, you probably won’t travel anywhere without a laptop strapped to your back. If the SLO is 60 minutes, though, you have a bit more flexibility.

This time to begin will vary between services, which could make for a complicated web of exceptions. If the company order-taking platform is down, waiting 60 minutes for work to begin will obviously be unacceptable. But similarly, if even one service has a short SLO (like 5 minutes), the entire on-call experience will be governed by that SLO, because you have no idea what will break or when!

In these situations, planning for the worst-case scenario can lead to burnout of your on-call staff. You’re better off planning for the most-likely scenario and using the escalation path to lend an assist when response time is critical. The primary can acknowledge the alert notification, but then immediately begin using the escalation path to find someone who might be in a better position to respond within the SLO.

If this happens repeatedly, you might be tempted to alter your on-call policy for faster response time. Resist that urge and instead alter your priorities so that the system isn’t crashing so regularly. I’ll talk a bit more about prioritization later in the book.

6.3.3 Time to resolve

Time to resolve is a simple measure of how long things can be broken. This SLO can be a little fuzzy because, obviously, you can’t create a bucket large enough to encompass every type of conceivable failure.

Time to resolve should serve as a touch point for communication purposes about the issue. If you can resolve the problem within the SLO, congratulations! Go back to sleep and tell everyone about it at stand-up tomorrow. But if you’ve violated the SLO for time to resolve, this is the point where you should begin notifying additional people about the incident.

Again, in this scenario each service might have a different SLO with differing levels of engagement. Does the alert represent a service being completely down or just in a degraded state? Does the alert impact customers in any way? Understanding the impact to key business indicators or deliverables will influence the time-to-resolve SLO per service.

6.4 Defining alert criteria

Now that I’ve defined what an on-call rotation is, I want to spend a little time talking about what makes the on-call process useful, starting with alert criteria. It can be easy to fall into the trap of sending alerts for all the possible scenarios you can think of that might seem bad. Because you’re thinking of these items in a vacuum, you’re not attuned to how problematic some alerts can be.

What seems like an abnormal condition on the whiteboard might be a condition that your system enters into and out of rapidly and regularly. You might not fully understand all the necessary contextual components that define a scenario. For example, if you’re driving on a long, empty highway, having a quarter of a tank of gas can be a pretty bad scenario. But if you have a quarter of a tank in the city, it’s much less of an alarming state. Context matters.

The danger of creating an alert without context is that after an alert is created, it is somehow imbued with a sense of finality. Because of some deep, unknown psychological trauma that afflicts on-call staff members, it can be nearly impossible to convince people to remove a defined alert. People start to recall old stories of “Remember that one time when the alert was right!” while discounting the other 1,500 times it was wrong.

Take the example from earlier in the chapter, where Raymond received a CPU alert. CPU sitting that high for an extended period sounds like something you’d want to alert on when you’re designing alerts, but your mileage may vary depending on your workload.

So, let’s talk about what makes a good alert. First, an alert should have some sort of associated documentation. It might be details that go directly into the alert or it might be a separate document that explains the steps that should be taken when someone receives the alert. This documentation in all its various forms is collectively known as the runbook.

The runbook documents not only how to resolve the issue, but also why the alert is a good alert in the first place. A good alert has the following characteristics:

  • Actionable --When the alert triggers, it points to the problem and a path for the solution. The solution should be defined in the alert message or in the runbook. Linking directly to the runbook inside the alert notification eliminates the difficulty of finding the correct runbook for a given scenario.

  • Timely --The alert does as little forecasting on the impact as possible. When the alert fires, you feel confident that you need to investigate it immediately instead of waiting five minutes to see if the alert clears on its own.

  • Properly prioritized --It’s easy to forget that alerts do not always have to wake someone in the middle of the night. For alerts that are needed for awareness, convert those to lower-priority notification methods such as email.
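
For illustration, here’s what those three characteristics might look like captured in a pair of alert definitions. The field names and URLs are hypothetical; every alerting tool spells this differently, but the ingredients are the same.

# A high-urgency alert: sustained condition, direct runbook link, pages someone.
checkout_latency_alert = {
    "name": "Checkout p95 latency above 2s for 10 minutes",
    "condition": "p95(checkout_response_time) > 2s for 10m",  # timely: sustained, not a blip
    "priority": "high",                                        # wakes up the on-call engineer
    "runbook": "https://wiki.example.com/runbooks/checkout-latency",  # actionable
}

# An awareness alert: still worth knowing about, not worth waking anyone for.
disk_growth_alert = {
    "name": "Database disk projected to fill within 7 days",
    "condition": "forecast(db_disk_free) <= 0 within 7d",
    "priority": "low",                                         # properly prioritized: email only
    "runbook": "https://wiki.example.com/runbooks/db-disk-growth",
}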

With these three items, you can construct questions for yourself while crafting the alert criteria:

  • Can someone do something with this alert? If so, what should I recommend they look at in the system as part of their research?

  • Is my alert too sensitive? Will it possibly autocorrect itself, and if so, in what time period?

  • Do I need to wake someone up for this alert, or can it wait until morning?

In the next section, I’ll begin discussing how to craft alerts and thresholds with each of these questions in mind, to ensure you’re creating useful alerts.

6.4.1 Thresholds

At the center of most alerting strategies are thresholds. Defining upper and lower thresholds for a metric is important because in some cases something being underutilized is just as dangerous as something being overutilized. For example, if a web server isn’t processing any requests, that could be just as bad as receiving too many requests and becoming saturated.

The hard part about identifying thresholds is deciding what makes a sane value. If your system is well understood enough to accurately define these values, you’re in a better position than most. (In most places, performance testing is always scheduled for next quarter.) But if you’re not sure, you’ll need to rely on empirical observations and tweak your settings as you go.

Start by observing the historical performance of the metric. This will give you a baseline of understanding where the metric can live in times of good performance. Once the baseline has been established, you should pick a threshold alert that’s about 20% higher than these baseline numbers.

With this new threshold set, you’ll want the alerting mechanism to issue low-priority alerts. No one should be awakened for these metrics alone, but notifications should happen. This allows you to review the metric and evaluate whether the new watermark is problematic. If it’s not, you can raise the threshold by another percentage. If an incident did happen at this threshold level (or worse, below it), you can adjust the threshold based on what you learned from that incident.
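
As a rough sketch of that process, suppose you’ve exported a week of datapoints for the metric. The sample values below are invented, and I’m using the observed peak as the baseline; an average or a high percentile would work just as well.

# Invented week of CPU utilization samples (percent) exported from a metrics tool.
cpu_samples = [38.0, 42.5, 40.1, 55.3, 47.8, 44.2, 60.9, 52.4]

baseline = max(cpu_samples)            # peak observed during normal operation
threshold = round(baseline * 1.20, 1)  # start roughly 20% above the baseline

# Wire this threshold to a low-priority notification, watch it for a while,
# and then raise it (or react to it) based on what you observe.
print(f"Baseline peak: {baseline}%, initial low-priority threshold: {threshold}%")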

This technique is also useful for becoming aware of growth trends in your infrastructure. I’ve often set up threshold alerts not as a mechanism to detect problems, but as a checkpoint on growth. Once the alarm starts firing, I realize that the demand on the system has grown X%, and it gives me an opportunity to begin looking at capacity-planning options.

Thresholds will always be a work in progress. As demand grows, capacity will grow, and that added capacity will change your thresholds. And sometimes basic thresholds on an individual metric aren’t enough; you must combine two signals into one to ensure that you’re alerting on something meaningful.

New alerts, when there is no baseline

If you’re just starting out on your metrics journey, you might be living in a world with no historical performance to point to. How do you go about creating an accurate threshold when you have no idea about past performance? In this scenario, creating an alert is probably premature. Get the metric out there and start collecting data. Just having the data puts you in a better place than you were previously.

If you’re adding this metric because of a recent outage, you may still want to collect at least a day’s worth of values before you pick a threshold. After you have some data, pick a starting threshold that’s above the 75th percentile of your collected set of values.

Let’s say we’re talking about response time for a database query. If you find that the 75th percentile is 5 seconds, maybe make your initial threshold 15 seconds. This will almost certainly require revising, and I recommend you follow the same iterative approach detailed earlier. But it gives you a starting point to work from and a process for tuning it. If your monitoring tool doesn’t allow you to calculate percentiles, try exporting your data and calculating it in Excel by using its PERCENTILE function.
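
If you’d rather skip the Excel round-trip, the same starting point takes only a few lines of Python. The query times below are made-up sample data, and the 3x multiplier mirrors the 5-second/15-second example above.

import statistics

# Made-up response times (in seconds) collected over a day for one query.
query_times = [2.1, 3.4, 2.8, 5.0, 4.2, 6.1, 3.0, 4.8, 2.5, 7.3]

p75 = statistics.quantiles(query_times, n=4)[2]  # third cut point = 75th percentile

# Start comfortably above the 75th percentile, then revise as data comes in.
initial_threshold = p75 * 3
print(f"75th percentile: {p75:.1f}s, starting threshold: {initial_threshold:.1f}s")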

Composite alerting

High CPU utilization on your web tier could be bad. Or it could be you getting your money’s worth out of the hardware. (I detest super-large servers running at 10% utilization. You should too.) Now imagine you have another metric for checkout processing time. If that metric and the web-tier CPU utilization metric are both alerting, the person on call gets a much more robust picture of what’s happening. This sort of composite alerting is huge, because it allows you to tie the performance of a system component to a potential customer impact.

Composite alerting can also be useful when you know that a problem manifests itself across multiple axes. For example, while managing a reporting server, I measured several different points. Because of the way the reporting service worked, a large report would sometimes generate a long, blocking HTTP call, which spiked the latency metric on the load balancer (because suddenly an HTTP call was taking more than 45 seconds). I’d get an alert, only to find out that someone had run a long query. But occasionally, the same latency alert was a sign that the system was becoming unstable.

To address this, I created a composite alert that would monitor not just for high load-balancer latency, but also for sustained high CPU utilization and signs of memory pressure. These three items individually could indicate a normal system that is just encountering a brief period of load due to a user request. But all three triggering at the same time was almost always a signal of doom for the system. Creating an alert that fired only if all three alerts were in a bad state not only allowed us to cut down on the number of alerting messages, but also gave me confidence that when the composite went off, there was a definite problem that needed to be addressed.
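
Expressed as code, the composite logic is nothing more than an AND across the individual signals. The metric-fetching functions and thresholds below are stand-ins for whatever your monitoring tool exposes.

# Placeholder gauges; in practice these would be queries against your
# monitoring tool's API over a recent window (say, the last 10 minutes).
def lb_latency_seconds():   return 48.0
def cpu_utilization_pct():  return 93.0
def memory_pressure_pct():  return 88.0

LATENCY_LIMIT = 45.0
CPU_LIMIT = 90.0
MEMORY_LIMIT = 85.0

def reporting_server_unstable():
    # Any one of these alone is often just a heavy user request.
    # All three at once was the pattern that reliably preceded trouble.
    return (
        lb_latency_seconds() > LATENCY_LIMIT
        and cpu_utilization_pct() > CPU_LIMIT
        and memory_pressure_pct() > MEMORY_LIMIT
    )

if reporting_server_unstable():
    print("Page the on-call engineer: composite condition met")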

Not all tools support composite alerting, so you’ll need to investigate the options at your disposal. You can create composite alerts on many monitoring tools, or you can handle the composite logic on the alerting side with the alerting tools I mentioned previously.

6.4.2 Noisy alerts

If you’ve ever participated in an on-call rotation, you know that a percentage of alerts probably are completely useless. Although they are well-intentioned, these alerts don’t deliver on any of the three criteria that I’ve defined for a good alert. When you come across a useless alert, you need to put as much energy as possible into either fixing it to be more relevant or just completely deleting it.

I know some of you may be wondering, “But what happens when there’s an outage and this alarm could have given us a warning!” It’s true that this sometimes occurs. But if you really look at the effectiveness of an alert that has cried wolf one too many times, you must admit that the likelihood of someone treating it as critical when it goes off is small. Instead, you roll over and say to yourself, “If it goes off again, then I’ll get up and check it.”

If you’ve ever been to a hospital, you’ll realize that machines make noises all day long. Nurses aren’t running back and forth between rooms in a tizzy, because they’ve become completely desensitized to the sounds. Everyone reacts to major alarms, but for the most part blips and beeps hum throughout the day without any real reactions from the staff they’re intended to keep updated. Meanwhile, as a patient unattuned to those sounds, you’re going into a panic every time a machine starts flashing lights.

The same thing happens in technology. I’m sure you have your alarm that says, “The website is down,” and that alert triggers an intense triage process. But that low disk space alert that fires every night is usually ignored, because you know that when the log-rotate script runs at 2 a.m., you’ll reclaim most of that disk space.

As a rule of thumb, you should be tracking the number of notifications that go out to on-call staff members per on-call shift. Understanding how often team members are being interrupted in their personal lives is a great barometer not only for the happiness of your team (which I’ll discuss a bit more later in this chapter), but also for the amount of nonsense alerting that occurs in your organization.

If someone is being alerted 75 times per on-call shift, that’s a level of interruption that you should feel throughout the tech organization. With that many notifications per shift, if the pain of system instability isn’t felt outside the on-call team, then chances are you’re dealing with a noisy alerting system.

Noisy alerting patterns

Earlier in this chapter, I discussed three attributes that make a good alert. The alert must be

  • Actionable

  • Timely

  • Properly prioritized

Timeliness of an alert can often help silence noisy alerts while still preserving their value. The catch is, like everything in technology, there’s a trade-off. Most alerts are designed to fire the moment they detect a bad state. The problem with this approach is that our systems are complex, fluid, and can move through various states relatively quickly. A bad state can be rectified moments later by an automated system. Let’s go back to the disk space example.

Imagine a system that is being monitored for low disk space. If available storage drops below 5 gigabytes, the system alerts and pages someone. (1995 me could only imagine having to worry about 5 GB free, but I digress.) Now imagine that this system backs up to local disk before shipping the backup to another storage mechanism like Amazon Simple Storage Service (S3) or a Network File System (NFS) backup mount. Disk usage would spike temporarily but eventually be reclaimed when the backup script cleaned up after itself or after the logrotate command ran and compressed files before rotating them out. When you page on the immediate detection of a state, you send a page that isn’t timely, because the state is temporary.

Instead, you could extend the detection period of that state. Say you make a check every 15 minutes. Or maybe you send the alert only after four failed checks. The downside is that you could potentially send an alert out 45 minutes later than you normally would have, losing time you could have spent recovering. But at the same time, when you receive the alert, you have confidence that this is something that needs action and resolution, instead of snoozing the alert a couple of times, hoping that it takes care of itself. Depending on the alert, it can be better to get an alert late but know that you need to act.
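
A sketch of that approach for the disk space example might look like the following. The 15-minute interval and four-check requirement mirror the numbers above, and the check itself is just a placeholder.

import shutil

CHECK_INTERVAL_MIN = 15       # how often the scheduler runs this check
FAILURES_BEFORE_PAGE = 4      # 4 checks x 15 minutes = up to 45 extra minutes of patience
LOW_DISK_BYTES = 5 * 1024**3  # the 5 GB line from the example above

consecutive_failures = 0

def disk_space_low(path="/"):
    # Placeholder check: free space below the 5 GB line.
    return shutil.disk_usage(path).free < LOW_DISK_BYTES

def run_check():
    # Called by a scheduler every CHECK_INTERVAL_MIN minutes.
    global consecutive_failures
    if disk_space_low():
        consecutive_failures += 1
    else:
        consecutive_failures = 0  # state recovered; the backup or logrotate cleaned up
    if consecutive_failures >= FAILURES_BEFORE_PAGE:
        print("Page the on-call engineer: disk has stayed low across four consecutive checks")

run_check()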

Of course, this doesn’t work for all types of alerts. Nobody wants a “website down” alert 30 minutes late. But the criticality of the alert should also dictate the quality of the signals necessary to detect it. If you have a noisy alert for a critical part of the system, your focus should be on increasing the quality of the signal to detect the condition. You should define custom metrics that are emitted so that you can alert on a strong signal (see chapter 3).

For example, if you’re trying to create a system-down alert, you wouldn’t want to compare CPU utilization, memory utilization, and network traffic to determine if the system might be having an outage. You would much prefer that some automated system is attempting to log in to the system like an end user and perform a critical action or function. That task might be more intensive to perform, but it’s necessary because the criticality of the alert requires a strong, definitive signal (The system is down! Wake everyone up!) versus needing to infer the condition through multiple metrics.
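
A synthetic check along those lines might look like the sketch below, written with the requests library for brevity. The URL, the credentials, and the “critical action” endpoint are all placeholders; the point is that the check exercises the same path a real user would.

import requests

BASE_URL = "https://shop.example.com"  # placeholder for your customer-facing site
CREDENTIALS = {"user": "synthetic-monitor", "password": "pulled-from-a-secret-store"}

def critical_flow_works():
    # Log in like a user and exercise one critical action. If this fails,
    # the system is down for real users, whatever the CPU graphs say.
    try:
        session = requests.Session()
        login = session.post(f"{BASE_URL}/login", data=CREDENTIALS, timeout=10)
        if login.status_code != 200:
            return False
        checkout = session.get(f"{BASE_URL}/cart/checkout", timeout=10)
        return checkout.status_code == 200
    except requests.RequestException:
        return False

if not critical_flow_works():
    print("Wake everyone up: the critical user flow is failing")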

Imagine that your car doesn’t have a fuel meter. Instead it tells you how many miles you’ve driven, how many miles since your last fill-up, and the average number of miles per full tank. You wouldn’t want that car! Instead you’d pay a little extra to have a sensor installed that measures the fuel level directly. That’s what I mean when I say choose a quality signal for critical functions. The more important the alert, the greater the need for a quality metric to base it on.

The other option for dealing with noisy alerts is simply to delete them. Now, if you have children or other people who depend on your livelihood, you might be a little concerned with just shutting off alerts. (As of this writing, the job market is pretty hot, so you might be willing to live dangerously.)

Another option is to mute them or lower their priority so they’re not waking you up in the middle of the night. Muting the alert is a nice option because a lot of tools will still provide the alert history. If you do encounter a problem, you can look at the muted alert to see if the problem would have been detected and what value the alert could have brought. It also lets you see how many times the alert would have fired erroneously.

If your tool doesn’t support that kind of functionality, you can change the alert type from an automated wake-up call to a nice quiet email. The nice thing about this approach is that the email being sent can be used as an easy datapoint for reporting. Every email is an avoided page, an avoided interruption at dinner, an avoided break in family time. But your email client also becomes an easy reporting tool for the paging frequency (assuming your alerting tool doesn’t have one).

Group your emails by subject, sender, and date, and you have an instant snapshot into the frequency of alarms. Combine that with the number of times you had an actionable activity around the page, and you can quickly develop a percentage of alert noise. If you received 24 pages and only 1 of them had an actionable activity, you’re looking at a noisy alert rate of approximately 96%. Would you follow stock advice that was wrong 96% of the time? Would you trust a car that wouldn’t start 96% of the time? Probably not. Of course, if you don’t take the next step of examining the effectiveness of these email alerts, you’ll instead write an email filter for the alert that automatically files it into a folder and you’ll never see it again. Be better than that.
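
The math itself is trivial to script once you have the counts, whether they come from grouped emails or from your alerting tool’s reports. The numbers below are the 24-page example from above.

# Counts pulled from grouping alert emails by subject, sender, and date.
total_pages = 24
actionable_pages = 1

noise_rate = (total_pages - actionable_pages) / total_pages
print(f"Noise rate for this alert: {noise_rate:.0%}")  # roughly 96%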

Noisy alerts are a drag on the team and don’t add any value. You’ll want to focus as much energy as possible on quantifying the value of the alert along with creating actionable activities when the alert fires. Tracking the noise level of an alert over time can be extremely valuable.

Using anomaly detection

Anomaly detection is the practice of identifying outliers in the pattern of data. If you have an HTTP request that consistently takes between 2 and 5 seconds, but then for a period it’s taking 10 seconds, that’s an anomaly.

Many algorithms for anomaly detection are sophisticated enough to change the range of accepted values based on time of day, week, or even seasonality. A lot of metric tools are starting to shift toward anomaly detection as an alternative to the purely threshold-based alerting that has dominated the space. Anomaly detection can be an extremely useful tool, but it can also devolve into something just as noisy as standard threshold alerting.

First, you’ll need to ensure that any alert you create based on anomaly detection has enough history for the algorithm to do its work. In ephemeral environments where nodes get recycled often, there sometimes isn’t enough history on a node for the algorithm to make accurate predictions about what an anomaly is. For example, when looking at the history of disk space usage, a sudden spike might be flagged as anomalous if the algorithm hasn’t seen a full 24-hour cycle for this node. The spike might be perfectly normal for this time of day, but because the node has existed for only 12 hours, the algorithm may not recognize that and will generate an alert anyway.

When designing anomaly-based alerts, be sure that you think about the various cycles that the metric you’re alerting on goes through and that there will be enough data for the algorithm to detect those patterns.
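
To see why the history matters, here’s a toy anomaly check that compares the current value against what the same hour of day has looked like before. With only 12 hours of samples, the hour in question may have no history at all; this sketch sidesteps the problem by refusing to judge without enough data, which is one way to avoid the false alert described above.

import statistics
from collections import defaultdict

# history maps hour-of-day -> disk usage values (GB) previously seen at that hour.
# In a real system, this would come from your metrics store.
history = defaultdict(list)
for hour, usage in [(1, 40), (1, 42), (2, 41), (2, 43), (3, 80), (3, 82)]:
    history[hour].append(usage)

def is_anomalous(hour, value, tolerance=3.0):
    samples = history[hour]
    if len(samples) < 2:
        # Not enough history for this hour (think: a 12-hour-old node).
        # Better to decline to judge than to page someone at 4 a.m. on a guess.
        return False
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples) or 1.0
    return abs(value - mean) / spread > tolerance

print(is_anomalous(3, 81))  # False: the 3 a.m. backup spike is normal for that hour
print(is_anomalous(2, 81))  # True: that usage is far outside the 2 a.m. pattern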

6.5 Staffing on-call rotations

One of the most difficult parts of creating an on-call rotation is staffing it appropriately. With a primary and a secondary on-call role, you can quickly create a scenario where people feel they’re constantly on call. Knowing how to staff teams to deal with on-call shifts is important to maintaining the team’s sanity.

The on-call rotation size cannot be strictly dictated by the size of the team. There are a few items you must consider. For starters, the on-call rotation is like porridge: not too big, not too small. It must be just right.

If your on-call rotation is too small, you’ll have staff members quickly burning out. Remember that being on call is a disruption to people’s lives, whether they receive an off-hours alert or not. But at the same time, having a team too large means that people are part of the on-call rotation too infrequently. There’s a kind of rhythm to the on-call process that requires the ability not only to function at odd hours of the night, but also to understand the trends of the system over time. Staff members need to be able to evaluate whether an alert is problematic or is signaling the beginning of a larger potential problem.

The minimum long-term size of an on-call rotation is four staff members. In a pinch, or temporarily because of attrition, a team of three can be used for brief periods, but in all honesty, you don’t want to go below four staff members for the rotation. When you consider what the rotation entails, each team member realistically serves as the primary on-call person and then the secondary on-call person, usually back-to-back. With a four-person rotation, this means an engineer is on call twice per month. Depending on your organization, the stress of being secondary on call might be significantly less than being primary, but it still has the potential to disrupt someone’s personal life.

The interruptions aren’t just personal. On-call duties can strike during business hours as well, stealing away precious time working on the product to deal with firefighting issues. The mental penalty for switching from project work to on-call work and back again is often overlooked. A 15-minute disruption can result in an hour or more of lost productivity as engineers try to mentally shift between these modes of work.

A minimum rotation of four people might be easy for some organizations that have a large department to pull from. But for smaller organizations, coming up with four people to be part of the on-call process might be daunting. In this case, you may have to pull people from various other groups to participate in the rotation.

When you have representation from multiple teams, you run the risk of having someone on call who doesn’t have all the access they need to resolve an issue. In an ideal world, you’d want only the teams directly responsible for a service in the on-call rotation. But in the exploding world of smaller and more numerous services, that team might have only two engineers on it, plus a product person and a QA engineer. That’s a big on-call responsibility for two people.

The need to potentially expand your support team beyond the immediate service creators is going to put a heavy emphasis on your automation practices. The desire to minimize the number of people with direct production access is in direct conflict with needing to integrate people from other engineering groups into the on-call rotation. The only fix here is automating the most common tasks used in troubleshooting so that the on-call engineer has access to enough information to properly triage the issue.

Note that I used the word “triage” and not “resolve.” Sometimes in an on-call scenario, an immediate resolution might not be the answer to a page. There’s benefit to be gained just from having a human evaluate the situation and decide whether it’s something that needs to be escalated or something that can remain in its present state until the right staff members are available to handle it during working hours.

Being alerted for an issue, but not having the tools or access necessary to fix it, is not an optimal position to be in. The only thing worse than being alerted and not having the access necessary to fix the problem is having the access but being interrupted in the middle of a movie theater right before the big kiss or fight scene or villain monologue or however your favorite type of movie typically resolves itself. If the options are to beat a small group of people into on-call submission by having a short rotation, or to add people who might not be able to fix every problem they’re paged for, the lesser of the two evils is clear.

You might find yourself in an exceptionally small organization with only two or three engineers in the entire company! There’s a need for on-call rotations, but no staff to hit the minimums. In that case, all I can really say is, “Congratulations, you’re probably a startup co-founder!” But seriously, in those situations being scrappy is the name of the game. The best solution is to ensure that you’re directly and swiftly addressing the problems that come up as part of on-call through your sprint work. In my experience, with groups this small, not only do they own development and on-call rotations, but they also own prioritization. That allows for rapid fixes to nagging problems.

As I alluded to earlier, teams can also be too big for on-call rotations. If you have a team of 12 engineers rotating on call, that means, assuming a one-week rotation, that you’re on call only about three times per year. I can’t think of anything that I do only three times a year that I can maintain proficiency at. Being on call is a muscle. Being able to be effective in crunch time is an important skill to have. But if you’re on call once a quarter, think of all the small things that you don’t do regularly that could become a problem. Are there specific activities that are done only when you’re on call? How good are your skills around incident management if you use them only once a quarter? Where is that wiki document that has all the notes for how to solve specific problems? If you think hard, you’ll probably come up with a bunch of things that aren’t done regularly when you’re not on call. All these tasks can add time to your recovery efforts because someone is out of practice.

Another downside to having a large on-call rotation is that the pain of being on call is too dispersed. I know that sounds weird, so just hear me out for a second. When you’re on call every four or five weeks, the pain of the on-call process can become familiar. The nagging issues that pop up are frequent enough to you and to all the other members of the on-call rotation that you feel motivated to resolve them. But if the rotation is once per quarter, that pain gets diluted between your on-call rotations and becomes a permanent piece of technical debt. The squeaky wheel gets the grease, unless you use that wheel only four times a year.

Given those concerns, what is the optimal maximum size of an on-call rotation? You’ll have to take into account just how many interruptions your on-call process generates, but my rule of thumb is to keep rotations no larger than eight people, preferably around six.

Once you get more than six people per on-call rotation, I recommend splitting up services and creating multiple on-call rotations around a grouping of services or applications. Depending on how your teams are organized, this split may be logical based on your organizational structure. But if there isn’t an obvious breakdown, I recommend reporting on all of the on-call notifications that have been generated and grouping them by application or service. Then spread the services based on alert counts across the teams.
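
One way to do that split is a simple greedy balance: sort the services by how often they page, then repeatedly hand the loudest unassigned service to whichever rotation is currently carrying the lightest load. The alert counts below are invented.

# Invented alert counts per service over the last quarter.
alert_counts = {
    "checkout": 120, "search": 95, "billing": 60,
    "catalog": 40, "auth": 35, "reporting": 25,
}

rotations = {"rotation-a": [], "rotation-b": []}
totals = {name: 0 for name in rotations}

# Loudest services first, each assigned to the quietest rotation so far.
for service, count in sorted(alert_counts.items(), key=lambda kv: kv[1], reverse=True):
    target = min(totals, key=totals.get)
    rotations[target].append(service)
    totals[target] += count

print(rotations)  # {'rotation-a': ['checkout', 'catalog', 'reporting'], 'rotation-b': ['search', 'billing', 'auth']}
print(totals)     # {'rotation-a': 185, 'rotation-b': 190}, a roughly even alert load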

This might cause a bit of an imbalance when you realize that some of your engineers who are the most well versed in a technology are not on the team supporting that technology after hours. Not only is this OK, but I encourage it. You don’t want the expertise of a technology to be concentrated in a single engineer.

After-hours on-call support gives other engineers an opportunity not only to get involved with the technology, but also to be exposed to the underbelly of the technology stack through incidents. A dirty little secret about operating production software is that you learn more when it breaks than when it operates normally. Incidents are incredible learning opportunities, so having someone who isn’t the expert on a technology participate in the on-call rotation can be a great way to help them level up their skills. Have the expert’s number on speed dial, just in case, and continue to update your runbooks for each incident.

6.6 Compensating for being on call

I have a philosophy around compensation for on-call rotation. If engineers are grumbling about compensation, chances are it’s because your on-call process is too onerous. Fairly compensated, salaried professionals seldom gripe about having to work the occasional few hours outside the normal work week. This is a trade-off that gets made in exchange for the flexibility that is often accompanied by being a salaried professional. This isn’t to say that compensation isn’t deserved by those on call, but if your staff is complaining about it, I would argue that you don’t have a compensation problem as much as you have an on-call problem.

Most people I’ve talked to who have a decent on-call experience are content with the unofficial compensation strategies that their managers employ. That said, it’s often beneficial to have some sort of official compensation package for on-call staff members. Just remember that regardless of how little or how much an on-call person is interrupted, if they want official compensation, then they deserve it and should be entitled to it.

I once had to cancel an on-air radio segment because I couldn’t find someone to swap on-call duty with. I never got alerted, but I could have been. How would I handle that on the air? Even if people aren’t being alerted after hours, it still places demands on their time. I’ve encountered a few compensation options that seem to work well.

6.6.1 Monetary compensation

Cash is king. A monetary compensation policy has a few benefits. For starters, it lets employees know that you truly appreciate the sacrifice that they’re making for the organization. The extra cash can also be used to incentivize volunteers for the on-call rotation in the event it’s not mandated as part of the employee agreement. (And even if it is mandated, you should still consider and implement a compensation strategy of some sort.) To keep it easy, a flat bonus can be applied to the employee’s payroll during on-call weeks. This keeps things simple and predictable.

Sometimes when working with monetary compensation, organizations will add an hourly rate for when you receive a page and must perform after-hours work. This comes with the added benefit of putting a monetary value on the incidents that get generated after hours. As a manager in charge of a budget, I might be content with an application paging out when it needs a reboot a few times a week. But when I’m on the hook for the rising on-call expenses, I gain a new incentive to develop a more permanent solution. (It’s always about incentives.)

By the same token, some people worry that this disincentivizes workers from solving problems long-term because they’d lose out on a financial reward. But the truth is, the disruption to everyday life probably outweighs any financial compensation that comes because of said disruption. In my experience, this is a nonissue, but admittedly, your mileage may vary.

A downside to tying compensation to hours worked after the regular workday is that it becomes an enormous accounting problem. Now I must track how many hours I worked during the incident. Do I include the hours when I thought the problem was fixed, but I was on eggshells watching the system and hoping the problem didn’t return? Do I include the time from when I was paged or just the time I was actually working? Maybe I was at an event and had to travel somewhere with my laptop to get to a suitable location to perform work. All this time tracking becomes a bit of a bore and can lead to resentment from employees.

6.6.2 Time off

I’ve seen a lot of on-call rotations that are compensated with additional time off. This time off can be accrued in different structures, such as receiving an additional day of personal time off (PTO) for every week you’re on call (or sometimes only on weeks during which you received a page). This approach can work for a lot of teams, especially because as incomes rise, time becomes a much more precious resource than the monetary stipend you might get from financially compensated on-call teams (depending on how much that compensation is).

Using time off as a compensation tool has a catch, and that’s how official it is. A lot of smaller organizations don’t have an official on-call policy. It’s a technical team necessity but doesn’t always get addressed by the organization at large. As a result, these time-off compensation strategies are often handled as an unofficial agreement between the on-call staff and their manager.

Assuming a healthy relationship between manager and staff, this might not be an issue. But where does time off get tracked? What happens if your manager leaves? How does that on-call PTO balance get transferred? If you decide to bank two weeks of on-call PTO, how do you submit that into the human resources (HR) reporting system to use as vacation time? If you leave the organization, how do you transfer the value of that saved-up PTO into something that you can take with you? How do you deal with people who pick up an extra on-call shift?

These are common issues that arise when dealing with a time-off compensation strategy for on-call duty. The following are a few strategies that make this a bit easier to manage:

  • Ensure that HR is involved with on-call compensation talks. This allows the organization to make the compensation an official act as opposed to being handled differently by managers all throughout the organization. It also prevents a new manager from suddenly changing the arrangement without a formal process.

  • Don’t allow accrual of on-call PTO. If your HR team isn’t involved with the on-call compensation, just ensure that all staff members plan to use the on-call time as it’s earned. This prevents issues with accrued time off, but does create issues for management around resource planning, because the team has lost productivity baked into its schedule. Assuming a day of PTO per on-call week and roughly one on-call week per person per month, you’re looking at 19 working days per team member per month, before any planned time off is considered. It sounds insignificant but adds up to roughly 2.5 working weeks per year. Across a team of four, that’s roughly 10 weeks a year lost to on-call compensation. To make matters worse, the staff has probably worked those 10 weeks plus some, but doing relatively unproductive work (where productivity is measured as new features, capabilities, and value).

  • If you must accrue on-call PTO, log it somewhere. A shared wiki page or spreadsheet is probably the easiest solution.

6.6.3 Increased work-from-home flexibility

Another disruptive portion of being on call is the time period between getting ready to go to work and sitting at your desk. If you’ve ever had to field an on-call situation while simultaneously trying to get your kids off to school, you can understand the frustration. I’ve met some teams that have elaborate schedules to stagger the arrival of staff to ensure that they don’t have everyone trapped on a commute when the system decides to eat itself. Providing increased work-from-home time during an on-call week is another way of dealing with this.

Allowing a team member to work from home during their on-call week can help solve the commute problem, as well as give the team member back some of the flexibility that’s robbed from them during the on-call week. I’ve had to cancel personal errands because I was interrupted by an important on-call issue. I remember sitting at my kitchen table with my coat still on, working on a “quick fix” problem, only to realize later that I’d been at it for over an hour and the place I was going to had closed. When the option to work from home during an on-call week is available, I can run some of those personal errands during my lunch break or take a quick 15-minute break during the day to handle them. Because my other team members are in the office to handle any issues that arise, it doesn’t become a burden on anyone.

If the flexibility of working from home isn’t typically an option in your organization, this can be an excellent way to give workers some freedom. But the question arises, is it fair or equitable? With on-call duty causing unscheduled interruptions in a person’s life, it seems a little callous to be rewarded by only being able to pick a different office to work from. Some people might find this flexibility a worthy trade-off; others might want that cold, hard cash.

This solution is the most problematic because not everyone values working from home the same way. For instance, I personally prefer to be in the office, because I like the socialization aspect. Sometimes folks have other reasons for needing to be in the office, making the work-from-home perk for a given week more problematic than helpful. They could have a lunch date with a friend or an important meeting that will be difficult to attend remotely. If you plan to use this option, consider having a conversation with team members before you presume it will be welcomed by all.

In my experience, making the on-call experience as painless as possible is the best way to help in the on-call compensation conversation. To understand the pain of the on-call process, however, you need to be paying attention to more than just the number of pages, and understand the impact it has on people.

6.7 Tracking on-call happiness

Not all on-call interruptions are created equal. Looking at a report of the on-call statistics for a given period, you might learn that there were 35 on-call alerts during a reporting period. But that number doesn’t tell the entire story. Were all 35 pages on a single night? Were they primarily during the day? Were they spread across the rotation, or did one person receive a disproportionate number of those alerts?

These are all questions that you should be asking when evaluating your (or your team’s) on-call experience. The answers can grant some powerful insights into the on-call process, and they give you data to justify increasing efforts to reduce the on-call burden or increasing the level of compensation.

A few pieces of information need to be tracked in order to get a sense of the on-call experience on your teams:

  • Who is being alerted?

  • What level of urgency is the alert?

  • How is the alert being delivered?

  • When is the team member being alerted?

Tracking each of these categories will give you a lot of solid insight.

6.7.1 Who is being alerted?

For this, you’ll need to be able to report beyond just a team level or service level. You’ll want to know specifically which team member is fielding the alert. This is probably the most important piece to identify, because you’re attempting to understand the impact on individuals, not just systems.

It’s easy to lump alert counts into teams or groups, but the truth is, some systems follow specific patterns. In a four-person rotation, it’s not uncommon for team members to always have their on-call duty fall on a particular time of the month. If the billing process always runs the last week of the month, then the fourth person in the on-call rotation might disproportionately see pages related to that business event or cycle.

I’m certain that these signals will show up in one of the other areas that you’re reporting on. For example, the number of incidents could jump in the fourth week of the month, prompting some investigation. But the idea is to view this data from the perspective of people. Rolling that data up could mask the impact it’s having on an individual.

6.7.2 What level of urgency is the alert?

An alert might come in that represents signs of far-off impending doom. A database server might be days away from running out of disk space. A trend you’ve been tracking has crossed a threshold you’ve defined and warrants investigation. A server might have gone without patching beyond a certain number of days. These are all alerts of things that are wrong in the environment, but they don’t require immediate action. The alerts are more informative and as a result can be less disruptive to the on-call team member.

Contrast that with an alert that says, “Database system is unstable.” That alert is not only a bit vague, but also alarming enough to warrant looking into immediately. These alerts are of a different urgency and have different response expectations. Letting an unstable database alert sit until morning could have drastic consequences for your company’s business and for your continued employment with them. These are the high-urgency alerts that create pain and friction in an employee’s life. Keeping track of the number of high-urgency and low-urgency alerts that a team member receives will help you understand the impact of that page.

6.7.3 How is the alert being delivered?

A low-urgency alert isn’t always delivered in a low-urgency fashion. Depending on your configuration, nonurgent alerts could be delivered in the most disruptive fashion possible. Even if an item is low urgency, the phone call, the text alert, the middle-of-the-night wake-up push notification--all these signals create stressors for on-call engineers. These stressors lead to more frustration, lower job satisfaction, and, eventually, folks responding to recruiters on LinkedIn.

I can remember a time when I reviewed my on-call reports to find that one member on the team was getting a disproportionate number of alerts notifying him via phone calls. The metric was a pretty high outlier. I looked at the general number of alerts that went out during his on-call shift and saw that it wasn’t any higher than other team members; he just had more interrupting, phone call-based alerts. After a little investigation, I discovered that his personal alerting settings were set to alert via phone call on all alerts, regardless of urgency. He was unknowingly making his on-call experience worse than everyone else’s. Tracking this metric was the key to helping him achieve a more relaxed on-call experience.

6.7.4 When is the team member being alerted?

An alert at 2 p.m. on a Tuesday doesn’t bother me as much as an alert at 2 p.m. on a Sunday. When team members are being alerted after hours, it obviously creates a drag on their experience. You don’t need to create large elaborate buckets for this because there are only three periods of time that you care about:

  • Working hours--This is the time period you would normally be in the office. I’ll assume 8 a.m. to 5 p.m., Monday through Friday, in this scenario.

  • After hours--In this period, you would normally be awake but not working. Think of this generically as 5 p.m. to 10 p.m., Monday through Friday; and 8 a.m. to 10 p.m., Saturday and Sunday. Obviously, your personal schedule will alter these windows.

  • Sleeping hours--These are the hours of rest, when a phone call has a high likelihood of waking you from sleep; think 10 p.m. to 8 a.m. every day of the week.

By grouping interruptions into these buckets, you can get a sense of just how much disruption you’re causing. The reports can also be trended over time to see whether on-call rotations are getting better or worse, beyond just a blind “number of alerts” metric.
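Bucketing alerts this way is straightforward to script. Here’s a minimal sketch that classifies an alert timestamp into the three periods above, using the example boundaries (adjust them to your own schedule):

from datetime import datetime

def alert_period(ts: datetime) -> str:
    """Classify an alert time as working, after, or sleeping hours."""
    # Sleeping hours: 10 p.m. to 8 a.m., every day of the week.
    if ts.hour >= 22 or ts.hour < 8:
        return "sleeping hours"
    # Working hours: 8 a.m. to 5 p.m., Monday (0) through Friday (4).
    if ts.weekday() < 5 and 8 <= ts.hour < 17:
        return "working hours"
    # Everything else: awake but not working.
    return "after hours"

print(alert_period(datetime(2021, 3, 2, 14, 0)))  # Tuesday 2 p.m. -> working hours
print(alert_period(datetime(2021, 3, 7, 14, 0)))  # Sunday 2 p.m. -> after hours
print(alert_period(datetime(2021, 3, 3, 4, 0)))   # Wednesday 4 a.m. -> sleeping hours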

If you’re using any of the major tools I mentioned previously (such as PagerDuty, VictorOps, or Opsgenie), they all have options for this sort of reporting out of the box. If not, I recommend making this data highly visible yourself. If you’re doing custom alerting, consider emitting a custom metric for every page or alert so that you can build dashboards in your existing tooling. Radiating this information in an easily accessible dashboard or report will help draw attention to it.
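If you do roll your own, one lightweight option is to emit a StatsD-style counter for every page, tagged with the dimensions discussed in this section. The sketch below uses the DogStatsD tag extension over plain UDP; the metric name and tags are illustrative assumptions, not a prescribed schema:

import socket

def emit_alert_metric(responder: str, urgency: str, channel: str, period: str,
                      host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one counter increment per page, in DogStatsD line format, over UDP."""
    tags = f"responder:{responder},urgency:{urgency},channel:{channel},period:{period}"
    payload = f"oncall.page:1|c|#{tags}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))

# Example: record a high-urgency phone-call page during sleeping hours.
emit_alert_metric("raymond", "high", "phone", "sleeping")

Because every page becomes a counter with tags, your existing dashboard tool can slice the data by person, urgency, channel, or time bucket without any extra plumbing.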

6.8 Providing other on-call tasks

The nature of on-call work makes it difficult to focus on project work. Depending on the alert volume during the day, the constant context switching can derail any deep work. Instead of fighting that reality, it’s worthwhile to lean into it: structure on-call duty as more than an after-hours support dumping ground and shape it into an opportunity to make the on-call experience better.

A common additional task for on-call team members is to play the sacrificial lamb for the rotation and focus on all ad hoc requests that come in for the week. By ad hoc, I’m specifically talking about those tickets that come in unscheduled but cannot wait and need to be prioritized before the next work planning session.

An example could be a developer needing temporary elevated access privileges to troubleshoot a production issue. This type of request could be handled by the on-call staff member. It creates a clear line of responsibility and streamlines communication for the requester, who knows that a dedicated person is assigned to perform the task. There is a counterargument, however, that relegating someone to only support work for the week makes the on-call experience even more unattractive. That’s true to a certain extent, but I think you can sweeten the pot a bit by allowing some self-elected project work.

6.8.1 On-call support projects

When you’re on call, you become acutely aware of all the problems that wake you up in the middle of the night or interrupt your family dinner. Consider allowing your on-call staff to skip the normal prioritized work and instead focus on projects that would make the on-call experience a bit better. Allowing this focus during the on-call week has its advantages, because the problems remain fresh in your memory. Ideas for some on-call projects could be as follows:

  • Updating runbooks that are either outdated or don’t exist at all

  • Working on automation to ease the burden of dealing with specific problems

  • Tweaking the alerting system to help guide on-call staffers to the source of potential issues

  • Implementing permanent fixes to nagging problems

And you can probably think of a few more options specific to your organization. Finding a permanent solution to nagging problems is probably going to be the most impactful and rewarding. It’s also the option that’s most overlooked in conventional workflows.

What usually happens is that the on-call person fields a few calls for a problem that has a specific permanent solution, which just needs to be prioritized by the team and worked on. But other firefighting takes place, and as the distance from the issue increases, the likelihood of repairing it decreases. In a one-week rotation with four people, each staff member might be exposed to the problem only once per month. By giving the current on-call engineer the ability to prioritize their own work, they can address the problem while it’s fresh in their memory.

By creating an environment in which on-call engineers have the power to make their jobs better, you not only reduce some of the burden of the process, but also give them a sense of control and ownership over the experience.

6.8.2 Performance reporting

Being on call gives you a view into a running system that many others never quite see. The reality of a system changes significantly between the time it’s being designed and the time it’s running in production. The only people who notice this change are those observing the system while it’s running. For that reason, I think it’s valuable to have the on-call engineer provide a sort of state-of-the-system report.

The state-of-the-system report could be a high-level overview of how the system is behaving week to week. This report might simply review a few key performance indicators and make sure that they’re trending in the correct direction. It’s a good way to discuss the various on-call scenarios that the engineer faced and to bring other disciplines into the conversation. This allows an opportunity for the production support staff and the engineering staff to discuss the problems that are being faced while supporting production (assuming they’re not the same team).

To facilitate this sort of reporting, building a dashboard that highlights all the key performance indicators (KPIs) for the review is a good starting point. The dashboard serves as a set of talking points for the on-call engineer. It might just confirm that things look good, but in some meetings a negative trend or event might warrant further discussion. Overlaying milestones on the dashboard is also beneficial. For example, if a deployment occurred during the on-call week, marking the deployment time on the graphs makes it easy to spot any step changes that followed it.

As an example, figure 6.1 is a graph of freeable memory on a database server. The two shaded bars denote a production deployment of the primary application that uses this database server. The obvious jump in memory might warrant a deeper conversation in the reporting meeting. Without this meeting, the memory issue might not have been noticed because memory never dropped low enough to trigger alerting by the monitoring system.

Figure 6.1 Overlaying deployment events on a graph
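If you’re producing such a graph by hand rather than in a dashboarding tool, shading the deployment window is a one-liner per event. Here’s a sketch using matplotlib; the memory samples and deployment times are made up for illustration:

import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# Illustrative data: hourly freeable-memory samples with a step change.
times = [datetime(2021, 3, 2, 0, 0) + timedelta(hours=h) for h in range(24)]
freeable_mb = [4200] * 10 + [3100] * 14

# Hypothetical deployment window to overlay on the graph.
deploy_start = datetime(2021, 3, 2, 9, 30)
deploy_end = datetime(2021, 3, 2, 10, 0)

fig, ax = plt.subplots()
ax.plot(times, freeable_mb, label="Freeable memory (MB)")
# Shade the deployment window so step changes line up with the event visually.
ax.axvspan(deploy_start, deploy_end, alpha=0.3, label="Deployment")
ax.set_ylabel("Freeable memory (MB)")
ax.legend()
plt.show()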

The power of this meeting is that it systematically enforces some of the regular review and analysis of monitoring that should be happening within your teams, but probably isn’t because of understaffing. (Organizations typically discount the type of routine tasks that should be happening as a matter of good hygiene.) When you build out your reporting dashboard, think of all the things you feel should be monitored on a regular basis but aren’t. Think about the things you should be doing but can’t seem to find the time for.

Imagine if you had to display or report on your system’s security patching every week, highlighting the last time servers were patched. Regular, public admission of a failing brings energy to understanding why it’s failing and what can be done to correct it. If you haven’t applied security patches in nine months because you can’t find the time, admitting that week after week will lead to you getting the time you need. To quote Louis Brandeis, “Sunlight is said to be the best of disinfectants.”

Summary

  • Alerts need to be actionable, timely, and properly prioritized.

  • Noisy alerts create alert fatigue, rendering the alert useless.

  • On-call duty is a disruption to staff members. They should be compensated in some form.

  • Track on-call happiness to understand the level of disruption in an engineer’s life.

  • Structure the on-call rotation so that engineers have time to make permanent fixes.
