Chapter 30. Against On-Call: A Polemic

On-call, as we know it, must end. It is damaging to the people who do it,1 and inefficient as a means of keeping systems running. It is especially galling to see it continuing today given the real potential we have at this historical moment to eliminate it. The time for a reevaluation of what we do when we are on call and, more important, why we do it, is long overdue.

How long overdue? I can find evidence of on-call-style activities more than 75 years ago,2 and in truth there have been people tending computers in emergencies for as long as there have been both computers and emergencies. Yet, though there have been huge improvements in computing systems generally since then,3 the practice of out-of-hours, often interrupt-driven support—more generally called on-call—has continued essentially unaltered from the beginning of computing right through to today. Ultimately, however, whether the continuity is literally from the dawn of computing or whether it is merely from the last few decades, we still have fundamental questions to ask about on-call, the most important of which is why? Why are we still doing this? Furthermore, is it good that we are? Finally, is there a genuine alternative to doing this work in this way? Our profession derives a great deal of its sense of mission, urgency, and, frankly, individual self-worth, from incident response and resolving production problems. It is rare to hear us ask if we should, and I think the evidence clearly shows us that we should not and that a genuine alternative is possible.

But first, let us look at the rationale for doing on-call in the first place.

The Rationale for On-Call

Many SREs have an intuitive idea of why on-call is necessary: to wit, getting a system working again, and we shall come back to that shortly. However, for a fuller understanding, it is useful to look at the role of on-call in other professions, to focus our idea of what is unique to our case. Let us look at examples from medicine.

First, Do No Harm

In a medical context, the person on-call is the person on duty, ready to respond.4 In emergency medicine, the function of the on-call doctor is in some sense to be a portable decision-maker, with appropriate expertise where possible, and the ability to summon it otherwise. The first function is to perform triage, in which the doctor figures out from the signals available whether or not the patient should be in A&E5 in the first place, and otherwise attempts to make the patient better. Not necessarily cure them; that is not the domain of emergency medicine. The goal of emergency medicine is to stabilize the situation so the patient can be moved to ward medicine, which manages cures, the treatment of slow decline, and other non-life-threatening, non-immediate situations.

Parallels with SRE

Emergency medicine is strongly interrupt-driven work, and the broad context of bringing people with expertise to fix a problem is exactly the same as with the SRE on-call context. Those are the strongest points of correspondence. Another parallel is the act of triage, which is also performed in the SRE context, although usually by software deciding that some metric has reached an unacceptable threshold and paging, rather than by manual action.6 A similar act is also performed when an SRE on-call operator decides some alert is not actually important enough to bother with. (Broadly, you could look at the overall emergency medicine challenge as how we can deliver relatively well-understood treatment plans in an efficient way while wrestling with many other simultaneous demands.)

Preparations for an on-call shift are similar, too. A demanding—potentially longer than 24-hour—work shift7 requires physiological preparation, and in on-call preparation documents I have also seen the equivalent of an SRE playbook, which is to say, a list of specific, somewhat tactical suggestions for how to respond to different kinds of known failure in human biological systems.

Note

In the case of On Call Principles and Protocols, sixth edition (Elsevier), for example, the bulk of the chapters are about specific “subsystem” failures (e.g., abdominal pain, chest pain, seizures), with a seemingly strong Pareto Principle–style assumption that 80% of cases arise from 20% of the total causes.8

Finally, when a system is restored to an everyday working state, but serious cleanup work is necessary, that can be left to a normal daytime activity—or ward work, in other words.

Differences with SRE

Strong though the parallels are, working with software is fundamentally different. In some ways, the frightening thing about SRE on-call is that because so many of the systems we look at are changing so quickly, the act of being on call for a particular system in January is not hugely relevant when it comes to being on call for it in June. This is not the case in A&E, where people’s bodies and the life-threatening traumas that befall them tend to have quite well-understood manifestations and diagnoses, wide genetic variety and environmental factors notwithstanding. Although that rate of change can fluctuate given particular team or industry circumstances, it is almost never zero, and if it is, the software environment itself is almost always changing, too. (An SRE approach enables fast change, so this is to be expected.)

A useful analogy would be that SRE on-call is like dealing with entirely new kinds of human beings every A&E shift, in which the patient presents with an additional unexplained internal organ for which someone is still writing the documentation. If medicine had this to cope with, the treatment plan would have to be derived from first principles every time. That unbounded quality—that the causes and extent of the emergency might involve arbitrarily novel situations every time9—seems to be unique to software.

Of course, despite this, most adjacent shifts, and the problems encountered therein, are quite similar. But the worst-case scenario—that almost everything could have changed since your last shift—can and does happen.

Underlying Assumptions Driving On-Call for Engineers

So, the rationale for on-call in other professions is to bring expertise and resources to a problem as quickly as possible, to resolve that problem, and (often) to prevent a similar or larger problem from developing.10 Today, we place humans in those situations because the complexity of the world remains such that a robot programmed as well as we know how to do it today could not effectively act as, for example, a medic.11

However, the highly restricted environment inside a machine or machines, although complicated under certain conditions, is not as complicated as the real world. If the argument is that the real world is complicated enough to need a human doing the on-call, the same argument applied to the datacenter is not as clearly true.

Yet we continue to put engineers12 on call for services. Why?

Let me be brutally frank, fellow operations engineers: we are sometimes put on call for bad reasons. One well-known bad reason is that it is cheaper than solving the real problem; that is, it is cheaper to pay a human to just react and fix things manually when problems happen, rather than extend the software to do so. Another bad reason is that on-call work is perceived to be awful. Therefore, product engineers who are not trained for it are very reluctant to do it, and the desire grows to pass this work off to a lower caste: operations engineers. Yet another one is the assumption that mission criticality (however critical that turns out to be), a “keep the site up” mentality, and cultivating a sense of urgency around production state all require being on call. Ultimately, however, these are dogma: firmly held, long-standing beliefs, which might or might not be useful.

I take a different approach and use the language of risk management for the purposes of outlining the valid reasons why we are on call today. This allows us to focus on matters relating to the impact on the supported systems rather than on beliefs that might or might not be true or might change over time.

Let us therefore group the reasons into the following categories:

Known-knowns
Consider a system that has known bugs: the circumstances under which they are triggered are known, their effect is known, and the remediation is known, too. Many on-call professionals today will be familiar with the sensation of being paged for something fitting this description. Of course, the obvious question is then, why are humans fixing this at all? As discussed, sometimes it is cheaper, sometimes not, but it is typically a decision made by business owners that the particular code paths that recover a system should be run partially inside the brains of their staff rather than inside the CPUs of their systems. (I suppose this is externalizing your call stack with a vengeance.) In this bucket, therefore, engineers are put on call because of cost; in reality, the problem is perfectly resolvable with software, as the sketch after this list illustrates.
Known-unknowns
Many software failures result from external action or interaction of some kind, whether change management, excessive resource usage beyond a quota limit, access control violations, or similar. In general, failures of this kind are definitely foreseeable in principle, particularly after you have some experience under your belt, even if the specific way in which (as an example) quota exhaustion comes into play is not clear in advance. For example, these problems can sometimes be related to rapid spikes in traffic or other exceptional events. Although you can't necessarily predict why in advance, it's essentially statistically guaranteed that you'll have a few really big spikes a year. Engineers are therefore put on call because the correct automation or scaling is not currently available; the problem is again perfectly resolvable.
Unknown-unknowns
Despite theoretical positions to the contrary,13 systems and software do indeed fail. Today, certain kinds of failure can be automatically recovered from or otherwise responded to without human intervention—if not self-healing, then at least non-self-destroying. But there are large classes of failures that aren't, and, what's worse, these failures typically change as a system itself changes, its dependency list grows, and so on. In the language of engineering risk management, these are unknown unknowns:14 things that you don't know that you don't know, but you do know that you don't know everything, so you can foresee their theoretical existence. Engineers are therefore put on call because the system can potentially fail in ways that could not have been foreseen, which seemingly requires the kind of context-jumping response that only a human can give; the thing that makes this hard to respond to automatically is the complexity of what could have gone wrong.
The Wisdom of Production
This is conceptually very similar to our first two reasons; the difference is mostly in intention. We choose to put engineers on call for a system to learn real things about how it behaves in real situations. We might learn unpleasant things about it or we might learn pleasant things, but we are doing it explicitly to gather information and to decide where to put our effort to improve it.
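
To make the known-knowns argument concrete, here is a minimal sketch, in Python, of what moving such a remediation out of a human's head and into software could look like. The alert names, remediation functions, and orchestration hooks are all hypothetical stand-ins, not any particular vendor's interface; the point is only that a known trigger with a known fix is a lookup table and a function call, not a page.

```python
# Minimal sketch (hypothetical names throughout): a known trigger with a known
# remediation becomes a lookup table and a function call, not a page.
import logging
import time

def restart_task(target: str) -> None:
    logging.info("restarting %s (known memory leak, fix pending)", target)
    # ...call your orchestration layer here...

def flush_cache(target: str) -> None:
    logging.info("flushing cache on %s", target)
    # ...call your cache admin endpoint here...

# Alert signature -> remediation. Everything in here is a known-known.
KNOWN_REMEDIATIONS = {
    "FrontendMemoryHigh": restart_task,
    "StaleCacheEntries": flush_cache,
}

def handle_alert(alert_name: str, target: str) -> bool:
    """Returns True if auto-remediated; False means a human genuinely is needed."""
    action = KNOWN_REMEDIATIONS.get(alert_name)
    if action is None:
        return False              # unknown territory: page someone
    action(target)
    time.sleep(30)                # allow the system a moment to recover
    return True                   # recovered: file a ticket for daytime follow-up
```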

It seems clear to me that most of the significant opposition to removing on-call as a job responsibility for SREs is because of the third category: unknown-unknowns.

On-Call Is Emergency Medicine Instead of Ward Medicine

But, actually, the only valid reason to put engineers on-call is the last one: The Wisdom of Production. The other categories are ultimately distractions.

The first category, known-knowns, involves humans executing procedures to fix things that could (by definition) be done perfectly well by machine; it merely happens to be cheaper or simpler for some set of particular humans to do it at that moment. This category is about expediency, cost control, and prioritization, not engineering. There is nothing here preventing a completely automatable approach other than money and time, which are obviously crucial things, but not a barrier in principle. Yet this category persists as a source of outages today, perhaps because of a widely held practice in the industry of treating operations as a cost center, meaning that no business owner will invest in it, because it is not seen as something that can generate revenue for the business, as opposed to simply accumulating costs.15

For known-unknown problems, the path away from manual action is generally more resources or pausing normal processing in a controlled way, with some buffer of normal operation before more-detailed remediation work is required. Indeed, the conditions in which the full concentration of an on-call engineer is legitimately needed to resolve a known-unknown problem usually involve a flaw in the higher-level system behavior. An application layer problem, such as a query of death16 or resource usage that begins to grow superlinearly with input, is again amenable to programmatic ways of keeping the system running (automatically blocking queries found to be triggering restarts, gracefully degrading to a different datacenter that the query is not hitting, etc.). The important questions, then, are how flaws in higher-level systems are introduced and whether there is a meaningful way of working around this.
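
As a concrete illustration of those programmatic ways of keeping the system running, here is a small sketch of automatic query-of-death suppression. The fingerprinting scheme, thresholds, and supervisor hooks are invented for illustration; the real point is that blaming and blocking a crashing request is mechanical work that does not need a pager.

```python
# Sketch of one programmatic mitigation for a "query of death": track which
# request fingerprints were in flight when a worker crashed, and refuse them
# after a threshold, rather than paging a human to do the same thing by hand.
from collections import Counter

CRASH_THRESHOLD = 3
_suspect_queries: Counter = Counter()
_blocked_queries: set = set()

def record_crash(in_flight_fingerprints: list[str]) -> None:
    """Called by the supervisor when a worker dies; blames the queries it was serving."""
    for fp in in_flight_fingerprints:
        _suspect_queries[fp] += 1
        if _suspect_queries[fp] >= CRASH_THRESHOLD:
            _blocked_queries.add(fp)

def admit(fingerprint: str) -> bool:
    """Admission control: reject likely queries of death with an error, not a crash."""
    return fingerprint not in _blocked_queries
```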

The situation is even more dispiriting for, as an example, system-level change-control problems: perhaps a problematic Access Control List (ACL) prevents access to a critical dependency; or maybe a runtime flag changes startup servers to a set that no longer exists or are vastly slower; or possibly a GRANT command accidentally removes access to the systems for the entity performing the GRANT. Although these might seem outside of the domain of automated response, canarying17 is the actual solution: it allows us to pilot a variety of difficult-to-reason-about changes and observe the effects in a systematic way, and it does not require a fundamental rewrite in core systems, merely the ability to partition activity. Yet, instead, we typically pay for humans to make a change and watch a process; perhaps reasoning that if we must have someone around to handle unknown-unknowns, we might as well have them do the other ones, too.
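
Canarying itself is also mechanical. The following sketch assumes hypothetical apply, rollback, and error-rate hooks; it simply applies a risky change to a small partition of the fleet first, compares against the untouched partitions, and rolls back automatically on regression.

```python
# Sketch of a canaried change, assuming apply_fn/rollback_fn/error_rate hooks exist.
import time

def canary_change(change, partitions, apply_fn, rollback_fn, error_rate,
                  canary_fraction=0.05, tolerance=1.5, soak_seconds=600):
    n_canary = max(1, int(len(partitions) * canary_fraction))
    canary, control = partitions[:n_canary], partitions[n_canary:]
    if not control:
        raise ValueError("need a control population to compare against")

    for p in canary:
        apply_fn(change, p)
    time.sleep(soak_seconds)                  # observe the canaries for a while

    baseline = sum(error_rate(p) for p in control) / len(control)
    observed = sum(error_rate(p) for p in canary) / len(canary)
    if observed > baseline * tolerance:
        for p in canary:
            rollback_fn(change, p)            # bad change caught at ~5% blast radius
        return False
    for p in control:
        apply_fn(change, p)                   # promote to the rest of the fleet
    return True
```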

So, it really does come down to the problem of unknown-unknowns: what are the lurking unanticipatable problems present in the system that prevent us from turning an emergency situation into plain-old ward medicine? Surely, runs the argument, we can’t know these in advance, and therefore we need a human around in order to be able to observe the system in its entirety and take the correct response?

The situation is more nuanced than that, however. Not all problems that could result in outages happen to a system: only some of them will. Not all of the lurking difficulties can be hypothesized in advance, but that's actually OK because we don't need to have a solution for every possible problem in advance. Instead, we need to translate problematic states of the system (requiring emergency medicine) into ones requiring business hours intervention (ward medicine). There is a very big difference between solving the general case of all unanticipatable problems and the specific case of constructing software to be much more resilient to unexpected problems within its own domain. Perhaps a little like the halting problem, solving the general case is certainly intractable because the program state you would have to model in order to predict halting is too large; but predicting when a simple FOR loop will halt is trivial. It is absolutely true that there have been many sizeable incidents in which extremely delicate effects have played a part in major outages. However, most people do not ask themselves why those delicate effects emerged in the first place, and this is partially because the state of the industry does not seek to build reliable systems out of well-known building blocks that fail in particularly well-understood ways; instead, unfortunately, the state of the art today is that most successful application layers, whether for startups or for huge multinationals, are reinvented time and time again from the ground up.18

This leads to a situation in which a number of building blocks are indeed composed together, typically in some kind of microservice architecture, but because each organization does it from first principles every time, there is no meaningful industry-wide cooperation on a single stack for serving, data processing, and so on that could produce resilient and well-tested software units. It’s as if the construction industry derived bricks from first principles every time it built a house, and a row of houses would only share bricks because the team members happened to sit next to each other at lunch.

To put it another way, part of the reason we see unknown-unknowns having bad effects on systems in the first place is that the fine detail of how systems and code interact is poorly understood. And the reason it is poorly understood is not because people are stupid or software is hard (neither of which is necessarily true) but because each team needs to understand things from the beginning. If we had a consolidated set of components that behaved in well-understood ways, we could offset these risks significantly, perhaps completely, depending on the domain. Or, putting it in the language of solutions rather than problems, we need a safe cloud stack—or at least cloud components that behave in safe ways.

Another way to think about this problem is to think about the set of postmortems you've assembled for your service over the years. When you look at the set of root causes and contributing factors over a long enough period, you can ask yourself two questions: what proportion of those outages were genuinely unforeseeable in advance, and what proportion of them would have been remediated if fairly simple protections had been put more consistently in place? My experience suggests that, as per the earlier analysis of the ER, 80% of outages are caused by 20% of root causes; the rest is in the unknown-unknowns bucket. We can attack those in turn by building more resilient systems that fail safely, allow for canarying, and separate application logic from a solid systems layer.
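
If you want to run that exercise over your own postmortems, it is a few lines of analysis. The record format here is invented for illustration; the question it answers is simply what fraction of your outages the most common root-cause categories explain.

```python
# A small sketch of the exercise described above: tally root-cause labels
# across a postmortem corpus and see how much pain the top few causes explain.
from collections import Counter

def top_cause_share(postmortems: list[dict], top_n: int = 5) -> float:
    """Fraction of outages explained by the top_n root-cause categories."""
    counts = Counter(pm["root_cause_category"] for pm in postmortems)
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total if total else 0.0

# e.g. top_cause_share(pms) returning ~0.8 would match the 80/20 pattern above.
```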

Counterarguments

An important counterargument is that the previous observations are all very well, but here in the current universe, with bespoke software aplenty and limited budgets, there is no prospect of avoiding unknown-unknowns, and so we are still locked into on-call for the indefinite future.

This might well be true, in the sense of there being lots of software and not enough money, but almost every piece of software everywhere does go through a rewrite cycle at some point. The prospect of adopting key reliability frameworks for cloud consumers, particularly if they are easy to use and cover very common cases (HTTP servers, storage, etc.), is not as far removed as you might think. It will take time, certainly, but it is not impossible.

Another counterargument is that this will prevent on-call engineers from understanding and effectively troubleshooting problems when they do arise; therefore, we should continue to do on-call across the industry. Well, as outlined earlier, it is true that one valid reason for doing on-call is the wisdom of production, and electing to do so is perfectly fine. However, if the objection is to the idea of an industry population of SREs that is increasingly feeble at on-call, the idea of this chapter is to move away from on-call as emergency medicine, and toward on-call as ward medicine, where observation over time and support from colleagues is available. I do not think we will remove every failure from every piece of software ever, just that it is certainly possible to remove enough of them that we don't need to suffer the costs of being on call that we do today.

Some commentary states19 that machine learning might be an effective substitute for human on-call. Permit me, for a moment, a more skeptical view. Although I would gladly welcome a piece of software able to cope with anything, in my opinion an arbitrarily complicated unknown-unknowns failure situation would require the machine learning software to fully understand the stacks it is working with, which is not possible today and might never be possible.

Finally, it could be pointed out that building blocks exist today and are being used by engineers everywhere for specific tasks (Kafka, Spark, Redis, etc.), and yet we still have this industry-wide problem. Of course, the adoption of one framework or toolset might help with one class of problem, but there is nothing today that matches the description of a hardened cloud stack with known good choices for each functional element. As many of the use cases as possible must be covered, or too much is left to chance.

The Cost to Humans of Doing On-Call

A more pointed counterargument to SREs performing on-call is perhaps to be found in human factors analysis, cognitive psychology, and the general effect of stress on the human being.

In general, humans perform quite poorly20 in stressful situations, which is overwhelmingly what on-call is. Not only that, it is extremely costly to the individuals involved. I will cover this in more detail shortly.

But the larger picture is that, yes, this matters because of the human toll, but also because it undermines the argument that there is no effective substitute for human on-call. This in turn is because the industry, outside of operations engineers themselves, has a very incomplete understanding of the costs of doing on-call. Furthermore, because of the rationale underpinning the “necessity” for human on-call—to wit, unknown-unknowns—business owners believe that whatever the cost is, they are committed to paying it because they see no alternative. Precisely because people assume that there is no alternative to human on-call, it is rare to see these costs fully outlined, except at operations conferences and in one-to-one conversations. That is at the very least a pity, because if we do not know precisely what we are paying, we cannot know if it is worth it.

Let us leave the exact definition of “stressful” to one side for a moment, and for the purposes of this paragraph presume the typical properties of SRE on-call fulfill that definition—to wit, out-of-hours requirements to work; potentially large or unbounded financial/reputational impact to the organization as a whole, depending on your individual performance; sleep disrupted, truncated, or in extreme circumstances, impossible for the duration of the incident; and, if in an out-of-hours context, potential difficulty obtaining help from one’s colleagues.

Humans are not in fact natural smooth performers in stressful situations. Studies of human error, specifically of stress in the context of on-call, are rare; there are many precedents in similar domains, however, including programming itself, chess, and industrial settings, such as nuclear power plant meltdowns. Ultimately, of course, most of these are in some way inaccurate proxies for real-world performance, but they remain, for the moment, the best that we have. The more generic studies on human error seem to arrive at a background rate of between 0.5% and 10% for various “trivial” activities, including typing, reading a graph, and writing in exams.21 Similarly, A Guide to Practical Human Reliability Assessment (Kirwan) shows a table that lists “Stressful complicated non-routine task” as having an error rate of about 30%. Dr. David J. Smith's Reliability, Maintainability, and Risk states a similar rate, 25%, for complicated tasks, and somewhat depressingly, a 50% error rate for “trivial” things such as noticing that valves are in the wrong position.22 In a paper on the subject,23 Microsoft shows that middle-rank chess players double their chance of a serious blunder as they move from 10 seconds to 0 seconds left on their clock, and background error rates for programming—in the absence of any particular stressor—range between less than 1% and more than 13% in this comparison table.

Any way you look at it, it is clear that to err is very definitely human.

Another potentially large effect on on-call performance is cognitive bias, which (if you accept the overall psychological framework) strongly implies that human beings make errors in stressful situations in very systematic ways. A write-up goes into this in more detail here, but suffice it to say that if you read Thinking, Fast and Slow and wonder if there's any evidence SREs are affected by cognitive kinks of some kind, there is indeed quite a lot of evidence to support it; for example, anchoring effects in the context of time-limited graph interpretation, closely matching the awkward constraints of on-call.

But most of what we’ve been discussing merely (perhaps loosely) supports what you more or less knew already about the human condition, being human yourself. There is also a subtler effect, which is that the fear of on-call is often enough by itself to radically change people’s behavior. Entire development teams reject outright the notion of going on call, because of the impact on their personal lives, family, and in-hours effectiveness. Many such teams are perfectly happy for operations teams without the authority to actually fix problems to keep things ticking over “during the night shift” as best they can—as long as those developer teams don’t have to do on-call themselves.

Diversity and inclusion in SRE also suffer, as caregivers—of whatever kind, parental or otherwise—deliberately opt out of a situation that promises to place them in direct and certain conflict with their other responsibilities.

When we turn our attention to the actual effects of serious on-call work on human beings, it makes for similarly, if not more, sobering reading. There is evidence to suggest that even the possibility of being called increases the need for recovery in on-call workers24; sleep deprivation has a long list of negative effects and there is serious evidence suggesting that it shortens life expectancy25; and finally, this survey of papers examining the effects of on-call provides strong evidence that on-call work can and does have a variety of negative effects on physical health (e.g., gastrointestinal and reproductive) and mental health (e.g., anxiety and depression). There's simply no meaningful upside for the practitioners.

All of this is not even to mention the experience of on-call, which is often dispiriting. Not every team looks after their monitoring and alerting as assiduously as they should, so an on-call engineer can be assailed on their shift with noisy alerts, alerts that are mainly or wholly unactionable, monitoring that doesn't catch actual outages that actual users do catch, poor or nonexistent documentation, blame-laden reactions when things inevitably go wrong, a complete lack of training for doing it, and, worst of all, a systemic lack of follow-up to any of the issues discovered, meaning that one braces oneself each shift for the structural unsolved problems that paged one the last shift. And in some companies, you do all of this without either extra financial compensation or time-off-in-lieu.

Clearly, as a species, we don’t like doing on-call, we’re not terribly good at it, it’s actively harmful for us, and it can often be one of the most unpleasant experiences we have.

Given all that, I ask again: why aren’t we talking about meaningful alternatives?

We don’t need another hero

Perhaps part of the reason is us.

The mission SRE has—the protection of products in production—mixes well with that cohort of people motivated to “step up” and work hard at resolving production incidents. But throwing oneself relentlessly against a production incident, although in some ways admirable, and in another sense what we are paid for, has at least as many drawbacks as it does merits; in particular, the negative consequences of heroism.

The bad consequences of heroism are simultaneously subtle and coarse. Many people like the approval of their peers, and also like the satisfaction of knowing that their work has had a direct (even positive) effect on their team, the system they support, their company, and so on. So, there is a direct psychological link between stepping in to be The Hero fixing the problem and the approval one can get from fellow team members, the product development team, one’s manager, and so on. Both explicit and implicit incentives to repeat hero-like behavior can evolve. Perhaps company management comes to expect it; you kept the system going last time in this way, why aren’t you doing it this time? Worse than management in some ways, perhaps your peers come to expect it, particularly in positively oriented cultures like Google where peers can award each other small bonuses at the click of a button.

But worse than that in turn, when a heroine steps up and fills a particular role, often out of hours or in demanding circumstances, this means they’re not doing work they’d otherwise been scheduled to do. Therefore, a replacement heroine is required to do the things that wouldn’t otherwise get done, and another in turn, and so on. Granted, a team often must make do with what it can, but a long-running production incident has definite physiological effects and someone has to pick up the slack. If this can’t be the directly affected team member, it’s someone else from the team, and so on. Not to mention the bad effect of modeling hero culture26 to the rest of the team and seeing it being rewarded.

Finally, it is worth noting that there can be a dichotomy in how we see ourselves versus how our value is perceived by others. As we just discussed, the profession takes on-call very seriously and tries to be good at it; yet, it is very, very rare to be promoted as a function of on-call performance. In 11 years at Google, I never saw it. It was difficult to be promoted if you were bad at on-call, but it was impossible if you were good at on-call and bad at other things. Alice Goldfuss talks about this in her 2017 Monitorama talk. The peculiarity of being rewarded in small ways for heroic behavior and yet being denied larger rewards as a consequence of that same behavior is unsettling.

Yet, despite all of this, it is still seen in many quarters as a sign of weakness when someone dislikes on-call. Perhaps this is connected to our unwillingness to even think about alternatives to on-call, other than leaving the profession for a role without it.

Actual Solutions

Summarizing everything up to this point, the chain of argument is as follows.

If we accept the preceding facts—that on-call is used for many things, but primarily to cover the availability gap during software failures provoked by unknown-unknowns, and that humans are not in fact very good at it and it is bad for them to do it—it seems natural to ask if there's anything else we can do.

Broadly speaking, we can try to make the existing situation better, or we can try to do something fundamentally new.

For making the existing situation better, let us subdivide this into training, prioritization, accommodations, and improving on-the-job performance.

Training

Training is one of the worst problems, which is surprising given that it is also, in theory, one of the easiest to solve. Part of the reason for this is the vexed question of when in a career we might expect such training to take place. For example, there are, as far as I'm aware, no academic-level treatments of on-call in the course of any computer science degree anywhere.27 Do please correct me if I'm wrong. So, people new to the sector often come to the on-call portion of a job completely unequipped to either support, critique, or modify the on-call situation they find when they get there. It is easy and very understandable to chafe at that responsibility and decide that the right thing to do is to (effectively) have others suffer. This, of course, sets up a lot of the bad relationships and contributes to incentive structures that end up making it worse for everyone. The good news is that there are multiple venues where best practices are discussed, so even if there is a dearth of publicly reshareable material—perhaps Site Reliability Engineering's chapters on on-call and troubleshooting are a reasonable place to start, although PagerDuty's training materials perhaps assume less about what the reader knows—it's still possible to improve. In any event, there are useful, widely available supports on how to do so.

Prioritization

But the larger part of improving is actually wanting to improve. That is firmly in the domain of culture, as Cindy Sridharan controversially but not unfairly pointed out.28 In this context, a better culture would involve teaching product developers that actually, the way to have a better system while on call is not to sternly resist any attempts to be put on call, but instead to prioritize the fixes and engineering effort required to make it better when one does go on call. A piece of software that is improved in its operational characteristics is a universally better piece of software.

Note

For some nonexhaustive arguments why, see the SRE book’s chapter on automation.

However, if a dichotomy emerges where product developers are the group most capable of improving a system’s behavior, and yet are the most insulated from when the behavior is bad, nothing good will result from this broken feedback loop. The business might happen to be successful, but it will always be paying a cost for decoupling that loop—whether it is in resource costs, staffing attrition costs, or agility costs—and fixing that eternal cost is, in a way, part of what SRE promises. A similar argument applies to postmortem follow-up actions.

Accommodations

Making accommodations for folks who are on call is also a key remediation. If more accommodations were made, on-call would inspire less fear, and more people would be able to do it. If more people were able to do it, the work would be spread across more people, which means more progress could be made. If more progress could be made, some time could be spent on operational clean-up work, and on-call would inspire less fear. A virtuous circle.

Accommodations include but are not limited to: compensation for on-call work, particularly out of hours; reasonably flexible schedules so caregivers and others can move on-call work around (depending on how onerous the shift is, it can make even running the simplest of errands very hard); support for recovery and follow-up afterward; and mechanisms for those who are literally unable to do it to be excluded without backlash.

Compensation

Many people still work in roles for which there is no on-call compensation, or certainly no formal scheme. It is immediately cheaper to run a company without one, but doing so is very much short-sighted. It is neither a good use of people's lives nor morally correct. It's also not even good for the business, despite short-term savings. Instead, we should have a well-articulated industry-wide model, or series of models, so that each organization can pick the one that's best for it and the engineers working for it can pick and choose accordingly. A huge advantage of acknowledging the domain of human obligation in on-call (by providing supports for it) would be increasing the pool of people who elect to do it, with consequent benefit for team diversity, team members' lives, and so on.

Flexible schedules

To enable flexible schedules, you should address the sources of inflexibility. Management might need to be convinced that performance will improve and hiring will be easier with more flexible schedules. Hours of coverage can vary, depending on when outages are statistically most likely to happen. A formal Service-Level Agreement (SLA) helps immensely: not only does it enable Service-Level Objective (SLO)–based alerting, which often lowers raw paging numbers, but even the act of negotiating one where there was none before can help business owners to figure out that they don't actually need to try (fruitlessly) for 100% availability. This in turn enables more flexible schedules.
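
For the SLO-based alerting mentioned above, the usual mechanism is a burn-rate alert: page only when the error budget is being consumed fast enough to matter. A minimal sketch follows; the numbers are illustrative, and the multiwindow policies described in The Site Reliability Workbook are more refined than this.

```python
# Minimal sketch of SLO burn-rate alerting, which tends to page far less often
# than per-metric threshold alerts. Thresholds here are illustrative only.
def burn_rate(errors: int, requests: int, slo_target: float = 0.999) -> float:
    """How fast the error budget is being consumed relative to plan (1.0 = on track)."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target           # e.g. 0.1% of requests may fail
    observed_error_ratio = errors / requests
    return observed_error_ratio / error_budget

def should_page(errors: int, requests: int, threshold: float = 14.4) -> bool:
    # A 14.4x burn sustained for an hour consumes ~2% of a 30-day budget:
    # arguably worth waking someone; anything slower can wait for business hours.
    return burn_rate(errors, requests) >= threshold
```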

Recovery

Most on-call compensation schemes that I've seen focus on enabling compensation or time-off-in-lieu, but this does not necessarily happen immediately after a tiring shift. Companies could move to offering the morning off after a night shift during which activity passed a certain threshold, outside the terms of any time-off-in-lieu arrangement that might operate.

Exclusion backlash

Making sure people feel safe on your team—no matter what their attitude to on-call is or capability to do it—is mostly part of successful line management rather than company policy. But companies could signal in advance that they adopt an “opt out of on-call without retaliation” policy, which would again help to attract more diversity in their workforce.

Improving On-the-Job Performance

On-the-job performance is in many ways the least important component of attacking the problem of on-call. For a start, it doesn't actually advance the agenda of getting rid of it. Additionally, bad but passable on-call performance is unlikely to get you fired. If it did, an organization might well end up removing more people from an on-call rotation than it could spare, resulting in an ever-increasing spiral until the subset of people left behind burn out and quit.

Instead, typically speaking, what happens is that only the people who make egregious mistakes are let go, and what most management cares about is a good faith effort to solve a problem, not consistently low Mean Time to Recovery (MTTR) metrics in a Taylorist-style29 scheme. Therefore, after the on-call person shows up and makes a good effort, there’s usually little enough incentive within the system itself to get better at it beyond not wanting to write yet another postmortem this week.

Having said that, there are many techniques one can use to improve.

Cognitive hacks

For a start, there are many cognitive “hacks” that can improve performance in on-call situations; for example, frowning can help us to feel less powerful, which in turn makes us less likely to “anchor” on specific scenarios to the exclusion of all else; reading John Allspaw’s materials on blameless postmortems can also help to shake out assumptions about what is actually at fault when things go wrong in distributed systems, and helps to increase self-reflection, which is a crucial component of avoiding pure reactivity. As discussed earlier, reacting is a serious impediment to finding out complex truths, and there are no other types of truth in distributed systems.

Another potential hack is doing pair on-call; it has the very useful effect that when you need to explain yourself to someone else, having to articulate your idea often helps you figure out if you're wrong, just by saying it. This is different from active/inactive primary/secondary shifts because it implies that the people are working closely together interactively during the shift. (This is most tractable during business hours and is hard to organize with small teams.) Aggressive hypothesizing, which is the technique of tossing out idea after idea about what's going wrong, all as different as possible, can also be useful. This is particularly so when a number of things go wrong at the same time, due to the fact that increased stress narrows the mind. It also helps to correct for cognitive bias problems like anchoring. Finally, good discipline, such as always maintaining hand-off documents and following a well-drilled incident management procedure, is necessary for coping effectively with rapidly moving incidents; the “guide rails” provided by extensive drilling on a procedure help you to react correctly in uncertain situations.

We Need a Fundamental Change in Approach

As useful as all of these practical recommendations are, there is something fundamentally unsatisfying with always chasing something you’ll never catch up with. A more interesting question, then, is what different things can we do to fundamentally change this situation?

Of course, this depends on how you characterize the problem. Recall the central conflict at the heart of software systems: we know that they are deterministic,30 yet they continually surprise us, in the worst of ways, in how their complexity and their behaviors interact to produce outages.

But how does that unreliability arise? A survey of Google outages in The Site Reliability Workbook suggests that binary and configuration pushes together constitute almost 70% of factors leading to an outage, with software itself and a failure in the development process turning up in almost 62% of root causes, and “complex system behaviors” being in only around 17% of postmortems. Remember these numbers, or at least the fact that analyzed outages have a structure to their causes.

At this point, let me introduce two positions that I refer to in the rest of this chapter: Strong-Anti-On-Call (SAOC) and Weak-Anti-On-Call (WAOC).

Strong-Anti-On-Call

Strong-Anti-On-Call (SAOC) runs as follows.

Software systems are deterministic. We have two possible approaches to attacking the problem of outages. We can either remove the sources of the outages, or we can prevent outages from having a catastrophic effect. Because software systems are deterministic, if we remove all sources of outage, the system will not fail. (SAOC does not believe that doing both of these things is useful.)

Let us take a moment to recapitulate what the sources of unreliability are when constructing a software system. We can make a simple programming error, a typo, or equivalent. We can make a design error, constructing something that is guaranteed to go wrong. We can insulate ourselves from the environment incorrectly (libraries, dependencies, or simply parsing data incorrectly). We can treat remote dependencies incorrectly—for example, behaving as if they will always be reachable or will always return correct data.

As discussed earlier, the maddening thing about outages is that so many of them are entirely avoidable. The SAOC position involves preventing each of the identified sources of error in the list of categories in “Underlying Assumptions Driving On-Call for Engineers” from manifesting in your system. Although there is too much to address in detail here, the good news is that many of the difficulties of change management are already well understood, and what we are trying to do is more successfully implement something that is already relatively well understood.

The bad news is that, if we keep getting it wrong, there might be a reason for that.

However, to SAOC, this does not matter. We'll get it right eventually. But we do have one hurdle to overcome first: the causes of outages that result from larger interactions, rather than from failures within the system itself or from simple “one-step” interactions, such as ingesting bad data.

So, we need to take the complexity out of our systems. Practically speaking, there are really only two known ways of doing that: building simple subcomponents with verified, known behavior, composed in deterministic ways; and running systems for a very long time to see where we were mistaken about their stability and then fixing those problems. Today, as an industry, we make it too easy to write unreliable software, by not consistently taking advantage of those two techniques and by indulging the fact that software engineers still find it easier to write code than to read it, which is what creates unreliable code.

Instead, I argue we need to change the basic layer at which we build software. Today that is POSIX libc, win32, or the logical equivalent; in the future, particularly the distributed future, it must be at a higher level, and with more cross-cloud (or at least, cross-platform) features. To write a server, a product developer should pluck a well-known class off the shelf, with good monitoring, logging, crisis load handling, graceful degradation, and failure characteristics for free and enabled by default. It should be actively difficult to write something bad. Innovation long ago moved out of the platform layer, yet we pay so much for the right to rewrite from the ground up, and gain so little in return, that it is difficult to understand why we tolerate it.
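
To make the "well-known class off the shelf" idea concrete, here is a sketch of the kind of interface I have in mind. ReliableServer and its options are hypothetical, not an existing library; the point is which properties are on by default and how little the product developer has to write.

```python
# Sketch of a hypothetical safe-by-default server class: reliability machinery
# is enabled by default, and the product developer supplies only business logic.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ReliableServer:
    handler: Callable[[dict], dict]            # the only thing a product team writes
    port: int = 8080
    export_metrics: bool = True                # monitoring on by default
    structured_logging: bool = True            # logging on by default
    max_inflight_requests: int = 1000          # load shedding on by default
    degrade_mode_handler: Callable[[dict], dict] = field(
        default=lambda req: {"status": 503, "body": "degraded"})  # graceful degradation

    def serve(self) -> None:
        # In a real framework this would wire the handler into an HTTP stack,
        # health checks, canary hooks, and safe-shutdown logic; omitted here.
        raise NotImplementedError("illustrative interface only")
```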

Simultaneously, we need better ways to insulate bad application layer logic from the rest of the platform; today, across the industry, such insulation is essentially infinitely gradated across the spectrum—from none at all, which is unfortunately common, to completely isolated, which is unfortunately uncommon. Again, this speaks to a preexisting toolkit that is easier to use than not.

Thus, SAOC finds itself arguing for a methodical approach to eliminating sources of error and a toolkit approach, not because of resilience, but because of eliminating complexities of interaction. There are a number of weaknesses to this position, but one glaringly obvious one is that the methodical elimination is only ever going to happen after a software system is deployed; ideally, we would be reusing only pieces of software to which this elimination had already been applied. That argues for a toolkit approach again, except one in which the software components have already been “ground down” to deterministic subcomponents.

Weak-Anti-On-Call

If you have come this far but do not agree that the strong case is either practical or worth striving for, allow me to convince you of a slightly weaker but still useful position.

In this view of the world, you might indeed believe that software is deterministic, but you are convinced that we will never be able to react successfully and programmatically to unknown-unknowns and that interaction between arbitrarily complicated systems will always produce failure of some kind. We can still make progress on eliminating on-call, except we need to look at it in a different way.

Unlike with SAOC, we are not trying to eliminate emergency medicine by replacing it with ward medicine. Instead, we are trying to implement driverless trains: systems for which we know we cannot control the environment completely, but we more or less have one useful reaction (stop the individual train, if not the entire system), and the question is to what extent we can stop the train and not imperil the system as a whole until the human driver arrives.

This approach therefore relies not on attempting to eliminate the need for human intervention, but on delaying it until some notion of “business hours” arrives, or otherwise removing it from the domain of emergency medicine.

In this view of the world, the most important thing to do is not to prevent outages, but to insulate the system against failure better. The strange thing is that this position actually looks quite similar to the previous one: we need the same thing, reusable standardized toolkit software, except instead of optimizing for complexity reduction, we optimize for the equivalent of driverless train failure: we automatically stop in the safest way possible, depending on what we can learn about the circumstances. It is interesting to consider how much thought a typical engineer puts into having a software system fail safely as opposed to making forward progress successfully. Failure cases are often disregarded in the hazy optimism of a favored editor, and there is every reason to suppose that we would be more successful at engineering resilience if this was more of a focus or, more important, if more guide rails were provided to do so.

The details of safe failure would vary so much from domain to domain it is not efficient to talk about them here, but the key principle is again that the work performed should be amenable to partitioning such that components of the system that encounter fatal errors can be safely removed, with no significant impact on system capacity, or at least a limited one; enough to give the human operator a little leeway to deal with it in the office. Then it is a matter of scaling the system such that the failure domains meet this goal, and then you can avoid on-call, safe in the knowledge that you might be drastically over-dimensioned in hardware, but at least you won’t be paying attrition-related staff costs.
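
A sketch of that partition-and-drain reaction follows. The capacity bookkeeping and hooks are invented for illustration; the decision itself (drain the failing partition only if the healthy remainder still covers expected load with headroom, otherwise escalate) is simple enough to automate.

```python
# Sketch of the "stop the train" reaction: drain a failing partition only when
# the remaining healthy partitions still cover expected load plus headroom;
# otherwise keep it serving and escalate. Numbers and fields are illustrative.
def maybe_drain(partitions: dict[str, dict], failing: str,
                expected_load: float, headroom: float = 1.2) -> str:
    healthy_capacity = sum(
        p["capacity"] for name, p in partitions.items()
        if name != failing and p["healthy"]
    )
    if healthy_capacity >= expected_load * headroom:
        partitions[failing]["healthy"] = False   # drain: stop sending it traffic
        return "drained; follow up during business hours"
    return "kept serving; insufficient spare capacity, page a human"
```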

A Union of the Two

Yes, there is nothing to stop us combining the viewpoints of both approaches. In fact, even though the theoretical positions are quite different, you have already seen that both of them call for the same remedies: in particular, standardized toolkits for the construction of software, albeit for different reasons.

It seems clear to me, as a result, that the industry needs to put effort into making very reliable subcomponents, out of which most services could be composed; they might even be proved formally correct,31 harking back to the cloud stack conversation outlined earlier. Yet I agree that it feels a little unrealistic to suggest this in a world where a startup can become wildly successful with some hacked-together Ruby that just happens to be a good product-market fit for a particular use case: today, what drives usage is the product-market fit, not the operability.

Furthermore, as long as a huge multinational corporation finds it cheaper to hire operations engineers to take pagers and reboot systems than to fix real problems with their software operability, there cannot be any real change. For real change, we must fix the problem at its point of origin: we must change how easy it is to write unreliable software. We must make it, if not impossible, then actively hard to write poorly operable software. Anything else will not lead to a fundamental change.

The benefits of standardization on a cloud stack also play to the assumption of the SAOC position: it is way easier to methodically eliminate sources of error if you’re not rediscovering them separately inside your company each time. A single attack surface, inspected by many eyeballs, will surely converge more quickly than multiple attack surfaces inspected by fewer.

Conclusion

To get rid of the scourge of on-call will require an industry-wide effort, given that what we are proposing is collaboration on toolkits explicitly designed to be rewritten as little as possible, yet used by as many people as possible.

But if we did do this, the benefits would be incalculable: an industry-wide set of architectures so stable they could be taught in schools; a consistent approach to layering business logic; a more welcoming environment for caregivers and minorities; a set of methodically applied best practices and consistent data processing across companies, never mind within teams in the same company, never mind within individuals in the same team!

It might sound impossible, but in essence it is a task of convergence. As an industry and indeed as a society, we have in the past converged on odder things: VHS, the x86 instruction set, and English come to mind, but there are surely more. Now the task is to push for it, because it would be of benefit to all, even if it is only us right now who see what the future could be.

Contributor Bio

Niall Richard Murphy has been working in internet infrastructure for over 20 years, and is currently Director of Software Engineering for Azure Production Infrastructure Engineering in Microsoft’s Dublin office. He is a company founder, an author, a photographer, and holds degrees in computer science and mathematics and poetry studies. He is the instigator, coauthor, and editor of Site Reliability Engineering and The Site Reliability Workbook.

1 See, for example, “How on-call and irregular scheduling harm the American workforce” from The Conversation, or “Why You Should End On-Call Scheduling and What to Do Instead” from When I Work, outlining the impact on income and family; the costs to the systems themselves are hard to estimate, but “friendly fire” in on-call situations is estimated to occur in over 1% of on-call shifts.

2 Bletchley Park and its complement of WRNS (Women’s Royal Naval Service) on-call operators.

3 According to, for example, Tom’s Hardware, around the Bletchley Park era, “in a large system, [a vacuum tube] failed every hour or so.”

4 See, for example, this MedicineNet article or this free medical dictionary, making specific reference to being reachable in 30 minutes of being paged.

5 “Accident & Emergency” in the UK/Ireland; Emergency Room (ER) in the US.

6 Note that doctors get a lot of automatic alerting as well, it’s just that it seems that a lot of it is very low quality; see, for example, this Washington Post article.

7 See, for example, this article from Medical Protection Ireland, emphasizing not eating junk food, paying bills in advance of a week of night-shift work, and double-checking calculations made during night shifts.

8 For example, this article claims that 5% of their ER admissions gave rise to 22% of their costs; this piece argues more broadly that Pareto Principle–style effects are distributed throughout medicine; and this article showed that adverse drug effects obeyed a Pareto Principle–like distribution across a sample of 700-plus cases.

9 As best I can tell, this situation is unique to software: industries that deal with very complex hardware, such as airplanes, do have problems related to complexity, and uncover latent problems with particular revisions of sensors, and so on, but the nature of software being changed all the time is found, as far as I know, nowhere else.

10 On-call in the medical profession also serves as a triage function, which is partially outsourced to monitoring software in the SRE case.

11 Leaving aside the considerable problems with persuading the public, this would be a good idea.

12 This applies to operations engineers generally, and sometimes to product software engineers.

13 For the purposes of this footnote, I want to attack the notion that failure is unavoidable and that everything in computing is wobbly stacks built on soggy marshes of unpredictability. This is not the case. There are large classes of software systems that run without issue for years. In fact, we might propose (say) a Murphy’s First Law of Production: a system operating stably in production continues to operate stably in production unless acted on by an external force. This is true for many embedded systems, disconnected from the internet. It is not largely true for non-embedded, internet-connected systems. Complexity is surely part of the answer why, but there is something we are missing.

14 See this definition of the phrase.

15 See, for example, this wonderful article.

16 See, for example, this definition.

17 Note that I am being deliberately careful with my wording here. Canarying has limits: for example, you shouldn’t canary in a nontransactional environment in which each operation is “important” (as opposed to a simple retryable web query); an operation will irrevocably alter state (perhaps with some monetary value attached) and yet can’t be rolled back. You also can’t canary in an environment in which fractional traffic isn’t routable safely to a subset of processors. Running a canary fleet is more expensive, too, unless you’re cannibalizing your production serving capacity, which is problematic on its own, and, finally, your cloud/colo/etc. provider might not provide easy hooks for canarying. But I (carefully) don’t say, “canarying can solve everything.” I say, “It’s hard to think of a situation where you can’t canary to good effect.” Those are different statements, and I expect that canarying could solve a very wide array of problems that it doesn’t solve today, mostly because we put people on call rather than go to the time and expense of setting up a canarying infrastructure.

18 In this context, “ground” often equals POSIX libc, although it shouldn’t.

19 See, for example, this devopsdays talk by Hannah Foxwell.

20 Poorly in comparison to what, is a valid question: here, I mean compared with a notional resolution that involves no mistakes or blind alleys, but does involve the usual delays in detection, starting analysis, and so on.

21 See, for example, Ray Panko’s site for a comparison table.

22 More shocking and yet also completely believable is a line item stating that failing to act correctly within 1 minute of an emergency situation developing has a probability of 90%; see this article for more.

23 Ashton Anderson, Jon Kleinberg, Sendhil Mullainathan, “Assessing Human Error Against a Benchmark of Perfection”.

24 See this paper on on-call fatigue.

25 Mark O’Connell, “Why We Sleep by Matthew Walker review – how more sleep can save your life”.

26 See, for example, Emily Gorcenski’s talk at SRECon Europe 2017.

27 In contrast with medicine, as discussed earlier.

28 Cindy Sridharan, “On-call doesn’t have to suck”, Medium.com.

29 Frederick Taylor was a scientific management theorist who introduced the all-too-successful idea of dehumanizing people in work situations by closely managing metrics.

30 So they keep saying, anyway.

31 This is not as ridiculous as you might think; for example, AWS uses formal methods.
