Chapter 23. SRE Antipatterns

Human brains are built and trained for threat avoidance. We might be terrible at weighing relative risk,1 but we’re excellent at picking out the one thing in a pile of other things that looks like a failure mode that we’ve seen before.2

Let’s face it. Failure is fun!3 And failure makes a good story. So, it can often be both easier and more effective to catalog the things you shouldn’t do rather than just the things you should.

But “antipatterns” are not your average “this one time, at Foo camp” Tale of Fail. They’re the things we’ve seen go horribly wrong not once, not twice, but over and over again. Antipatterns are attractive fallacies. Strategies that succeed for a little less time than you will find you needed them to. Common sense that turns out to be more common than sensible.

Throughout the rest of this book, you’ll find examples of things that you should do. That’s not what I’m about here in this chapter. Think of this section as your “Defense Against the ‘D’oh!’ Arts” glossary. Or just sit back and enjoy imagining all the stuff I and a host of colleagues past and present had to screw up in order to get to the point where I could share this short list with you. SREs are not perfect. Some of these mistakes I’ve even made more than once myself. That’s why they’re antipatterns.

Antipattern 1: Site Reliability Operations

A new mission cannot always be achieved with old tools and methods.

Site Reliability Operations: The practice of rebranding your operations as Site Reliability without fundamentally changing their approach to problems and the nature of the work they are expected and empowered to accomplish.

 

Site Reliability Operations is not a thing. Site reliability is a software, network, and systems engineering discipline. You cannot take a bunch of technicians sitting in a Network Operations Center (NOC), give them a GitHub account and a public cloud budget, tell them to move some stuff to containers, and magically rebrand them as SREs.4

The NOC is an outgrowth of several outmoded ideas. The first is that there are specific people whose job is to keep the systems that have already been built running at all costs. SREs don’t do this. SREs build systems to require less human intervention and to fail less often, and they modify existing systems to remove emergent failure modes. They do not babysit, or feed the machine with blood, sweat, tears, dinosaur grease, or any other biological product.

SREs should spend more than half their time building better systems, rather than conducting or documenting operational tasks. In a word, they should be engineering. Good engineering requires flow. Flow dies in an interruptive environment. Give your teams the time and space they need to keep ahead of the technical problem set by doing engineering, and you will get increased efficiency in all things, even as you increase scale, velocity, and scope.

We’ve seen a lot of people come to SRE conferences and talk about the NOC they’ve built for their SREs. NOCs are cool. They’re inspirational. The best of them can make you feel like a hero with the fate of the world—or at least the business—riding on your shoulders. But hero culture is an antipattern in and of itself, and SREs don’t work in a NOC, even though NOCs originally evolved for very understandable reasons.

Sometimes, you can’t beat the communications bandwidth that comes from having everyone working on a problem in the same physical space, but the tools they’re using to do that work shouldn’t be tied to that room, and neither should the jobs and/or people. NOCs aren’t conducive to good engineering work.

The NOC is the most open of open plan offices, with extra blinky and noisy distractions thrown in on top of the sea of coworkers to boot. And it is incomprehensible how our industry, which prides itself on being data driven, remains so willfully data-blind to the growing scientific evidence of how utterly unsuitable the open-plan office is for the work conducted by engineering teams.5

Don’t spend your time and treasure building rooms that try to bring ops people closer to the machines and each other 24/7.

The key here is distributed sharing and collaboration from anywhere, so that just the engineers who should actually be on call at the moment can respond immediately without leaving the productive comforts of their home, office, or really well-designed cube.6 If you want to share a link to a particular plot of a time series, you should be able to post it into a chat or an incident response tool where everyone interested in the incident can then look at the same plot with the same filters for traffic, start and end time, resolution, and so on.

Ideally, this tooling should share live data, not just a static graph or a screenshot, so that folks can use it as a starting point to play with theories, and to dig in and find discrepancies or alternate explanations for whatever is being held up as odd or offered as a cause. Any ephemeral data from these live links should be preservable with a checkbox/flag-type operation so that it will be available later for your postmortem.
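Concretely, that can be as simple as encoding the chart state itself into the link instead of sharing pixels. Here is a minimal Python sketch; the metric names, base URL, and the notion of a preserve flag are hypothetical stand-ins for whatever your observability stack actually exposes.

```python
# Share chart *state* (metric, filters, window, resolution) rather than a
# screenshot, so anyone in the incident can open the same live view.
from dataclasses import dataclass, asdict
from urllib.parse import urlencode

@dataclass
class ChartView:
    metric: str              # e.g., "frontend.http.5xx_rate" (hypothetical)
    filters: str             # e.g., "region=us-east,service=checkout"
    start: str               # ISO 8601 start of the window
    end: str                 # ISO 8601 end of the window
    resolution: str = "1m"   # aggregation step
    preserve: bool = False   # ask the backend to retain this data for the postmortem

def share_link(view: ChartView, base_url: str = "https://dash.example.com/view") -> str:
    """Encode the full chart state into a URL anyone can open and manipulate."""
    return f"{base_url}?{urlencode(asdict(view))}"

# Example: post this into the incident channel instead of a static graph.
print(share_link(ChartView(
    metric="frontend.http.5xx_rate",
    filters="region=us-east,service=checkout",
    start="2018-06-01T14:00:00Z",
    end="2018-06-01T15:00:00Z",
    preserve=True,
)))
```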

Requiring everybody who wants to discuss something to put it up on a shared monitor where other folks can’t poke at it—like you do in a NOC or an old-school war room—isn’t as good. More brains and hands able to manipulate information directly, while still preserving collaboration and sharing, will get you to a remediation of your issues faster and more consistently. And freeing your engineers from the NOC will improve their ability to deliver on actual engineering work.

Antipattern 2: Humans Staring at Screens

If you have to wait for a human to detect an error, you’ve already lost.

Humans Staring at Screens: Any practice for which the detection of a problem condition relies on a human noticing that a particular series of data is abnormal, that a combination of several datasets is problematic, or that a particular condition is relevant to a known error or outage, rather than relying on thresholds, correlation engines, velocity metrics, structured log parsers, and other tooling to detect those conditions and surface them for analysis by only the relevant humans.

 

Another old NOC paradigm is that having human beings looking at data—even partially aggregated or correlated data—is a good way to detect and respond to potential problems before they get too bad. It is not. It is an adequate way, but it is not good. Machines are much better at finding patterns in large datasets, and they should be used to do so whenever possible.

Even modeling large amounts of data in statistically valid and yet still humanly understandable ways is difficult, let alone consuming it in any quantity or over any prolonged period. Don’t spend your innovation and attention on getting a feel for constantly evolving complex systems. Machines don’t need lots of tricky user experience (UX) to consume structured data. Feed it to them and then focus on grooming only the bits that matter for human consumption.

Instead of watching graphs and manually feeding an alert or ticket into a server to document and coordinate response when you detect a problem (people feeding worthwhile work to machines), build systems that can watch data for you and detect when something is going wrong. Preferably, build systems that can also attempt some form of automated response before alerting a human if the canned playbook response doesn’t resolve the condition (machines feeding people worthwhile work). This both prevents interruptions and speeds recovery, because as Service-Level Objectives (SLOs) begin to creep above four 9s, humans simply can’t process incident detection and response fast enough on their own.
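As a rough illustration of that ordering (automation first, humans only when the canned response fails), here is a minimal Python sketch. The playbook names and the run_playbook, condition_cleared, and page_oncall hooks are assumptions standing in for whatever automation and paging systems you actually run.

```python
# A minimal sketch of "machines feeding people worthwhile work": the detector
# tries a canned remediation first and pages a human only if that fails.
# run_playbook, condition_cleared, and page_oncall are hypothetical hooks.
import time

PLAYBOOKS = {
    "disk_full": "rotate_and_compress_logs",
    "task_crashlooping": "restart_with_last_known_good_config",
}

def handle_detection(condition, run_playbook, condition_cleared, page_oncall):
    playbook = PLAYBOOKS.get(condition)
    if playbook:
        run_playbook(playbook)            # attempt the canned response
        time.sleep(120)                   # give the system time to settle
        if condition_cleared(condition):
            return "auto-remediated"      # no human was interrupted
    # Either no playbook exists or it didn't work: now it's worth a human's time.
    page_oncall(condition, context={"attempted_playbook": playbook})
    return "escalated"
```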

Build tools that make it easier for engineers who spend most of their time doing engineering work in good engineering conditions (not NOCs or open-plan offices) to be notified reliably and immediately when something needs human attention and to achieve rapid access to information, systems, and one another as necessary.

Antipattern 3: Mob Incident Response

Keep your eyes on the ball, but your feet in your zone.

Mob Incident Response: All-hands-on-deck incident handling with little thought to coordination of effort, reserves, and OSHT7 troubleshooting, sleep cycles, human cognitive limits, or the deleterious effect of interrupts on engineering work.

 

One of the other problems engendered by carrying over the NOC model to SRE teams, or even by using distributed systems that don’t carefully scope the alerts they generate, is that the natural human tendency is for everyone within reach of an alert or a troubling graph to sort of pile on and begin poking at the problem. Or, at the very least, they are pulled out of their flow and have the condition taking up attention at the back of their brain until it is resolved.

Not only is this disruptive to engineering work, but without extremely good policies and discipline surrounding incident response, you can actually increase the time it takes to analyze and resolve an issue. This is especially true if multiple people begin making changes to test multiple hypotheses simultaneously, creating unplanned interactions that prevent anyone from realizing they’d found the solution to the original problem, or that hide or destroy evidence that would help in tracing the sources of the issue.

Even if you avoid such complications through good coordination, teams can end up in situations where a problem drags on without an imminent resolution, and because everyone piled on immediately, there’s no fresh set of eyes or second shift to come in and manage the situation as the first responders’ effectiveness begins to wear down.

The Incident Command System (ICS)8 provides a good procedural framework for handling such situations, and learning and implementing something similar9 can help no matter what your tech stack or working environment looks like.

That said, we all know that relying on humans to follow procedure consistently in abnormal situations is not the best choice for avoiding problems. Why place the burden on the people you work with to do the right thing every time?

Build your detection and alerting and incident management systems to allow the necessary people to engage fully with managing the problem while protecting and preserving the attention and energies of the others until they are needed. Sooner or later, you will be glad for the foresight.

Antipattern 4: Root Cause = Human Error

If a well-intentioned human can “break” it, it was already broken.

Root Cause = Human Error: Blaming failures upon the well-intentioned actions of a human who came to an erroneous conclusion as to the probable outcome of a given action based upon the best understanding of the system available at that time or, more generally, reducing the explanation for any unwanted outcome to a single cause.

 

When systems break, it is good and right to look for factors contributing to the failure so that we can seek to reduce the likelihood that the system will fail in that same manner again. The desire to prevent such recurrent failures is a very powerful incentive to identify causes that can be understood and remedied. When, in the course of such an investigation, we arrive at a human making a choice that brings about an unintended and harmful consequence, it is very tempting to stop there, for several reasons.

It allows us to feel we can shift from an investigative mode to “fixing the problem,” which gives us closure and the thought we might be protected from a future problem.

It is also objectively harder to map all of the possible chains of contributing factors inherent in a person as opposed to a machine. Figuring out the context in which that person was operating—their knowledge, the exact actions they took, and the myriad possible inputs and factors they bring with them, not only from the work environment, but even from their life outside the immediately relevant interaction—is toilsome and imprecise at best and can produce outright misleading and erroneous data and conclusions at worst.10

We have collectively cut our teeth on decades of dualistic explanations seeking to determine whether the “cause” of an incident was a hardware failure or a human screw-up. But even “purely” mechanical failures could probably be traced back to a human error of commission or omission by an omniscient investigator. The bearing that seized wasn’t lubricated properly, or cooled too quickly during manufacturing, or was dropped, or inserted with too much force, or should have been replaced more frequently, or, or, or.

And the same is true conversely for “human” causes. If a human didn’t perform any of those actions appropriately, there was probably a mechanical test that could have caught it, or a greater tolerance that could have been built into the part or the system, or a more visual checklist indicator that could have made the failure to perform maintenance an immediately apparent and recoverable error rather than a catastrophic production failure.

There are two seductive but unhelpful tendencies at play here. The first is the misperception of human or hardware or software or any other particular class of error as a cause of failure rather than an effect of a flawed system that would inevitably generate failure again, given a chance.

The second is the notion that post hoc incident analysis can or should actually be reduced to a “root cause” at all. Postmortem analysis should have as a goal a thorough understanding of all aspects of a failure, not the search for a discrete smoking gun.

All too often, the “root cause” is just the place where we decided we knew enough to stop analyzing and trying to learn more, whether because it was difficult to go further or because the step in question matches well with a previous condition we think we understand or for which we have a likely solution.

Instead of a root cause, or even causes, I like to think of contributing factors, whereby any one of them might not be enough to cause the observed behavior, but they all contributed to a pattern of failure that ultimately became perceptible to users of the system. Analysis needs to allow for systems where a surprising and/or undesirable emergent behavior is the result of a web of many separate conditions, all of which are necessary but none of which is sufficient to bring about or initiate the problem, rather than flowing from a linear causal chain of consequences back to the first toppled domino.

The traditional postmortem template has spaces for both, but the most interesting things always turn up in the contributing factors. That’s where we find things that might not even have been the most critical flaw directly influencing the outcome currently under investigation, but which might have a much more profound and far-reaching influence on a broader group of systems, teams, and outcomes in the overall organization.

So, don’t hate on those horrible humans, and don’t race to root. Causal analysis is not Capture the Flag. Even if you think you already know what happened and where things “went wrong,” take your time and explore the system as a whole and all the events and conditions leading up to a problem you’re trying to analyze.11

Antipattern 5: Passing the Pager

On-call can’t be off-loaded.

Passing the Pager: Assigning ultimate responsibility for responding to system failures to teams or individuals who did not create the system generating the failures.

 

Another hangover from the old operations world is that a lot of product developers hear about the way Google SREs take the pager for services and think that means incident response isn’t still the product team’s job—that they should get one of those SRE teams so they don’t have to be on call.

This isn’t the first or last time I will say this, but reliability engineering is velocity engineering. One of the key characteristics of highly performant organizations is rapid feedback loops from the moment production code is created, through integration, testing, and deployment, right up to the performance of that production code in the real world.

Divorcing the creation of software from the production consequences of that software by off-loading the pager entirely breaks that feedback cycle, prevents rapid learning and iteration on the part of the product team, and sets teams up for an antagonistic relationship as production teams attempt to gain some measure of control over the behavior of systems created by product developers who have no incentive so strong as the delivery of features.

This is not the SRE paradigm. On-call is a shared responsibility. There are a lot of different patterns for how to share it (product devs rotating through SRE stints, product taking secondary on-call, both product and SRE in the same on-call rotation, etc.), but even where SREs completely own the primary pager response, the product development team needs to participate in the form of a product on-call that can be contacted by the SRE team to speed resolution of any problems that aren’t purely the result of deployment or infrastructure issues.

In all cases, ultimate responsibility for handling production incidents remains at all times with the team developing a service. If the operations load becomes too heavy, the product team needs to take the overflow, fix the technical debt, or might even end up losing pager support entirely.

Of course, if the load is just the result of a healthy service whose operational load is growing sublinearly beyond the capacity of the SRE team, growing the SRE team is an option as well. Some people are tempted to do this even when they know the system is overburdened with technical debt: “Let’s just add some folks for now, and then we’ll fix it down the line.”

But good SREs are even harder to hire than good product developers, so it’s not really possible to hire ahead of the operational load of an undesirably burdensome service without dragging a team down into an operational quagmire from which the organization you end up with will be unable to recover.

It is almost impossible to hire your way out of technical debt. You need to commit your organization to valuing and pursuing reliability and scalability with whatever resources you have, and that means not accepting bad scaling models, whether technical or human.

Antipattern 6: Magic Smoke Jumping!

Elite warrior/hero culture is a trap.

Magic Smoke Jumping: Valuing incident response heroics over prudent design and preventive planning. This includes situations where all three are being done, but the IR heroics receive the only, or the most effusive, public praise and rewards.

 

Most of us have been guilty of this. SREs are not Smokejumpers. They are not Systems SEALs. Yes, it is a calling that requires a rare combination of skills and knowledge. Yes, we continually prepare and drill and train to handle outages when they do come.

And yes, it feels good when people from all through the organization—especially senior people and business or product development folks not even in your department—recognize you as the responder who saved the day and thank you for everything you endured in order to prevent the End of the World. Plus, we get these neat patches and achievement badges and get to tell “war stories” about “fighting fires.”

But the hero culture concept that lauds and rewards responders for personal sacrifice in the face of system failures is destructive. Not only does the response itself suffer, but rewarding operational endurance rather than good engineering and prevention provides the wrong incentives and leads directly to ops churn and engineer burnout.

Unrested engineers are unproductive engineers and, on the whole, more unhappy people than they would be without prolonged or frequent interruptions in their work and personal life.

Being on call is a good way to learn how complex services fail and keep in touch with the as-built characteristics of your systems. But it should be once or twice a quarter, not once or twice a week, and it shouldn’t involve sprinting a marathon because the team is too small and there’s no one else on it or outside of it to whom you can hand off.

The incident load per shift should be low enough so as not to overwhelm the time available for work, sleep, family, or any other critical component in a sustainable life—a couple a week at most on average, and when someone gets blasted with more than that in a shift, there should be no shame in rotating in other engineers to pick up the overflow or to pick up other duties while the on-call catches up on sleep...or life.

Instead of praising someone who takes on an entire outage themselves, we should be questioning why we didn’t rotate in additional personnel as needed, and asking whether the system design is as simple, reliable, resilient, and autonomous as possible.

If you’re a leader in the site reliability function at your organization, you need to do everything you can to promote a culture throughout the company or institution, all the way to the highest levels, that models sustainable work and incident response, and praises engineering work that improves scale, resilience, robustness, and efficiency above responders who throw themselves into the breach left by the company as a whole not prioritizing such work.

Antipattern 7: Alert Reliability Engineering

Monitoring is about ensuring the steady flow of traffic, not a steady flow of alerts.

Alert Reliability Engineering: Creating a monitoring/logging infrastructure that results in a constant flow of notifications to the system operators. Often a result of adding a new alert or threshold for every individual system or every historically observed failure.

 

Alerts need to scale sublinearly with system size and activity, just like everything else SRE does. You should not become uneasy because you haven’t heard any low-level, nonurgent, spammy system notices from your logs and monitoring in the last hour, deciding that it’s quiet, too quiet, and that you need to begin looking for magic smoke signals and fire up those jump planes.

Stop paging yourself for anything other than UX alerts. Or, at the very most, those user-facing outages plus any imminent failures on whatever Single Points of Failure you might not yet have engineered out of your organization. For critical failures that are not immediately customer detectable, velocity-bounded thresholds are your friends. If you are losing systems at an unsustainable rate, or if you get down to three replicas out of five, or if your data store detects corruption that you need to head off before it ends up replicated into your clean copies, by all means page away. But don’t do it naively at every failure or sign of trouble.
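To make that concrete, here is a minimal, hypothetical paging policy in Python. The thresholds are illustrative, not recommendations; the point is that pages are reserved for user impact, imminent quorum loss, corruption racing replication, and an unsustainable velocity of failure.

```python
# Page on user-facing SLO burn, on replica counts approaching quorum loss, on
# corruption that must be stopped before it replicates, or on an unsustainable
# *rate* of host loss: not on every individual failure.
def should_page(slo_fast_burn: bool,
                healthy_replicas: int,
                total_replicas: int,
                hosts_lost_last_hour: int,
                sustainable_loss_per_hour: int = 5,
                data_corruption_detected: bool = False) -> bool:
    if slo_fast_burn:                                      # users are feeling it
        return True
    if data_corruption_detected:                           # must beat replication
        return True
    if healthy_replicas <= (total_replicas // 2) + 1:      # e.g., 3 of 5 left
        return True
    if hosts_lost_last_hour > sustainable_loss_per_hour:   # velocity bound
        return True
    return False   # everything else is a ticket or a log line, not a page
```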

There’s no shame in accommodating a system with a mixed maturity model where necessary. You start with the hand you’re given, and work to change out the jokers as you’re able. But you’ll never get to a big payday if you don’t accept that it’s more important to focus effort on improving your systems than it is to burden people trying to achieve a state of flow and do effective engineering work with a stream of constant alerts and attention interrupts.

Outside of that, however, host alerts are worse than no alerts. Focus on alerting for system- or at least site-wide metrics. Otherwise, you will either end up with alert fatigue and miss critical issues, or the vital project work that can get—and keep—you out of operational churn will be buried under an avalanche of interrupts, and you will never start the virtuous cycles of productivity/efficiency that are at the heart of SRE.

In a reasonable system, outages page. Lower-grade problems that can’t be resolved automatically go into your ticketing system. Anything else you feel you have to note goes into logs, if anywhere. No email spam. Email is for high-value, actionable data created by your colleagues,12 like meeting invites and 401K renewal notices.13
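That routing policy reduces to something like the following sketch; the severity levels and the page, open_ticket, and log sinks are placeholders for whatever your stack provides.

```python
# Outages page, unresolved lower-grade problems become tickets, everything
# else goes to logs. Nothing goes to email.
from enum import Enum

class Severity(Enum):
    USER_FACING_OUTAGE = 1        # page a human now
    NEEDS_HUMAN_EVENTUALLY = 2    # automation couldn't fix it, but users are fine
    INFORMATIONAL = 3             # nobody should be interrupted for this

def route(severity: Severity, page, open_ticket, log):
    if severity is Severity.USER_FACING_OUTAGE:
        page()
    elif severity is Severity.NEEDS_HUMAN_EVENTUALLY:
        open_ticket()
    else:
        log()   # and never into anyone's inbox
```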

Don’t just accept the noise floor, either alerting on all of it or giving up and not alerting on it at all. You can fix the noise with aggregation and velocity trending.

Antipattern 8: Hiring a Dog-Walker to Tend Your Pets

Configuration management should not be used as a crutch.

Hiring a Dog-Walker: Using advanced configuration-management tools like Puppet or Chef to scale mutable infrastructure and snowflake servers to large numbers of nodes, rather than to help migrate to an “immutable” infrastructure.

 

Configuration management is great. It is a prerequisite for doing real reliability engineering, but we’ve seen people presenting on how they’ve managed to scale it to support hundreds of configurations for thousands of hosts.

This is a trap. You will never be able to scale your staff sublinearly to your footprint if you don’t instead use configuration management as a tool to consolidate and migrate to immutable infrastructure. And if you’re not scaling your staff sublinearly, you’re not doing SRE. Or not successfully, anyway. Sublinear scaling is the SRE watchword, for people, processes, systems—everything.14

Pets < Cattle15 < Poultry.16 Containers and microservices are the One True Path.17 At least until they are potentially replaced by “serverless” functions.18 In the meantime,19 get your fleet standardized on as few platforms as possible, running idempotent pushes of hermetic build and config pairings. I could throw some more buzzwords at you here, but instead I’ll just tell you to read Jonah Horowitz’s chapter on immutable infrastructure, right here in this very book (Chapter 24).
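One hedged illustration of what “idempotent pushes of hermetic build and config pairings” can mean in practice: make the unit of deployment an immutable pairing of an image digest and a config hash, so that pushing the same release twice changes nothing. The record fields and the rollout hook below are hypothetical.

```python
# The release identity is derived from content (image digest + config hash),
# never from mutable host state, so a push is naturally idempotent.
import hashlib, json

def release_id(image_digest: str, config: dict) -> str:
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]
    return f"{image_digest}-{config_hash}"

def push(release: dict, currently_running: str, rollout) -> str:
    rid = release_id(release["image_digest"], release["config"])
    if rid == currently_running:
        return rid        # same content already running: nothing to mutate
    rollout(rid)          # replace instances wholesale; never patch a host in place
    return rid
```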

Antipattern 9: Speed-Bump Engineering

Prevention of all errors is impossible, costly, and annoying to anyone trying to get things done.

Speed-Bump Engineering: Any process that increases the length of the time between the creation of a change and its production release without either adding value to or providing definitive feedback on the production impacts of the change.

 

Don’t become a speed bump. Our job is to enable and enhance velocity, not impair it. Reliable systems enhance velocity, and systems with quick production pipelines and accurate real-time feedback on system changes and problems introduced enhance reliability.

Consider using error budgets to control release priorities and approvals.20 If you aren’t, explicitly define what criteria you’re using instead, and how they provide an effective mechanism for controlling technical debt without requiring political conflict between production and product engineering.
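For example, a minimal error-budget gate might look like the following sketch, assuming a rolling 30-day window; the SLO value and the slowdown threshold are illustrative only.

```python
# Translate the remaining error budget into a release posture.
def remaining_error_budget(slo: float, observed_availability: float) -> float:
    """Fraction of the window's budget still unspent (can go negative)."""
    allowed = 1.0 - slo                      # e.g., 0.001 for a 99.9% SLO
    spent = 1.0 - observed_availability
    return (allowed - spent) / allowed

def release_policy(slo: float, observed_availability: float) -> str:
    budget = remaining_error_budget(slo, observed_availability)
    if budget <= 0:
        return "freeze features; ship only reliability fixes"
    if budget < 0.25:
        return "slow down; prioritize reliability work next"
    return "ship features normally"

# A 99.9% SLO with 99.95% measured availability leaves half the budget unspent.
print(release_policy(0.999, 0.9995))
```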

Whatever you use needs to be nonsubjective, and it must not require everyone in the conversation to have an intimate technical understanding of every aspect of every component in the service. Change Control Boards fail both of these tests, and generally fail at satisfying reliability, velocity, and engineering time efficiency as well.

Studies have shown that lightweight, peer-review-based release controls (whether pair coding, or pre- or post-commit code reviews) achieve higher software delivery performance, while additional controls external to the engineers creating the changes are negatively correlated with feature lead time, deployment frequency, and system restore time in the event of a failure. They also have no correlation with the rate of change-induced failures.21

There are legitimate reasons to gate releases, but they should only be about concerns unrelated to the contents of the releases themselves, such as capacity planning. Such gates should also be ruthlessly pruned and analyzed regularly to make certain they apply only to the minimum set of circumstances possible (completely new product launches rather than feature releases on existing products, for example, or capacity planning requirements only for launches of services over a certain percentage of total system capacity) and only for as long as the constraining circumstances that require them remain applicable (perhaps only so long as the company is building its own data centers/clusters rather than contracting with third-party cloud providers that can provide capacity on demand, or only until a standardized framework can be created to automatically handle the implications of new privacy legislation appropriately).

Be very careful about placing any obstacle between an engineer and the release of a change. Make certain that each one is critical and adds value, and revisit them often to make certain they still provide that value and have not been rendered irrelevant by other changes in the system.

Antipattern 10: Design Chokepoints

Build better tools and frameworks to reduce the toil of service launches.

Design Chokepoints: When the only way for every service, product, and so on in an organization to be adapted to current production best practices is to go through a non-lightweight process involving direct consultation with a limited number of production engineering staff.

 

Your reliability team should be consulting on every product design. But your reliability team cannot scale sublinearly if they directly consult on every product design. How do you resolve this?

Many teams use direct consultation models, either through embedding the engineers working on production tooling and site reliability within their product development organization or by holding office hours for voluntary SRE consults and conducting mandatory production reviews prior to launch.

Direct or embedded engagement has many things to recommend it, and I still use it for building new relationships with product teams or for large, complex, and critical projects. But, eventually, we reach a point where even temporary direct engagements like tech talks, developer production rotations/boot camps, production readiness reviews, and office hours can’t scale.

That’s no reason not to do them, because they are incredibly beneficial in their own right to collaboration, education, and recruitment. But we need something more.

If you’re not looking at creating and maintaining development frameworks for your organization that incorporate the production standards you want to maintain, you’re missing a great opportunity to extend SRE’s impact, increase the velocity of your development and launch processes, and reduce cognitive load, toil, and human errors.

Frameworks can make certain that monitoring is compatible with existing production systems; that data layer calls are safe; that distributed deployment is balanced across maintenance zones; that global and local load balancing follow appropriate, familiar, and standardized patterns; and that nobody forgets to check the network class of service on that tertiary synchronization service cluster or comes out of a production review surprised to learn that they have three weeks’ worth of monitoring configurations to write and a month-long wait to provision redundant storage in Zone 7 before they’ll be able to launch because of a typo when they filled out the service template.22

Antipattern 11: Too Much Stick, Not Enough Carrot

SRE is a pull function, not a push function.

Too Much Stick: The tendency to mandate adoption of systems, frameworks, or practices, rather than providing them as an attractive option that accomplishes your goals while making it easier for your partners to achieve theirs at the same time.

 

A common theme uniting the previous two antipatterns but extending beyond them is that you will not get any place you want to go by trying to be the gatekeepers of production, or by building a tool and then trying to force people to use it.

Worse, it has been scientifically verified that you will do harm. Organizations where teams can choose their own tooling, and which have lightweight, intrateam review processes, deliver better results faster than those trying to impose such decisions on their teams externally.23

Security and production engineering teams that make it harder for people to do their jobs rather than easier will find they haven’t eliminated any risk; they’ve only created a cottage industry in the creative bypass of controls and policies.

At Dropbox, the mascot for one of our major infrastructure efforts around rearchitecting the way teams deploy services is an astronaut holding a flag emblazoned with a carrot emblem (Figure 23-1). It’s not just cute, it’s an important daily reminder to the teams involved.

Figure 23-1. To boldly grow24

You should focus on building better developer infrastructure and production utilities such that product teams will see productivity wins from adopting your tools and services, either because they are better than their own, or because they are “good enough” for the 80% use case while still offloading significant development effort, and significant operational load as well, for turning up and supporting services.

Your goal is to be building cool-enough stuff, by listening to your colleagues’ pain points, that they will see a greater advantage and a bigger increase in their ability to execute from adding an engineer to the SRE team than from adding one to their own team, and will begin lamenting that you can’t find more people to hire rather than fighting over funding.25

The only way to do this is to build good tools and collect good metrics about the real productivity benefits they provide. The good news is that reliability engineering is also velocity engineering. You are not a cost center. You are not a bureaucrat. You are not a build cop. You are a force multiplier for development and a direct contributor to getting the customer what they want: reliable, performant access to the services that make their lives better and more efficient.

Antipattern 12: Postponing Production

Overly cautious rollouts can produce bigger problems.

Postponing Production: The imposition of excessive lead time and testing delays in a misguided attempt to prevent any possibility of system failure, especially where it interferes with the capability for engineers to get feedback on the real impacts of their changes rapidly and easily.

 

Sometimes, in our desire to protect production from potentially bad changes, we set up all sorts of checks and tests and bake-in periods to try to detect any problems before the new bits ever see a production request.

Testing before production is important, but we need to make sure that it doesn’t introduce a significant delay between when developers make changes and when they get real feedback on the impact of their release.

The best way to do this is through automation and careful curation of tests, and, where possible, through providing early opportunities for production feedback through dark launches, production/integration canaries, and “1%” pushes, potentially even in tandem with the execution of your slower and more time-consuming tests (load, performance, etc.), depending on your service’s tolerance for, or ability to retry, errors.

We need to make it possible for product developers to know the actual impacts of their release as soon as possible. Did error rates go up? Down? How about latency? We should expose these kinds of impacts automatically as part of their workflow, rather than making them go look for information.

Product developers are part of your team. They should be able to see what results their efforts are producing in production in as close to real time as possible.

Focus on shortening the feedback loop for everything you can, from early development testing to performance testing to production metrics. Faster feedback delivers greater velocity while actually increasing safety, rather than imperiling it, especially if you are doing automated canary analysis and performance/load testing. Computers are better at finding patterns in large datasets than humans. Don’t rely on humans.

The ability to quickly roll forward or backward, in conjunction with the ability to slice and dice your traffic along as many meaningful distinctions as possible and control what portions get delivered to which systems, reduces the risk of these practices because when a problem is discovered you can react to it. When coupled with the ability to automate rollback of changes that are detectably bad or not performant, the risk drops almost to nothing.
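A minimal sketch of that kind of automated canary judgment follows; the tolerance ratios and the promote and rollback hooks are assumptions, not any particular vendor’s API.

```python
# Compare the canary slice against the baseline and pull it back automatically
# if it looks worse; no human needs to notice first.
def canary_is_healthy(baseline: dict, canary: dict,
                      max_error_ratio: float = 1.5,
                      max_latency_ratio: float = 1.2) -> bool:
    errors_ok = canary["error_rate"] <= baseline["error_rate"] * max_error_ratio
    latency_ok = canary["p99_latency_ms"] <= baseline["p99_latency_ms"] * max_latency_ratio
    return errors_ok and latency_ok

def evaluate_push(baseline: dict, canary: dict, promote, rollback) -> str:
    if canary_is_healthy(baseline, canary):
        promote()      # widen from the 1% slice toward full rollout
        return "promoted"
    rollback()         # detectably bad: revert before users notice
    return "rolled back"
```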

Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)

Failure is inevitable. Get good at handling it, rather than betting everything on avoiding it.

MTTF > MTTR: Inappropriately optimizing for failure avoidance (increasing Mean Time to Failure [MTTF]), especially to the neglect of the ability to rapidly detect and recover from failure (Mean Time to Recovery [MTTR]).

 

Delayed production rolls are essentially a variant of a broader antipattern, which is spending disproportionate design and operational effort to keep systems from failing, rather than ensuring that they can recover from the inevitable failures quickly and with minimal user impacts. But resilience trumps robustness except where entropy prevents it.
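One way to see why recovery matters at least as much as failure avoidance is the textbook steady-state availability approximation:

```latex
\text{Availability} \approx \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}
\qquad\Longrightarrow\qquad
\text{Unavailability} \approx \frac{\text{MTTR}}{\text{MTTF}} \quad (\text{when } \text{MTTR} \ll \text{MTTF})
```

In that regime, halving MTTR buys roughly as much availability as doubling MTTF, and it is usually the cheaper of the two to engineer.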

The truth is, there are applications for which it is not possible to recover from some kinds of failures. To take a page from medicine, when someone experiences brain death, we can’t bring them back. So, it makes sense to try to make the body as robust as possible through exercise and healthy diet and do everything we can to prevent an organ failure or other breakdown that might cascade into a “negative patient outcome.”

Exercise and diet can improve resilience and work to prevent the occurrence of, say, a heart attack that will cause the heart to stop. But even a single lifetime ago, we were unable to bring people back from acute hypothermia or cardiac arrest. Then came laborious manual resuscitation in the 50s, then centralized defibrillation by doctors dispatched from hospitals in the 60s, then rapid-response decentralized portable defibrillation by paramedics in the 70s, then even more rapid automated and highly distributed defibrillation by EMTs in the 80s, and ubiquitous automated defibrillation by the general public in the 90s and the implantable/wearable units that followed.26

Now, we have recovery mechanisms that can help restore function for people suffering those traumas even where prevention failed. We can even use the planned inducement of one failure (hypothermia) to help increase the survivability of other traumas (drowning, heart stoppage), or even cause a controlled heart attack to help us prevent a larger, more damaging unplanned one in the future. Every year, we get better at replacing all sorts of organs we couldn’t prevent from failing. We haven’t cracked the problem yet, but we’re already seeing rapid detection, response, and recoverability averting catastrophes that prevention could not.

We’re much further along that path when it comes to computing. The companies whose site reliability efforts I have knowledge of experience failure every day. But they expect it, anticipate it, and design everything in the sure and certain knowledge that things will break, traffic will need to be rerouted without pause, and failed systems will need to be brought back into play in short order, with automated redeployments rather than extensive administrative effort.

Chaos Engineering is a critical tool of modern service planning and design. One hundred percent uptime is a myth, because all change introduces risk into your system. But freezing changes removes your ability to address existing risk before it results in failure, and there is always risk.

Either way, your systems will fail, so it’s important to accept that failure will occur and seek to minimize the size of your failure domains. Introduce capabilities to tolerate failure and provide degraded service rather than errors where possible. Sufficiently distribute your service so that it can use that capacity for toleration to continue service from the portions of your infrastructure outside the failure domain. And minimize the time and degree of human intervention needed to recover from failure and—where appropriate—reprocess any errors.

One of the best ways to ensure that these design principles are adhered to in practice is to introduce survivable failures routinely into your system.

An interviewer once asked me, “What are the characteristics of a good SRE?” One answer is that SREs need to be good at writing software, debugging systems, and imagining how things can fail.

That last one is part of what defines the difference for me between a software engineer working primarily on reliability and one working primarily on product development. We play a game sometimes in which we’ll take a running instance of production, isolate it, and then try to guess what will happen if we break something in a particular way.

Then we go ahead and do it and see whether we get the predicted behavior or some other failure mode that’s maybe a little more exciting and unexpected. We try to find the thresholds and the tipping points and the corner-case interactions and all the wonderful ways in which a complex system you thought you knew intimately can still surprise you after several years.

It’s a lot like hacking or software testing, just with a different focus, and so a different set of attack surfaces and leverage. After a while, we become connoisseurs of failure. And then we get to try to figure out how to keep the things we discover from affecting the users of the service, and then try to break those fixes all over again. It’s a good time, if you’re into that sort of thing.

When you’ve played that game, though, the best thing is to take the most impactful of the lessons you’ve learned from it and incorporate them into your automated stress/canary/production testing—your Chaos Monkey or whatever tooling you’re using—so that these kinds of tests are applied to the system regularly over time and help make sure that future system changes don’t result in degraded robustness or resilience. This is possible only if you have good resiliency, and specifically, detection, rollback, and recoverability, so that production traffic can be shielded from the consequences of any newly introduced weaknesses/regressions.

When we work to make sure that systems will face these kinds of tests regularly, it forces site reliability developers to think more about product design and what infrastructure help they can provide to product teams. It also forces product developers to think more about designing for scale and survivability and making sure they take advantage of the reliability features and services that SRE helps provide for them. This keeps the explicit contract at the genesis of SRE—to prioritize reliability as a technical feature wherever it is required—front and center in everything the organization undertakes, without extraneous effort from either group.

Antipattern 14: Dependency Hell

Dependency control is failure domain control.

Dependency Hell: Any environment in which it is difficult or impossible to tell what systems depend upon one another, to tell whether any of those dependencies are circularities, or to control or receive notification of the addition of new dependency relationships, as well as any impending changes to the interoperability or availability of entities within the dependency web.

 

In any mature organization, where the software development life cycle (SDLC) has reached the point that old projects and tools are being deprecated and retired as new platforms and components are launched, unless care is taken, the interdependence of those components will inevitably grow beyond the easy knowledge of any one person. Predicting what might be affected by your changes, or what other changes might affect the systems under your control, and planning accordingly becomes humanly impossible.

Make sure you have a facility to detect, in an automated way and in something akin to real time, what dependencies are being added to your services (so that you can have timely conversations with the new dependency’s owners ahead of launch, if necessary), and that your disaster plans and road maps get updated accordingly.
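In its simplest form, that facility is a diff between the dependency edges observed in traffic and the ones each service has declared; here is a hypothetical sketch (the telemetry source and registry are whatever you already have).

```python
# Flag undeclared dependency edges so the conversation can happen before
# launch rather than after an outage.
def new_dependencies(observed_edges: set, declared_edges: set) -> set:
    return observed_edges - declared_edges

observed = {("checkout", "payments"), ("checkout", "ledger")}   # from RPC/mesh telemetry
declared = {("checkout", "payments")}                            # from the service registry
for caller, callee in new_dependencies(observed, declared):
    print(f"undeclared dependency: {caller} -> {callee}; update DR plans and road maps")
```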

Chaos engineering—or even simply business as usual—will point this out to you eventually, of course, but it’s far better to track and plan for this before things get to that point. As an added bonus, explicit tracking of this can be used bidirectionally and can save service owners a great deal of time and energy in the process of migration, deprecation, and turndown.

Antipattern 15: Ungainly Governance

You can’t steer a mosquito fleet like a supertanker.

Ungainly Governance: Attempting to run an Agile service delivery and dev/prod infrastructure group within a larger organization that is unwilling to adopt Lean and Agile principles in its other operations as well, or at least at the point of demarcation; doing so is difficult, if not impossible.

 

If your larger organization is locked into an antique structure of traditional IT governance, in which approvals, budget, and deliverables are tied to specific large projects or project bundles, you’re going to have a hard time realizing the kind of continuous development, improvement, and release processes that are at the core of SRE, even if you are Lean and Agile within your own purview. Rigid money buckets, inadequate or unused features, zombie projects, and all manner of pervasive capacity misallocation will inevitably result.

Budget should be something that flows to organizations, not projects. And organizational leadership should be held accountable for outcomes—the results delivered across the company for the resources invested in the organization—rather than which hardware was purchased for what prices and how many hours were spent on what tasks.

Incentivize leaders to deliver quality results efficiently, and then trust them to budget and drive that process within their own organization rather than pushing down requirements and prescriptions. Prefer continually tracking metrics of how you’re improving current or eventual business outcomes over the course of the work, rather than judging or enforcing outcomes by adherence to the original plan or budget, and use those metrics to guide you in updating your planning and execution as you go. Look at using strategic alignment techniques like Hoshin Kanri, Catchball, and Objectives and Key Results (OKRs) to transmit broad goals to your reliability engineering organization, rather than fully enmeshing it in whatever more rigid systems might pertain elsewhere in your enterprise.

Honestly, this is one of those implicit assumptions that nobody talks about because most SRE organizations exist in a culture that has already abandoned the old governance models, and SREs don’t realize we’re swimming in the water until we wash up on the shores of some brown-field opportunity and suddenly find ourselves wide-eyed and gasping.27

Getting these kinds of foundational, cultural alignments in place is critical to getting an effective reliability engineering culture and teams under way in an existing technology organization, right up there with not Passing the Pager (Antipattern 5) or (re-)inventing Site Reliability Operations (Antipattern 1).

Antipattern 16: Ill-Considered SLOh-Ohs

SLOs are neither primarily technical nor static measures.

Ill-Considered SLOh-Ohs: SLOs set or existing in a vacuum of user and business input and either not tied bidirectionally to business outcomes, priorities, and commitments, or not updated to reflect changes in the same.

 

SLOs are business-level objectives. They should not be set based on what you can deliver but based on what you need to deliver in order to be successful with your customers, whether internal or external.

Time after time, we see teams slap a monitor on their system, measure all sorts of things for a month, and then pick some Service-Level Indicators (SLIs) and set their SLOs based on what they measured over that period. And then never think about those levels again.

The SLO process begins when you’re designing a system. It should be based on the business case and deliverables for the system, as discovered through product management, customer support, developer relations, and any number of other channels.

SLIs should be chosen based on intelligent engineering discussion of what things matter in a system and how appropriate operation of those things can be proven.

SLOs should be set through reasoned analysis of what performance and availability are needed to be useful to (and preferred by!) customers over the other options already or soon to be available to them.
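Whatever tooling you use, it helps to express the result of that analysis as data rather than prose. A minimal, illustrative sketch follows; the SLI description, target, and window are placeholders for numbers that must come out of the business conversation described above.

```python
# An SLO as a small, reviewable piece of data: what is measured, the target,
# and the window over which it is evaluated.
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    sli: str            # how "good" is measured
    target: float       # fraction of events that must be good
    window_days: int    # rolling evaluation window

    def is_met(self, good_events: int, total_events: int) -> bool:
        return total_events > 0 and (good_events / total_events) >= self.target

checkout = SLO(
    name="checkout availability",
    sli="requests answered successfully in < 300 ms / all requests",
    target=0.999,
    window_days=28,
)
print(checkout.is_met(good_events=999_500, total_events=1_000_000))   # True
```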

If you’re not doing this, you’ll end up optimizing for the wrong things, and find that adoption is stunted in sectors that you were counting on, even though your service has good adoption in others. You’ll also find that you’ve made the wrong choices in instrumentation, so that when you conduct some kinds of maintenance, or experience certain transmission problems, service errors, or capacity degradation, your system becomes useless to customers without you even being aware that anything is wrong until it shows up on Twitter or Reddit.

SLOs should be a living document. If you don’t have a mechanism for revisiting them periodically or as needed, they are going to become irrelevant. If your targets don’t keep up with users’ changing needs, they will abandon your service. If you’re exceeding the promised level of service significantly and consistently, you’re going to surprise users in unpleasant ways when you make decisions about what strategies are acceptable for migrations, maintenance, production testing, or new rollouts.

Most important of all, your entire business needs to be aligned behind your SLOs. Everyone needs to know explicitly how they tie to revenue or other business goals. The SLO is not simply a stick with which to beat SREs when goals are being missed. It is a lever that should be able to elicit support and resources from the larger organization as necessary, as well as drive design decisions, launch schedules, and operational work. These are business commitments, not just SRE commitments.

Capital and operational budgets need to reflect the priorities expressed in these metrics. Staffing, design decisions, and work prioritization will, at the root, flow from this data every bit as much as market research or product brainstorming sessions.

Decisions need data, and if everything is working properly, the data about how current resource allocation is affecting service and how service is affecting revenue—or work outputs, or patient outcomes, or whatever top-level organizational goal it ultimately ties to—is one of the most fundamental decision inputs that will drive your organization. So make sure these key metrics and targets are well chosen, appropriately scoped, and broadly understood and accepted.

At their heart, SLIs and SLOs are a tool for reasoning and communicating about the success of an organization. That organizations inevitably produce systems that reflect their organizational communication structures is received truth at this point.28

We don’t have experimental evidence for causality in the other direction, but in the case of SLOs, it seems likely that organizations inevitably evolve to reflect their established communication structures every bit as much as do the systems they produce.

Antipattern 17: Tossing Your API Over the Firewall

Server-side SLOs guarantee customer outages.

Tossing Your API Over the Firewall: Failing to collaborate and integrate with key external parties using the well-established methods by which SREs collaborate and integrate with their internal customers and partners, and not measuring, sharing responsibility for, and attempting to remediate risks from outside your own systems to successful customer outcomes.

 

At the core of what the DevOps philosophy teaches us is the realization that operational silos result in missed SLOs. The effects of laggy communication boundaries and “Somebody Else’s Problem” between distinct organizations are not as different as we would like to believe from the effects they create between the distinct teams within them.

Tossing your API over the internet to your customers—that is, handing them a Service-Level Agreement (SLA) for response times at the edge of your network, cashing their check, and then waiting for their inevitable support tickets—is as much an antipattern as tossing your code or binaries over the wall to the ops team was a decade ago, and for the same reasons.

Before you become too indignant about this statement, remember: tossing requirements and code over the wall was a conscientious, disciplined, and accepted business practice a decade ago, just like defining SLIs and SLOs based only on the systems controlled by your own company is today.

I’m going to steal some math here for a moment,29 but over a 30-day window, hitting a 99.99% reliability target means you can miss your SLO for only 4.32 minutes. If you want to hit 99.999%, that drops to 26 seconds.
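The arithmetic behind those figures, for anyone who wants to check it:

```latex
30 \times 24 \times 60 = 43{,}200 \ \text{minutes in a 30-day window}
\\
43{,}200 \times (1 - 0.9999) = 4.32 \ \text{minutes of allowable downtime}
\\
43{,}200 \times 60 \times (1 - 0.99999) \approx 25.9 \ \text{seconds}
```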

What this math means is that without shared metrics and alerting, in which you are paying attention to the performance of the clients that are ultimately consuming your service, and any intermediate partners that own those clients and depend on you in order to service them, there is no way for your customers to consistently meet even a four 9s SLO for their own users, even if your system achieves that threshold itself.

Just the time needed for them to get paged, investigate, file a ticket that pages you, and have you investigate—even without repair time—will blow their error budget for the month. Likely for the entire quarter.

After your own house is in order and you can deliver four 9s or higher, you need to get out ahead of this. Look at your traffic. Talk to your PMs. Figure out who your critical customers are. Integrate and establish cooperation in advance.

Decide on shared metrics. You can’t have a conversation if you aren’t talking about the same things. Make sure everyone understands what the SLOs really mean, and that the on-call response guarantees make sense with the SLOs. Put shared communication and incident response procedures in place before you have a problem, so you can have the same kinds of response patterns to your critical customers’ issues as you have to the issues of other teams within your organization.

When I onboard significant external customers, we sit down and look at what they are measuring versus what we are measuring, and at what the system as a whole, both our parts and theirs, needs to deliver to their customers rather than just to them. We determine any changes needed to grow gracefully together, and we roll them out in conjunction.

To do that, we need to commit to immediate alerting without first establishing whether the problem is on one end or the other, and we have to provide remote pathways to shared integration testbeds and production canaries. Always, of course, keep in mind that SRE’s customers must be good partners regardless of whether they are internal or external: unresolved tech debt and unsustainable operational burdens will result in this kind of unmediated engagement being devolved.

It is not just the Googles and the Amazons that need to think about this sort of thing. As demand for ubiquitous services grows, and more companies build interdependent offerings, we’re going to have to collaborate more with our most valuable customers and partners. The level at which Google CRE30 is doing this might be beyond the scope needed today by most organizations, but the ideas are readily applicable to many services.

Antipattern 18: Fixing the Ops Team

Organizations produce the results that they value, not the results their components strive for.

Fixing the Ops Team: The mistaken belief that service delivery can be improved by bringing in SREs to do SRE “to” a company, rather than by the company itself starting to do SRE.

 

This is another of those broad fallacies that people embracing the SRE brand for the first time don’t always initially understand. It’s basically the opposite end of the spectrum from Site Reliability Operations (Antipattern 1), in which the organization thinks that SRE is just a buzzword they can rebrand with and use to try to keep up in the recruiting competition with other companies.

At this end, people think they have an ops problem that is centered in their ops team, and if they just replace those ops engineers with SREs who have the secret sauce, they’ll get the kind of results that Google and Facebook and the like get from their production engineering. But the problem isn’t the ops team, it’s the systemic structure that makes their operations mission impossible to achieve at scale.

SRE is not about fixing ops. It isn’t just an ops methodology. Successful SRE requires fundamentally reordering how the entire company or institution conducts business: how priorities are set; how planning is conducted; how decisions are made; how systems are designed and built; and how teams interact with one another.

SRE is a particular formula for doing “DevOps” in a way that is organizationally more efficient and sustainable, and more focused on production service quality. It’s not just a job or a team. It’s a company-wide cultural shift to get away from gatekeepers, shadow operational costs, and institutionalized toil, and to create a healthy feedback system for balancing the allocation of engineering resources more effectively and with reduced conflict.

At the core of it is the commitment to appropriately scoped reliability as a core feature of any systems the organization creates or relies upon; to relentlessly winnowing out operational toil, delay, and human error through process and software engineering; and to shared responsibility for outcomes.

As with Lean and DevOps, with both of which SRE ideally shares many characteristics, it is an ongoing process that requires dedicated attention not just from the expert team responsible for coordinating the efforts, but also from all the other business units upon which the value stream depends.

Thus, absent commitment from the very top of the organization31 to driving this model, your ability to realize the benefits will be limited. You can still make substantial improvements in your team processes and systems. With enough grassroots support from other teams, you might even end up with a fairly functional DevOps model.

But if you want to reliably create virtuous cycles of productivity by prioritizing engineering over toil, even at the potential cost of an SLO shortfall now and then; to establish well-aligned priorities across the organization that won’t be repeatedly abandoned or regularly mired in the conflict du jour; and to create and safeguard the sustainable work balance and systems needed to attract and retain top talent, then the further up the organization that understanding of and buy-in to SRE principles goes, the closer you will come to realizing your goals. Do whatever work you must to make certain the entire organization subscribes to the same strategy.

So, That’s It, Then?

That’s all of them? We’ve got it all figured out?

So, what does this all mean? What do we do with this information, with these hard-won lessons in unintended consequences? Just a handful of little dead ends to watch for, and as long as we avoid these we’ll never have a problem, right?

 

Sadly, no. Because each unhappy technology or business is unhappy in its own way, and because I haven’t been around to see more than a fraction of them, we have to expect this list to keep growing and changing as the industry does.

The important thing here is the process—the most fundamentally SRE process—of continuing to look at the places we and our peers have run into trouble, and then not only creating learning opportunities for our own organizations, but turning those catalogs of failure into stories that we can share across the industry as well.

In fact, I think the time might even be ripe for an antipattern repository where peers can share, discuss, and categorize potential new patterns or variations. If you have one that we’ve left out, we definitely want to hear from you before we stub our own toes on it. So @BlakeBisset me on Twitter, #SREantipatterns. Even if we don’t find anything new, at least we might find the conversation very pleasant.

Spes non consilium est!32

Contributor Bio

Blake Bisset got his first legal tech job at 16. He did three startups (one biochem, one pharma, and this one time a bunch of kids were sitting around wondering why they couldn’t watch movies on the internet) before becoming an SRM at YouTube and Chrome, where his happiest accomplishment was holding the go/bestpostmortem link at Google for multiple years. He currently serves on the USENIX SRECon program committee and as Head of Reliability Engineering at Dropbox.

1 Perhaps not even as reliably as plants! See this article about Hagai Shemesh and Alex Kacelnik and the accompanying link to a Society for Risk Analysis article.

2 For a great reading list on risk perception, check out Bruce Schneier’s bibliography for his Psychology of Security essay.

3 As long as it’s someone else’s. Including sufficiently-long-enough-in-the-past-you, because, let’s admit it, in the immortal words of Bugs Bunny: “What a maroon!”

4 Well, you can, of course. Minus the magic part. But it isn’t going to end up any better than your existing system. This does not seem to stop people from taking this approach time and again. And then giving talks about it. All of which seem to end right at the point of the rebranding, as though that were “Mission Accomplished!” in and of itself without actually effecting measurable changes in reliability.

5 Decreased Productivity; Decreased Well Being; Increased Sick Days; 2014 New Yorker review of literature; Memory performs better if we have our own consistent space; Effects of interruption on engineer productivity.

6 Anything’s better than endless tables with no visual or auditory separation, and even cubes can be cool.

7 1. Observe the situation. 2. State the problem. 3. Hypothesize the cause/solution. 4. Test the solution.

8 Wikipedia entry on Incident Command System.

9 PagerDuty Incident Response, “Being On-Call”, and “Incident management at Google — adventures in SRE-land” by Paul Newson.

10 Assuming the human is even available to interview afterward, which in a severe accident is not always the case, the disturbing evidence of the past few decades is that many of the things we remember never actually took place: “The movie that doesn’t exist and the Redditors who think it does” by Amelia Tait and “Why Science Tells Us Not to Rely on Eyewitness Accounts” by Hal Arkowitz and Scott O. Lilienfeld.

11 Most of my theory in this area comes from Allspaw’s writings and from the cross-disciplinary sources to which they have introduced so many production engineers. For a more detailed discussion of this topic, see John Allspaw’s “The Infinite Hows” and Woods/Dekker/Cook/Johannesen/Sarter’s Behind Human Error.

12 Artificial intelligence does not count as a colleague. Yet.

13 High-value examples courtesy of my ranting partner Jonah Horowitz, who also contributed to this book.

14 Except, hopefully, compensation. :-)

15 Thank you, Randy Bias.

16 Thank you, Bernard Golden.

17 Except, of course, where they aren’t.

18 Leading to the inevitable addition “Poultry < Insects.” Or whatever slightly less bug-related mascot we eventually land on for FaaS offerings like Lambda, GCF, Azure Functions, and OpenWhisk. Protozoa perhaps?

19 Which given how much shorter the reign of each successive paradigm in the Mainframe > Commoditization > Virtualization > Containerization timeline has been—and how much less mature when production services began shifting to it—should be approximately 8:45 AM next Tuesday.

20 Error budgets have been covered extensively in O’Reilly’s first SRE handbook as well as in an inordinate number of conference talks (of which I am perhaps guilty of having given an inordinate percentage). If you don’t know about them already, find the book or one of the many videos, and prosper.

21 Forsgren, Nicole, Jez Humble, and Gene Kim. (2018). Accelerate: The Science of Lean Software and DevOps. Portland, OR: IT Revolution Press.

22 Hypothetically. Nothing like this has ever actually happened to anyone I know who wasn’t using SRE supported frameworks at any major internet company.

23 Forsgren, Nicole, Jez Humble, and Gene Kim. (2018). Accelerate: The Science of Lean Software and DevOps. Portland, OR: IT Revolution Press.

24 Thanks to Maggie Nelson and Serving Platform for being more awesome than a box of carrot cupcakes.

25 Amusingly, any sufficiently large carrot is also functionally a stick.

26 Seattle was a pioneer in cardiac response as well as cloud computing, so these dates might not match up exactly with what you know from your own history, but the gist is probably the same: https://en.wikipedia.org/wiki/History_of_cardiopulmonary_resuscitation.

27 I’ve dealt with both systems before, but it’s been so long I would never have thought to call this out to people if Mark Schwartz hadn’t made a point about it after coming back to Amazon following his stint in government.

28 See Conway’s Law.

29 Dave Rensin did a great presentation laying out all the technical principles behind this calculation at SRECon Americas in 2017. If you haven’t seen it, go check out the recording and slides.

30 Dave Rensin, “Introducing Google Customer Reliability Engineering”.

31 CEO is best, C-level is decent. At the least, the heads of your product development and production engineering teams should agree on budget, staffing, on-call, toil, change control, and epic schwag.

32 Hope is not a strategy!
