Chapter 17

Preparing for Incidents

IN THIS CHAPTER

  • Minimizing the processes that lead to human error

  • Improving on-call response

  • Managing incidents when they occur

  • Measuring your success

What’s an incident or service outage? Good question! Essentially, an incident is any technical disruption of your business. Incidents come in all shapes, sizes, and severities. For example, if your business is banking, members of your financial institution might not be able to access their bank accounts online. If your business is online photo storage, a potential incident might prevent users from uploading new photos. If your business is retail, maybe users can’t make purchases because your payment processor is down.

Sometimes an incident can be rather tame. Perhaps the “Add to Cart” button is duplicating requests and adding two items to customers’ carts instead of one. Irritating, yes. But the situation isn’t dire because the customer can edit the quantity in the cart. Other times, incidents can be much more traumatic. Perhaps your sign-up form is preventing users from joining your site or your payment processor service is down. Or a database error has erased critical user information. Yikes!

In this chapter, I show you how to prepare for incidents and service outages of all kinds. I walk you through how to ensure that your processes reduce the possibility that humans will be the cause of an incident, how to better prepare your on-call team for incident response, and what to do when an incident strikes. Along the way, I share a few use cases that might help you better prepare yourself and your team for when an inevitable incident hits.

Combating “Human Error” with Automation

Human error can quickly lead you to believe that humans are the “root cause” of a failure. (Read more in Chapter 18 about why root cause is a problematic term.) Instead, if a human happened to be a trigger of failure, look at the situation this way: Judgments and decisions made by an engineer may have contributed to the disruption. Perhaps even more important, consider that the systems and processes of your engineering organization led to (or did not prevent) those judgments or decisions.

Incidents will always be a part of developing and maintaining software. People are only human. It happens. Stuff breaks. The problem with incidents isn’t that they happen; that reality is unfortunate and uncomfortable, but unavoidable. The real issue is that the same incident (or one that’s eerily similar) keeps recurring, and each recurrence is a long, drawn-out, stressful event for everyone involved, including customers.

By now you’ve likely realized that more often than not, humans are the challenge in DevOps. “Human error” is the label that people put on the common (and frequent) occurrence of human mistakes. If you’re thinking that the solution for incidents that were triggered by a human’s decision is to fire all your humans, please don’t do that. (Are you a robot overlord?) Humans, for all our flaws, are still the most capable tool for solving technical challenges. Engineers who set the figurative fires that cause incidents are also the best firefighters for solving the problems.

The most thorough answer the academic world has formed to respond to engineering mistakes is human factors, also referred to as ergonomics, which is the study of human psychology and physiology in design. This field of study applies knowledge about humans from many disciplines — psychology, sociology, user experience, engineering, industrial design — and enables people to design better products and systems, all with the main goal of reducing human error.

Technical stuff If you’re thinking, “I thought ergonomics had something to do with my chair,” you’re right! Physical ergonomics is what improves the products you use every day, from your chairs to your computer screens. What you should be concerned about in DevOps is cognitive ergonomics and organizational ergonomics:

  • Cognitive ergonomics is the study of how humans perceive and reason about their environment. How do people make decisions or react to certain stimuli? What makes one person extremely reliable and another flaky?
  • Organizational ergonomics is the study of systems and structures inside organizations. How do teams communicate and work together? What makes some teams cooperative and others competitive?

Focusing on systems: Automating realistically

Unfortunately, deploying changes to the human brain is still something you may struggle to accomplish. Instead of focusing on preventing humans from making mistakes — an impossible task — DevOps processes recommend that you turn your attention to creating and implementing automated systems along the entire development process.

Automation is the best-known way to combat human error. If you asked humans to write their name a million times in a row, they would eventually misspell their names. Their own names! (They’d also develop a repetitive stress injury.) But if you asked a robot to complete the same task, it would accomplish the job flawlessly, identically printing a name a million times, without error.

The same concept applies to your applications. If asked to repeat rote tasks, humans will make mistakes. Four areas are primed for automation (a minimal sketch of automating one such task follows the list):

  • Code: Software developers design and build solutions via code. Developers manage their source code and often work on the same portion of a codebase simultaneously.
  • Integration: Code changes must be merged from multiple developers into the master branch of a code repository.
  • Deployment: After being merged, the code must be deployed. This can often mean releasing updates, changing configurations, and even deprecating services.
  • Infrastructure: An application must be run on hardware. Depending on the updates to code, infrastructure may need to be instantiated, provisioned, or terminated.
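
Here’s a minimal sketch, in Python, of what automating one rote task in the deployment area might look like: verifying that a freshly deployed service is healthy. The URL, retry count, and wait time are hypothetical placeholders; the point is that the script performs the identical check every time, which a human repeating it by hand eventually won’t.

```python
"""A minimal sketch of automating a rote post-deploy health check.
The URL, retry count, and wait time are hypothetical placeholders."""
import sys
import time
import urllib.request

HEALTH_URL = "https://example.internal/healthz"  # hypothetical endpoint
RETRIES = 5
WAIT_SECONDS = 10


def service_is_healthy(url: str) -> bool:
    """Return True if the service answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False


def main() -> int:
    # The script runs the identical check on every attempt -- no skipped
    # steps, no typos, no "I'm pretty sure it's fine."
    for attempt in range(1, RETRIES + 1):
        if service_is_healthy(HEALTH_URL):
            print(f"Deploy verified healthy on attempt {attempt}.")
            return 0
        time.sleep(WAIT_SECONDS)
    print("Service never became healthy; roll back or page the on-call engineer.")
    return 1


if __name__ == "__main__":
    sys.exit(main())
```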

Warning The automation tools in each of these spaces experience a quick rate of churn. Don’t be surprised if your beloved solution loses favor a year or two from now. Tech will always have a “hot new technology” that everyone’s talking about, but don’t be distracted by the latest new thing. Focus instead on the best solution for you and your team regardless of how popular the tool is.

Embracing the best solution for your team is always the best answer. That said, sometimes you do find some benefits of moving with the crowd:

  • Popular tools often have the best documentation and answers on technical forums. The more people who use a project, the more likely someone is to have documented the code, built a demo, created an instructional video, or answered questions on forums like Stack Overflow, a website with nearly endless answers to technical questions. I encourage you to read the docs of any tool before you select it because the tool you choose will determine how smoothly your development goes as you move forward.
  • Popular tools are often open source software (OSS). Open source software is a broad term to describe tools that are (usually) free to use and open to community input. You can actually go into the tool’s source code, implement a change, and submit a request for the change to be approved. OSS communities are often run by a small team of volunteer engineers. OSS has many benefits, but in this case, you can actually tailor the tool to you. You can clone the current code and build a tool on top of it, or you can commit your code to the project and help others solve the same problem you’re solving. Read more about integrating with OSS in Chapter 19.

Using automation tools to avoid code integration problems

The more automated monitoring and responses you can build into your incident management, the less you’ll have to depend on human escalation and resolution.

Before you can automate any type of incident response, you must identify the key metrics that you want to monitor. Obvious choices might include availability, initial response times, uptime, traffic, and revenue. Others might also add SSL expiration, DNS resolutions, and load balancer health checks. Many of the granular metrics that your team monitors and responds to will be based on your company’s key performance indicators (KPIs).

The best things to automate are processes your engineers manually engage with regularly. Configure your monitoring tools to inject relevant information into your alerts. Status pages are fantastic tools for updating stakeholders at regular intervals. You can build slash commands into chat tools to automatically update your status page. Finally, don’t forget about automating data collection. Logging tools can help you identify what went wrong on a diagnostic level as well as what was impacted. In hindsight, you’ll be able to better understand which areas of your application and infrastructure are brittle and what action you need to take to prevent similar incidents in the future.
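
As a concrete illustration, here’s a minimal Python sketch of the kind of glue this paragraph describes: a health check that, on failure, updates a status page and pages the on-call engineer. Every URL, payload field, and component name is hypothetical; real status page and paging providers each have their own APIs, so treat this as a shape rather than a recipe.

```python
"""A minimal sketch of wiring a monitoring check to a status page and a
paging service. Every URL, payload field, and component name is
hypothetical; real providers each have their own APIs."""
import json
import urllib.request

CHECK_URL = "https://shop.example.com/healthz"                 # hypothetical
STATUS_PAGE_API = "https://status.example.com/api/incidents"   # hypothetical
PAGER_API = "https://pager.example.com/api/alerts"             # hypothetical


def post_json(url: str, payload: dict) -> None:
    """POST a JSON payload; authentication is omitted for brevity."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request, timeout=5)


def check_and_alert() -> None:
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=5) as response:
            healthy = response.status == 200
    except OSError:
        healthy = False

    if not healthy:
        # Update stakeholders and page the primary on-call in one motion,
        # so nobody has to remember either step at 3 a.m.
        post_json(STATUS_PAGE_API, {"component": "checkout", "status": "degraded"})
        post_json(PAGER_API, {"severity": "high",
                              "summary": "Checkout health check is failing"})


if __name__ == "__main__":
    check_and_alert()
```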

Following are some automation tools that you and your teams can use to mitigate incidents at every stage of development. These tools are handy, but you should never rely solely on them to solve the challenges your team faces. Tooling will never remove the need to build a culture, processes, and systems that avoid human error.

  • CircleCI: A cloud-hosted option, CircleCI supports many mainstream languages and offers up to 16x parallelization. It is container based, so pricing is based on the number of containers you use. CircleCI is one of the fastest (and most expensive) options.
  • Jenkins: Written in Java, Jenkins is open source and extremely flexible. The Jenkins plug-in list is lengthy, to say the least. The learning curve can be a bit steep but is definitely worth the time. You can control Jenkins via the console as well as a graphic user interface (GUI).
  • GoCD: Like Jenkins, GoCD has mastered pipelines to help you implement continuous delivery. Its parallelized execution eliminates build bottlenecks. GoCD is free and open source, and paid support is available.

You likely already use some kind of source code management or version control tool like Git. In fact, these tools are so ubiquitous that you probably don’t think of them as automation. But they are! Imagine if your engineers had to merge code manually. It’d be a nightmare.

Even if your team hasn’t yet adopted Git (don’t stress!), you may use something like SVN or Mercurial. Whatever the tool, it enables you to manage the work of multiple developers who are making changes to the same codebase. Such tools make it relatively easy to visualize the differences between two branches, choose the most recent changes, and merge them into one branch — usually the main branch, called trunk or master. (I said relatively; don’t curse my name the next time you have a merge conflict.)

I highly recommend adding a continuous integration (CI) tool to your toolset as well. You can consider some of the tools that follow as deployment tools. In fact, most of the tools mentioned in this book are difficult to classify into only one category because they span a number of areas. For this book, I highlight and categorize tools based on their core competency — the feature for which they are best known.

Handling deployments and infrastructure

When it comes to application deployment and configuration management, the available tools aren’t always familiar to people and often require some degree of integration into your current infrastructure and deployment processes. Examples include Ansible, Chef, Puppet, and Salt, although this list is far from exhaustive.

As infrastructure becomes exponentially more complicated, observing your systems in real time becomes ever more difficult and the importance of automation in deployments (and infrastructure) increases.

  • Ansible: Written in Python, Ansible is a Red Hat suite of DevOps-focused products that help teams deploy applications and manage complex systems. Ansible attempts to unify the teams of developers, operations, quality assurance (QA), and security as well as to simplify their repetitive tasks.
  • Chef: Bridging the gap between engineers and operations folks, Chef is a leader in the continuous automation space. Chef can manage up to 50,000 servers by turning infrastructure configurations into code.
  • Puppet: Puppet products seek to deliver real-time information about your infrastructure, automate tasks driven by models and events, and create continuous integration and continuous deployment (CI/CD) pipelines that are easy to set up. Puppet helps teams support traditional infrastructure as well as containers.

Limiting overengineering

Imagine two bakers. One produces a perfectly warm and airy loaf encrusted by a crisp exterior. Breaking it releases the irresistible, yeasty smell of fresh bread carried by just a touch of steam. The other produces a dense, dry bread encased by a rock-hard crust. Yuck. The bakers followed the same recipe and used the same ingredients. So what went wrong?

In the latter case, the baker overkneaded the dough. The overworked gluten produced a dense, unappealing product. Although both loaves might be equally nutritious, eating the second loaf would be more like gnawing on a rock than biting into bread.

Code isn’t all that different from making bread. Styles vary but most recipes require the same basic ingredients and follow one of a handful of formulas. More often than not, the simplest solution is the best. But no matter how many great ideas you come up with, you’ll have some fairly terrible ones as well. The trick is to recognize the terrible ones quickly and invest heavily in the great ideas. Discerning the difference is a learned skill.

An engineer loves few activities more than, well, engineering. Engineers love solving problems. The more complex, the better. Upon hearing about a problem, most engineers want to immediately dive into the first solution that pops into their head.

This instinct, although admirable, doesn’t always lend itself to finding the best solution — only the most obvious one. Often when you hear the term overengineering, the reference is to code that’s overworked or solutions that are unnecessarily verbose or complex.

Here are a few warning signs that a solution is overengineered:

  • The problem is more easily managed manually. Not every problem needs to be automated. Do you need to write a to-do app when pen and paper work just fine? Maybe, but probably not. Make sure that a technical solution is efficient and necessary before developing it.
  • The code is unusually verbose. If the lines of code required to solve something are double the amount needed for typical bug fixes and feature implementations, look into why.
  • The solution wasn’t peer-reviewed. All implementations should be discussed with a peer prior to development or reviewed by a peer before being merged into the rest of your source code. This prevents myopic and unnecessary code.
  • The code is difficult to understand. If a junior engineer can’t interpret what a piece of code is doing within an hour, take that as a warning sign. Code must be maintained, and all engineers need to ensure not only that their code works but also that it’s readable by their colleagues and their future selves.
  • A free or cheap tool exists that solves the problem. Spending time engineering a solution to a problem that has already been solved is foolish. Research the tools that already exist to ensure that writing code is necessary.

Before you automate something, solve the problem manually first. Even if it requires — gasp! — pen and paper or, arguably worse, a spreadsheet. Making sure that your approach works before you automate it is important. Otherwise, you end up wasting time and engineering resources on unused, ineffective solutions.

Humanizing On-Call Rotation

Being on call is akin to being available to handle emergencies. If the site goes down or your customers are impacted by a technical failure, you are the designated person to manage the issue — no matter when it happens.

Imagine that you have to rush your toddler to the ER at midnight because he decided to swallow your wedding ring. The on-call surgeon affiliated with the hospital might be paged to come in and treat your child. They are physically close to the hospital and prepared to go in when necessary. You can apply the same principle to on-call engineers in a DevOps organization.

When on-call duties become inhumane

One of the most significant cultural and organizational shifts in adopting DevOps revolves around a shared on-call responsibility. Traditionally, developers would write the code to implement a feature and pass it to the operations team to deploy and maintain. This meant that only a handful of operations engineers were on call to respond when a poorly developed piece of code failed.

Having too few people on call is one of the key problems DevOps attempts to solve. By sharing responsibility, both teams can have autonomy and mastery over their work. That shared responsibility also means that the burden of being on call is distributed over a much larger group of people, which prevents burnout.

Site reliability has become increasingly important. Many companies lose hundreds of thousands of dollars for every hour their sites are offline. Companies can build resilient systems to avoid catastrophic failure, but every company must also keep engineers on call to handle unexpected emergencies.

The typical process for responding to an incident looks something like this:

  1. Customers are impacted. Maybe your monitoring software has alerted you that the site’s taking 20 seconds to load. Maybe there’s a regional outage and European customers are yelling at you on Twitter. The types of incidents are nearly limitless, but someone’s mad.
  2. The primary person on call is alerted. Services like PagerDuty and VictorOps allow you to customize who gets alerted and how. If the primary person on call does not respond within a set amount of time, the secondary contact is paged. (A minimal sketch of this escalation logic follows the list.)
  3. An engineer attempts to fix the problem. Sometimes the issue isn’t critical enough to address in the middle of the night and can be fixed the next morning. Other times, the server room is literally flooding and someone needs to get a bucket. (Hurricane Sandy in 2012 flooded two major data centers in lower Manhattan.)
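
To make step 2 concrete, here’s a minimal Python sketch of that escalation logic: page the primary, and if nobody acknowledges within a timeout, page the secondary. The page() and acknowledged() functions are placeholders, not calls into PagerDuty or VictorOps; real services implement this policy for you.

```python
"""A minimal sketch of an on-call escalation policy. The page() and
acknowledged() functions are placeholders for whatever your paging
provider actually exposes."""
import time

ACK_TIMEOUT_SECONDS = 5 * 60
POLL_SECONDS = 30


def page(person: str, summary: str) -> None:
    print(f"Paging {person}: {summary}")   # stand-in for a real API call


def acknowledged(person: str) -> bool:
    return False                           # stand-in: query your provider


def escalate(summary: str, primary: str, secondary: str) -> str:
    """Page the primary; if no acknowledgment in time, page the secondary."""
    page(primary, summary)
    deadline = time.monotonic() + ACK_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        if acknowledged(primary):
            return primary
        time.sleep(POLL_SECONDS)
    # No acknowledgment in time: escalate rather than let the alert drop.
    page(secondary, summary)
    return secondary


if __name__ == "__main__":
    responder = escalate("The site is taking 20 seconds to load", "Ann", "Tim")
    print(f"{responder} is handling the incident.")
```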

This all sounds great. Sites stay online and responsibility is shared, right? Not usually. Unfortunately, being on call can quickly become inhumane. Traditionally, system administrators and operations engineers are the only folks who end up on call, which goes against core DevOps principles and reinforces silos. I believe strongly in shared responsibility. You build it; you support it.

Humane on-call expectations

Making a true jump to a DevOps model for creating, deploying, and supporting sites requires that on-call duties be shared by every engineer involved in a product. On-call rotation is an opportunity, not a punishment. It’s an opportunity for engineers to think differently, learn new skills, and support their team and for the organization to build better systems and processes to:

  • Document code better
  • Create runbooks (step-by-step guides of what to do) for common issues that still require manual work
  • Empower individuals to ask questions and take risks

Developers who are empowered to support their own code build better products, period. These developers begin to think about their code in terms of reliability and resiliency while they develop, rather than as an afterthought, if they think about those aspects at all.

When you’re on call, you’re expected to be available to respond to any incidents that may arise. Some folks split workdays into on-call shifts. For example, Tim is on call from 8:00 a.m. to 10:00 a.m. every morning. Others cover nights and weekends on a rotating schedule. If this approach works for you and your team, go for it!

Tip Based on my experience, I suggest something a little different. People do their best work when they have extended periods of time away from being “on,” and that means having full days without having to worry about being paged.

In 2010, LexisNexis conducted a survey of 1,700 office workers in several countries. The study found that employees spend more than half their day receiving information rather than putting that data into practice. Half the respondents said that they were approaching a mental breaking point from being overwhelmed with information. Breaks are a critical aspect of productivity and work-life balance.

Figures 17-1 and 17-2 show some example schedules. Figure 17-1 shows how two people can share daily on-call duties while keeping at least three clear days in their week. Figure 17-2 divides the duties among four people. Each is required to be on call at least one day per week but no more than three days per week. Each shade represents a different person. The columns are days of the week and the rows are weeks (four rows represent a typical month).

A schedule grid with days of the week as column heads, weeks W1 through W3 as row heads, and shifts assigned to Ann (Dev) and Tim (Ops).

FIGURE 17-1: An example of a two-person on-call schedule.

A schedule grid with days of the week as column heads, weeks W1 through W3 as row heads, and shifts assigned to Ann (Dev), Don (Dev), Mel (Ops), and Tim (Ops).

FIGURE 17-2: An example of a four-person on-call schedule.

Remember Each person is on call from 5:00 p.m. to 5:00 p.m. the next day, which is simple if you’re all in the same office. If your organization is remote-first or remote-friendly, you need to choose a single time zone for everyone to follow to ensure 24/7 coverage.

On-call rotations come in many forms. The examples I provide are intended to help you get going, not limit you. You should tailor the schedule to make it work best for your team. If you have a globally distributed team, you can adopt a follow-the-sun rotation that puts engineers on call during their normal business hours before they pass the responsibility to colleagues working normal business hours in a different time zone. Find the days, times, and frequency of on-call rotations that balance incident management with humane on-call practices.
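
Here’s a minimal Python sketch of a follow-the-sun rotation. The regions, UTC hours, and names are made up for illustration; the idea is simply that each region covers its own business hours and rotates through its engineers week by week.

```python
"""A minimal sketch of a follow-the-sun rotation. Regions, hours, and
names are made up for illustration."""
from datetime import date, timedelta

REGIONS = {
    "APAC": {"hours_utc": "00:00-08:00", "engineers": ["Mei", "Arjun"]},
    "EMEA": {"hours_utc": "08:00-16:00", "engineers": ["Lena", "Kwame"]},
    "AMER": {"hours_utc": "16:00-24:00", "engineers": ["Ann", "Tim"]},
}


def schedule_for_week(week_start: date) -> list:
    """Pick one engineer per region for the week, rotating weekly."""
    week_number = week_start.isocalendar()[1]
    lines = []
    for region, info in REGIONS.items():
        engineer = info["engineers"][week_number % len(info["engineers"])]
        lines.append(f"Week of {week_start}  {region} {info['hours_utc']} UTC: {engineer}")
    return lines


if __name__ == "__main__":
    monday = date(2024, 1, 1)
    for week in range(4):
        for line in schedule_for_week(monday + timedelta(weeks=week)):
            print(line)
```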

Managing Incidents

In my talk “This Is Not Fine: Putting Out (Code) Fires,” (https://www.youtube.com/watch?v=qL2GFB3mSs8&t=69s), I speak a lot about incident management and how it relates to another type of firefighting — the kind with actual flames. Engineers and operations pros can take a lesson from the way firefighters prioritize how they combat incidents that are way more dangerous than tech failures and apply those steps to addressing incidents. (See the sidebar “Putting out code fires” for more info on how firefighting principles can work in tech.)

The incidents you deal with in tech can sometimes be as simple as an odd user interface bug in a drop-down list, which isn’t exactly life-threatening or worthy of a hotfix at 4 o’clock in the morning. Sometimes, though, your software goes wrong in spectacularly terrible ways. For example, in 2003, a performance issue in utility software caused a blackout in the American Northeast. And in 2000, radiation therapy software in Panama failed to account for a workaround used by doctors, resulting in eight patient deaths and another 20 radiation overdoses.

These situations are vastly different from simple bugs and performance issues. Yes, a slow site loses money and causes customer disruption. But having people express anger at you on Twitter is much less stressful than having people die or watching your company go bankrupt by the minute.

Making consistency a goal

If you’ve ever flown in a private plane, you know how much pilots love checklists. Well, maybe they don’t love them, but they certainly use them. Checklists are a big part of why air travel is by far the safest way to get from point A to point B.

For pilots, these checklists are part of a preflight flow that checks switches, circuit breakers, and emergency equipment. Pilots run through this process before every flight, with no exceptions. This consistency moves the process beyond regular consciousness and into muscle memory. Pilots with even just a few years of experience don’t need to think about their preflight flow; it’s automatic.

Along the same lines, you should create an incident checklist that your team will automatically follow when it’s needed. If you’re not sure what to include, start with these actions (a minimal sketch of such a checklist in code follows the list):

  • Notify appropriate colleagues. Who needs to know depends on what’s affected, so keep up-to-date contact information for everyone on your team.
  • Deploy a status page. Inform customers which services or features are affected. Be sure to include the contact information for your support team and the time of the last update.
  • Rate the incident. Your checklist should include clearly defined severity ratings to help the first responders appropriately escalate an incident to legal or executive management.
  • Schedule a post-incident review. Post-incident reviews are a key part of reducing human error and building resilient systems. How else do people learn if not through mistakes? If possible, schedule it within 36 hours of the incident.
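
A checklist works best when it lives somewhere every responder can see it, and it can even be expressed as code. Here’s a minimal Python sketch under that assumption; the severity labels and contact address are hypothetical placeholders.

```python
"""A minimal sketch of the incident checklist expressed as data so every
responder works from the same list. Severity labels and the contact
address are hypothetical placeholders."""
from dataclasses import dataclass, field


@dataclass
class IncidentChecklist:
    severity: str                                   # e.g., "SEV-1" through "SEV-4"
    notified: list = field(default_factory=list)    # colleagues contacted so far
    status_page_updated: bool = False
    review_scheduled: bool = False

    def remaining_steps(self) -> list:
        steps = []
        if not self.notified:
            steps.append("Notify appropriate colleagues")
        if not self.status_page_updated:
            steps.append("Deploy or update the status page")
        if not self.review_scheduled:
            steps.append("Schedule the post-incident review (within ~36 hours)")
        return steps


if __name__ == "__main__":
    checklist = IncidentChecklist(severity="SEV-2")
    checklist.notified.append("tim@example.com")    # hypothetical contact
    for step in checklist.remaining_steps():
        print("TODO:", step)
```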

Adopting standardized processes

The more you standardize your emergency preparation, the more people you can rely on to step in and help fix the problem. If only one person can address a certain issue, that person becomes a single point of failure, which is absolutely unacceptable in modern tech companies.

Make the checklists and incident response protocols available to everyone on your team — even the folks who aren’t on call. Making them available to everyone ensures that the entire company is on the same page and eliminates needless questions from teams like customer support during an incident.

To fully adopt DevOps practices, developers must store the source code in a place that the ops teams can access. Also, give developers access (at least read-only) to all logs and machines. This approach enables both sides to dig into all areas of the tech — source code and infrastructure — without asking for permission. The alternative is to rely on people from other teams to be couriers of information — a time-intensive and inefficient process.

Establishing a realistic budget

The roots of many of the popular trends in tech are in large companies that adopted a certain tool or practice. For example, site reliability engineering wasn’t a well-known concept or role until Google published Site Reliability Engineering: How Google Runs Production Systems. React, a JavaScript library, took off in popularity largely because Facebook developed and promoted it.

Your company may not have the financial resources of companies like Microsoft, Google, Amazon, and others, so your incident response procedures need to be designed with a budget in mind. Monitoring every service exhaustively isn’t realistic. Instead, focus on the services that your company uses most frequently or that have the greatest impact on your customers. I strongly recommend centralized logging, with retention policies that keep recent data at full resolution and store older data at increasingly coarse granularity as time goes on. In other words, find a balance between visibility and budget in storing log data and performance metrics.
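
Here’s a minimal Python sketch of such a tiered retention policy: recent logs stay at full resolution, and older data is progressively thinned out to control storage cost. The tiers and intervals are illustrative assumptions, not recommendations.

```python
"""A minimal sketch of tiered log retention: recent data stays at full
resolution, older data is progressively thinned out. The tiers and
intervals are illustrative, not recommendations."""
from __future__ import annotations

from datetime import datetime, timedelta, timezone

# (maximum age, keep one sample per interval)
RETENTION_TIERS = [
    (timedelta(days=7),   timedelta(seconds=0)),   # keep everything
    (timedelta(days=30),  timedelta(minutes=5)),   # downsample to 5-minute points
    (timedelta(days=365), timedelta(hours=1)),     # downsample to hourly points
]


def keep_interval(timestamp: datetime) -> timedelta | None:
    """Return the sampling interval for a log entry, or None to discard it."""
    age = datetime.now(timezone.utc) - timestamp
    for max_age, interval in RETENTION_TIERS:
        if age <= max_age:
            return interval
    return None   # older than a year: drop it, or archive it to cold storage


if __name__ == "__main__":
    ninety_days_old = datetime.now(timezone.utc) - timedelta(days=90)
    print(keep_interval(ninety_days_old))   # -> 1:00:00 (hourly resolution)
```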

Making it easy to respond to incidents

Incident management protocols must be generic enough to respond to events with varying levels of urgency and importance. They should also maintain clear procedures for people to follow while they’re rubbing sleep from their eyes in the middle of the night and trying to wrap their brains around the problem. Following are a few tips that can help your engineers master incident management:

  • Make it easy and acceptable to escalate. You’re better off overreacting rather than underresponding to a situation. The primary person on call should be able to page the secondary engineer on call without retribution.
  • Use a single communication tool. When different teams within an engineering organization use multiple communication tools, absolute chaos during an incident can ensue. Engineers must be on the same page, and being able to scroll back through conversations or reach a colleague quickly via a video conferencing tool is essential. Always use the same medium to reach your coworkers. I highly recommend using a chat app like Slack or hopping on a set incident video conferencing call via a tool like Zoom.

    Tip Every method of communication comes with pros and cons. Using video calls to communicate during an incident creates a more fluid experience for the engineers on call but limits your ability to include that information in the post-incident review. Group chats, such as Slack, aid you in better capturing the timeline of an incident response but may create confusion for the engineers responding. (Messages written in haste tend to be short and lack the detail and context that you could provide verbally in a fraction of the time.) Two compromises exist: Record the video calls or have someone summarize events for the group in a written format.

  • Standardize the initial investigation. Create a step-by-step list so that any engineer can quickly begin to triage a situation. Is there a widespread AWS outage that’s causing half of the Internet to go down? If not, monitoring tools and logs will be your best bet to home in on the problem. Only if all else fails is it appropriate to allow engineers to “sniff test” the issue and follow their gut.

Technical stuff Cloud computing services like AWS and Azure host infrastructure in multiple locations around the world, organized into regions and availability zones. A region is a geographic area; AWS’s US-EAST-1 in Northern Virginia and AP-SOUTHEAST-1 in Singapore are examples of regions. Multiple availability zones exist within each region.

Remember Urgency is not the same as importance. The distinction between these two qualities comes into play when you are discussing on-call procedures. Urgency defines how rapidly something must be resolved. The site’s down? That’s pretty urgent. Customers can’t make purchases? Also urgent. A rarely used API is failing gracefully? Not urgent. Important, but not urgent.

Important incidents that lack urgency can wait until the morning when an engineer can give their best effort to fix the issue. Making this simple distinction will save your team from buggy fixes and prevent your engineers from becoming needlessly burnt out.
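
Here’s a minimal Python sketch of that distinction applied to alerts. The fields and the wake-someone-up rule are illustrative assumptions, not a standard.

```python
"""A minimal sketch of separating urgency from importance when deciding
whether an alert should wake someone up. Fields and rules are illustrative."""
from dataclasses import dataclass


@dataclass
class Alert:
    summary: str
    customer_impacting: bool   # urgent: is anyone affected right now?
    important: bool            # important: does it matter to the business?


def page_now(alert: Alert) -> bool:
    """Only urgent (customer-impacting) alerts justify a midnight page."""
    return alert.customer_impacting


if __name__ == "__main__":
    alerts = [
        Alert("Customers can't make purchases", customer_impacting=True, important=True),
        Alert("Rarely used API is failing gracefully", customer_impacting=False, important=True),
    ]
    for alert in alerts:
        action = "page on-call now" if page_now(alert) else "queue for the morning"
        print(f"{alert.summary}: {action}")
```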

Responding to an unplanned disruption

In any situation, it’s always best to assume the worst. As mentioned in the previous section, escalating a situation and treating it as a more severe incident is always better than underreacting.

Also, decisions should be made quickly during a crisis. Hierarchy is always going to be a controversial topic in tech. But, especially when responding to incidents, I recommend a strong response hierarchy with designated roles. Your team should include an incident commander (IC), a tech chief, and a communications chief.

Different resources describe different numbers and types of incident roles. You may hear terms like first responders, secondary responders, subject matter experts, and communication liaisons. I choose to focus on the three I’ve listed because they cover the most important functions of an incident response: someone to make decisions, someone to lead engineers in the technical response, and someone to record the details of the incident. Feel free to experiment with your incident-response procedures and find what works best for you and your organization.

Think of the primary person on call as the first one on the scene. They will not necessarily be the person most equipped to handle the particular issue. In fact, the person doesn’t have to be an engineer at all. The primary person on call is simply the person who triages the issue. This person is tasked with assigning a degree of urgency to the alert.

Make sure that you rotate incident teams, just as you do in your on-call rotation. Rotating teams enables people with different skills and interests on your team to become proficient — and more confident — in other areas. Every person on your team should have the opportunity to be trained and serve in each role. Figure 17-3 illustrates an incident response hierarchy. The incident commander will oversee and provide resources for the tech chief and the comm chief, including supplying them with the appropriate number of engineers to assist (represented by the small boxes beneath each chief).


FIGURE 17-3: A typical incident response hierarchy.

You can see how this hierarchy is put into action in the following steps, which outline the procedure for handling an unplanned disruption.

  1. Make an initial assessment.

    At the start of an incident response, the IC begins sizing up the situation. Be sure to categorize and prioritize the incident. Categorization doesn’t have to follow a particular pattern, but your classes of incidents should enable you to group similar incidents and evaluate trends. Prioritization centers on urgency. Is this customer-impacting? How widespread is the incident? How many engineers might be required to help fix it? The IC determines how many engineers the tech chief needs to notify.

  2. Communicate during triage.

    I suggest hopping on a video call to discuss the disruption. Zoom and other video conference tools help you communicate in real time. Although Slack and other messaging tools have become part of everyday communication, the power of face-to-face communication, especially during a crisis, is critical. Your engineers need to communicate with each other verbally while their fingers are busy logging into machines or digging into code. If you opt for a messaging tool like Slack, you’ll be able to include that transcript in the post-incident review. If you triage on a video call, be sure to designate one person to record who said what and which solutions were attempted.

    Tip A societal norm nudges the women you work with into administrative or nontechnical roles by default. You can see this in who most frequently ends up recording the conversation or serving as comm chief. Be sure to watch for this gender-biased default and counter it by ensuring that engineers who don’t identify as male also serve as incident commanders and tech chiefs.

  3. Add engineers as necessary.

    After you dig into the incident, you may realize that you need a subject matter expert who is particularly equipped to deal with the type of incident you’re experiencing. They could be deeply trained in the particular tool or technology, or they may be the engineer who implemented a specific function.

  4. Resolve the issue.

    It’s easier said than done, but the engineers responding to the incident will eventually discover the steps necessary to restore service. At that point, the comm chief can relay important information to key internal and external stakeholders, the IC can schedule a post-incident review (if they haven’t already) and the tech chief can help engineers schedule rest and recovery before the post-incident review.

Empirically Measuring Progress

More and more companies are beginning to develop a DevOps culture and implement change within their organizations, yet most don’t measure incident response. In fact, most companies don’t even know which metrics matter. Success in incident management doesn’t jump from zero to perfect, and achieving it is hard. But the best way to improve is to start gathering and analyzing metrics. This section provides some metrics for you to start observing and tracking. If you’re just getting started, now is not the time to set goals or add these measurements to personnel reviews. Instead, think of them as single points of data that together paint a broader picture of your company’s success.

Remember I want to be clear about one thing. I’ve chosen to put this information as the last part of this chapter for a particular reason: It’s the least important. The metrics in this section are simply data points that serve as the foundation of a larger organizational conversation. These are never meant to be the only measure of success. Instead, track them as a way of measuring the progress of your team as they continuously improve their incident management.

Mean time to repair (MTTR)

The mean time to repair refers to the average time your business is impacted during incidents. When collecting this metric, also include latency, the time from when the failure first occurred to when it was detected. You likely calculate latency after the incident is resolved so that you can reasonably estimate, via logs and other data, when the failure began to impact the affected service before an engineer realized it was a problem. The formula looks like this:

MTTR = total time of impact / number of incidents

People also sometimes use MTTR to describe the mean time to recovery (the amount of time your team takes to resolve an issue) or the mean time to respond (the time an organization takes to acknowledge and initiate a response to a problem). Remember that a mean is easily skewed by outliers; a single 18-hour outage like the one GitLab experienced will exaggerate your average. MTTR is just one data point.
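
Here’s a minimal Python sketch of the calculation with made-up timestamps; in practice, the impact windows come from your alerting and logging tools.

```python
"""A minimal sketch of calculating MTTR from made-up incident records."""
from datetime import datetime, timedelta

# (impact began, service restored) -- the start time is usually estimated
# from logs after the fact, as described above.
incidents = [
    (datetime(2024, 3, 1, 2, 10), datetime(2024, 3, 1, 3, 40)),    # 90 minutes
    (datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 14, 30)),   # 30 minutes
    (datetime(2024, 3, 20, 22, 5), datetime(2024, 3, 21, 1, 5)),   # 180 minutes
]

total_impact = sum((restored - began for began, restored in incidents), timedelta())
mttr = total_impact / len(incidents)
print(f"MTTR: {mttr}")   # (90 + 30 + 180) / 3 = 100 minutes, printed as 1:40:00
```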

Mean time between failures (MTBF)

In short, MTBF is the average uptime for a service between incidents. The higher an organization’s mean time between failures, the longer the service can be expected to work without interruption. Here’s the formula:

MTBF = total uptime / number of incidents
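
For example, with made-up numbers, a service that accumulated 2,000 hours of uptime over a quarter in which it suffered 4 incidents has an MTBF of 2,000 / 4 = 500 hours.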

Although MTBF can provide a helpful piece of data, many DevOps organizations are moving away from tracking MTBF because failures simply can’t be avoided. You could instead track customer-impacting incidents (rather than service failures of which the user is never aware).

Cost per incident (CPI)

The cost per incident is simply how much money your company lost because of the service interruption. This calculation has two phases. The first is how much the actual incident cost you: Were customers unable to make purchases? The second is the cost of bringing your services back online: How many engineers were required to address the issue? Here are the formulas:

  • Lost revenue (LR) = average revenue per hour * hours of impact
  • Cost to restore (CR) = number of engineers * average hourly salary * hours spent restoring service
  • CPI = LR + CR
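
Here’s a minimal Python sketch of that arithmetic with made-up numbers, just to show how quickly the total grows.

```python
"""A minimal sketch of the cost-per-incident math with made-up numbers."""
average_revenue_per_hour = 12_000   # dollars; hypothetical
hours_of_impact = 1.5
engineers_responding = 4
average_hourly_salary = 75          # dollars; hypothetical
hours_spent_restoring = 3           # includes cleanup after service returned

lost_revenue = average_revenue_per_hour * hours_of_impact                  # 18,000
cost_to_restore = (engineers_responding * average_hourly_salary
                   * hours_spent_restoring)                                # 900
cost_per_incident = lost_revenue + cost_to_restore                         # 18,900

print(f"CPI: ${cost_per_incident:,.2f}")   # CPI: $18,900.00
```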

CPI adds up fast. You can use these calculations to convince even the most stubborn executives to put resources toward preparing for incidents, paying down tech debt, testing more rigorously, and improving application security.

DevOps Research and Assessment (DORA) goes further than CPI and calculates the cost of downtime using the following formula:

Cost of downtime = deployment frequency * change failure rate * mean time to recover (MTTR) * hourly cost of outage
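
With made-up numbers, a team that deploys 40 times a month with a 5 percent change failure rate, a one-hour mean time to recover, and an outage cost of $10,000 per hour would expect downtime to cost roughly 40 * 0.05 * 1 * $10,000 = $20,000 per month.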

You can read more about calculating your cost of downtime at https://victorops.com/blog/how-much-does-downtime-cost.
