6

Recovering from Production Failures

We live in an imperfect world. Bugs escape into our production environment. As we start moving to DevOps practices, we may find gaps in our understanding that affect how we deliver in our production environment. Even as we get those fixed, we may encounter other problems that are outside our control. What can we possibly do?

In this chapter, we will examine mitigating and dealing with failures that happen in production environments. We will look at the following topics:

  • The costs of errors in production environments
  • Preventing as many errors as we can
  • Practicing for failures using chaos engineering
  • Resolving incidents in production with an incident management process
  • Looking at fixing production failures by rolling back or fixing forward

Learning from failure

Production failures can happen at any time in the product development process, from the first deployment to supporting a mature product. When these production failures happen, depending on the impact, they may adversely affect the value the customer sees and potentially ruin a business’s reputation.

Often, we don't see the lessons offered by these production failures until they happen to us, or until we read about such failures happening to another organization (or even a competitor!).

We will examine a sample of such famous production failures, hoping to glean lessons through the benefit of hindsight. The examples include the following:

  • The rollout of healthcare.gov in 2013
  • The Atlassian cloud outage in 2022

Other lessons will come from other sections in this chapter.

healthcare.gov (2013)

In 2010, the Patient Protection and Affordable Care Act became law in the USA. A key part of this law, colloquially known as Obamacare, was the use of a website portal called healthcare.gov that allowed individuals to find and enroll in an affordable health insurance plan through multiple marketplaces. The portal was required to go online on October 1, 2013.

healthcare.gov was released on that date and immediately encountered problems. The initial demand at launch, 250,000 users, was five times what was expected and brought the website down within the first 2 hours. By the end of the first day, a total of six users had successfully submitted applications to enroll in a health insurance plan.

A massive troubleshooting effort ensued, eventually allowing the website to handle 35,000 concurrent users and enabling 1.2 million users to register for a health insurance plan before the enrollment period closed in December 2013.

One of the many reports examining the healthcare.gov debacle was written by Dr. Gwanhoo Lee and Justin Brumer. In the report (available at https://www.businessofgovernment.org/sites/default/files/Viewpoints%20Dr%20Gwanhoo%20Lee.pdf), they identified the challenges of such a massive undertaking, which included the following:

  • Building a complex IT system in a limited period of time
  • Policy problems that created uncertainty in their implementation
  • High-risk contracting with limited timeframes
  • A lack of leadership

Lee and Brumer also identified a series of missteps from the early stages of the design and development of the portal that would serve to doom the project. These included the following:

  • A lack of alignment between the government policy and the technical implementation of the portal
  • Inadequate requirements analysis
  • Failure to identify and mitigate risks
  • Lack of leadership
  • Inattention to bad news
  • Rigid organizational culture
  • Inattention to project management fundamentals

Fixing healthcare.gov

One of the efforts that came about after the disastrous initial launch of healthcare.gov was the Tech Surge, an influx of software developers from Silicon Valley who refactored major parts of the healthcare.gov website. The Tech Surge teams operated as small teams with a start-up mentality and were accustomed to Agile practices that brought close collaboration, DevOps tools such as New Relic, and cloud infrastructure.

One of the offshoots of the Tech Surge was a small group of coders led by Loren Yu and Kalvin Wange, known as Marketplace Lite (MPL). They started as part of the Tech Surge and worked with existing teams at the Centers for Medicare and Medicaid Services (CMS), showing them new practices, such as collaborating over chat instead of email, as they rewrote the parts of the website used to log in and register for a new plan.

MPL continued to work on healthcare.gov as the contracts of many other Tech Surge developers ran out. It continued to work alongside CMS to improve systems testing and deliver fixes incrementally, as documented in a Government Accountability Office (GAO) report at the time. The efforts began to bear serious fruit. One of the rewritten parts that MPL worked on, App 2.0, the tool to register for new health insurance, was soft-launched for call centers only but became so successful that it was made the main tool for registering new applicants with a simple medical history.

The work of MPL and the Tech Surge, and the success of subsequent rollouts of healthcare.gov in later enrollment periods, provided a proving ground for the Agile and DevOps mindset and practices. Agencies such as 18F and the United States Digital Service took up the baton and began coaching other federal agencies on applying Agile and DevOps to technology projects.

Atlassian cloud outage (2022)

On April 5, 2022, 775 of Atlassian’s more than 200,000 total customer organizations lost access to their Atlassian cloud sites, which served applications such as Jira Service Management, Confluence, Statuspage, and Opsgenie. Many of these customers remained without access for up to 14 days until service to the remaining sites was restored on April 18.

The root cause of the site outage was traced to a script used by Atlassian to delete old instances of Insight, a popular standalone add-on to Jira that was acquired by Atlassian in 2021. Insight eventually became bundled into Jira Service Management, but traces of the legacy app remained and needed to be removed.

A miscommunication occurred: the team responsible for running the script was given a list of site IDs as input instead of a list of Insight instance IDs. What followed was the immediate deletion of entire customer sites.

Atlassian's cloud architecture is composed of multi-tenant services that handle applications for more than one customer, so a blanket restoration of services would have affected customers whose sites weren't deleted. Atlassian knew how to restore a single site but had never anticipated needing to restore sites at the scale it now faced. It began restoring customer sites manually, but restoring a batch of sites took 48 hours, and a fully manual effort would have taken weeks for all the missing sites; clearly, Atlassian needed to automate.

Atlassian then designed an automated approach, figuring out a method to restore multiple sites at once. The automation ran starting on April 9 and could restore a site in 12 hours. Roughly 47% of the affected sites had been restored using automation by the time the last site was restored on April 18.

The bigger issue was communicating with the affected customers. Atlassian was first made aware of the incident through a customer support ticket, but it did not immediately know the total number of affected customers. This was because deleting the sites had also deleted metadata containing the customer information used to create support tickets. Recovering the lost customer metadata was therefore important for customer notification.

The inability of Atlassian to directly contact affected customers made a major communication problem even worse. Customers who had not been contacted by Atlassian began reaching out on social media sites such as Twitter and Reddit for news of what had happened. Atlassian posted a general tweet on April 7, and a blog article with more detailed explanations from Atlassian's CTO, Sri Viswanath, followed on April 12. After the incident was resolved, a post-incident review report was made generally available on April 29.

Lessons from the Atlassian outage

The Atlassian outage posed challenges from both a technical and a customer service perspective. The post-incident review outlined four major learning points that Atlassian must improve upon to prevent similar outages from occurring. These learnings included the following:

  • Changing the process of deleting production data to use soft deletes, where data is easier to recover and is permanently removed only after a certain period of time has elapsed.
  • Creating specific restoration processes for multiple-site, multiple-product data deletions that affect a larger set of customers.
  • Considering incident management for large-scale events. Atlassian had processes in place for incidents affecting one customer's site; it now needed to consider large-scale incidents affecting a large number of customers.
  • Improving customer communication during an incident. Atlassian only started communicating once it had a grasp of the cause and the effort needed to correct the incident. This delay in communication allowed the incident to play out on social media.

But there are broader lessons for us as well. Failures in production can happen at any point in the life cycle of a product, from the first deployment onward. They can happen to companies just starting out with Agile and DevOps or to companies such as Atlassian that have long succeeded with Agile and DevOps. For all of these companies, the keys to handling production failures are rooting out as many failures as possible before release, practicing for failure to refine the response process, and following a well-defined process when failure does occur.

To aid us in this, we turn to a growing discipline called Site Reliability Engineering (SRE). This discipline was created by Ben Treynor Sloss at Google, initially to apply software development methods to system administration. Originally seen as a hybrid approach that brought methods used in development groups to traditional system administration operations, SRE has grown into its own branch within DevOps, ensuring continued reliable systems operation after automated deployment has occurred.

The first step is planning and prevention. Let’s start looking at the safeguards used by SRE for preventing production failures.

Prevention – pulling the Andon Cord

The Andon Cord holds a special place in Lean thinking. As part of the Toyota Production System, if you suspected a problem with a car on the assembly line, you would pull the cord that ran along the assembly line, and it would stop the line. People would come to the spot where the Andon Cord was pulled to see the defect and determine, first, how to fix it and, second, what steps would be needed to prevent it from occurring in the future.

Taiichi Ohno, the creator of the Toyota Production System, used the Andon Cord to practice jidoka: empowering anyone to stop work in order to examine a problem and implement continuous improvement.

For site reliability engineers, the following ideas and principles are used as a way of implementing the Andon Cord and ensuring continuous improvement:

  • Planning for risk tolerance by looking at service-level indicators (SLIs), service-level objectives (SLOs), and error budgets
  • Enforcing release standards through release engineering
  • Collaborating on product launches with launch coordination engineering

Let’s examine these ideas in closer detail.

SLIs, SLOs, and error budgets

Many people are familiar with the concept of service-level agreements (SLAs): if the service does not meet an agreed-upon threshold for availability or responsiveness, the vendor is liable to compensate the customer for missing that level of performance, typically in the form of credits.

The goal or threshold that an SLA is expected to achieve or maintain is called an SLO. Generally, there are three parts to an SLO:

  1. The quality/component to measure
  2. The measurement periods
  3. The required threshold the quality must meet, typically written as a desired value or range of values

That quality or component to measure is known as an SLI. Common SLIs that are typically used include the following:

  • Latency
  • Throughput
  • Availability
  • Error rate
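As a simple illustration, the three parts of an SLO and its SLI can be captured together as data. The following minimal Python sketch uses field names of our own choosing rather than those of any particular monitoring tool:

from dataclasses import dataclass

@dataclass
class SLO:
    sli: str               # the quality/component to measure, for example, availability
    period_days: int       # the measurement period, for example, a rolling 30 days
    target_percent: float  # the required threshold the quality must meet

availability_slo = SLO(sli="availability", period_days=30, target_percent=99.9)
latency_slo = SLO(sli="requests served under 300 ms", period_days=30, target_percent=99.0)
print(availability_slo)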

For every SLO, the amount of time within the measurement period during which the threshold is allowed to go unmet is known as the error budget. Closely monitoring the error budget allows SREs to gauge whether the risk of rolling out a new release is acceptable. If the error budget is almost exhausted, the SRE may decide that the focus should shift from feature development to more technical work, such as enablers, that enhances resiliency and reliability.

Teams generally want to understand an error budget in terms of allowable time. The following table provides guidance on the maximum allowable downtime on a monthly and an annual basis:

SLO Percentage | Monthly Allowed Error Budget | Annual Allowed Error Budget
99% (1% margin of error) | 7 hours, 18 minutes | 87 hours, 39 minutes
99.5% (0.5% margin of error) | 3 hours, 39 minutes | 43 hours, 49 minutes, 45 seconds
99.9% (0.1% margin of error) | 43 minutes, 50 seconds | 8 hours, 45 minutes, 57 seconds
99.95% (0.05% margin of error) | 21 minutes, 54 seconds | 4 hours, 22 minutes, 48 seconds
99.99% (0.01% margin of error) | 4 minutes, 23 seconds | 52 minutes, 35 seconds

Table 6.1 – Error budgets in terms of allowable monthly and annual time
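The arithmetic behind this table is straightforward: the error budget is the portion of the measurement period not covered by the SLO percentage. Here is a minimal Python sketch, assuming a 730-hour month and an 8,766-hour year, so the results may differ from the table by a few seconds depending on the calendar assumptions used:

def error_budget_hours(slo_percent: float, period_hours: float) -> float:
    # The error budget is whatever portion of the period the SLO does not cover.
    return period_hours * (1 - slo_percent / 100)

print(error_budget_hours(99.9, 730))    # monthly budget: roughly 0.73 hours, or about 44 minutes
print(error_budget_hours(99.9, 8766))   # annual budget: roughly 8.8 hours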

The journey to implementing SLOs often begins with an evaluation of the product or service. From there, look at the components or microservices that make up the product: which parts of these, if not available, would contribute to unhappiness for the customer?

After discovering the components critical to customer happiness, choose the measurements (SLIs) to capture and set your goals (SLOs), making sure that the measurements are true indicators of potential problems and that the goals are realistic and attainable (100% of any measurement is not attainable). Start with a small set of SLOs. Communicate these SLOs to your customers so that they understand the expectations and the role SLOs will play in making a better product.

SLIs, SLOs, and error budgets should be documented as policy, but the policy is meant to change and adjust. After some time, reevaluate the SLIs, SLOs, and error budget to see whether these measurements are effective, and revise the SLIs and SLOs as needed.

Release engineering

To ensure that SLOs are maintained, site reliability engineers need to ensure that anything released to a customer is reliable and does not contribute to an outage. To that end, they work with software engineers to make sure releases are low-risk.

Google details this collaboration as release engineering. This aspect of SRE is guided by the following principles:

  • Self-service
  • High velocity
  • Hermetic builds
  • Policy/procedure enforcement

Let’s look at these four parts of the release engineering philosophy now.

Allowing release autonomy through a self-service model

For agility to prosper, the teams doing the work must be independent and self-managing. Release engineering processes allow teams to decide their own release cadence and when to actually release. This ability for teams to release when and how often they need to is aided by automation.

Aiming for high velocity

If teams choose to release more often, they do so with smaller batches of highly tested changes. More frequent releases of small changes reduce the risk of outages. This is especially helpful if you have a large error budget.

Ensuring hermetic builds

We want consistency and repeatability in our build-and-release process. The build output should be identical no matter who creates it. This means that versions of dependent artifacts and tools such as libraries and compilers are standardized from test to production.
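One lightweight way to check whether builds are hermetic is to build the same tagged revision twice, ideally on different machines, and compare the artifact digests; if dependencies and tools are truly pinned (and the toolchain does not embed timestamps), the outputs should match. The following Python sketch assumes hypothetical artifact paths:

import hashlib

def artifact_digest(path: str) -> str:
    # Return the SHA-256 digest of a build artifact.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Two builds of the same tagged revision, produced independently.
if artifact_digest("build-a/app.tar.gz") != artifact_digest("build-b/app.tar.gz"):
    raise SystemExit("Builds differ: check for unpinned dependencies or embedded timestamps")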

Of course, if problems occur in production, a useful troubleshooting tactic is known as cherry-picking, where the team starts with the last known good production version, retrieved from version control, and inserts each change one by one until the problem is discovered. Strong version control procedures ensure that builds are hermetic and allow for cherry-picking.

Having strongly enforced policies and procedures

Automated release processes that produce hermetic builds require standards of access control to ensure that builds are created on the correct build machines using the correct sources. The key is to avoid adding local edits or dependencies and only use verified code kept in version control.

The four principles we have discussed really come into play in the automation that handles the following parts of the release process:

  • Continuous integration/continuous deployment (CI/CD)
  • Configuration management

We first saw these parts as automated implementations of the CI/CD pipeline in Chapter 3, Automation for Efficiency and Quality. Now, let’s see how we tie the process into the automation.

CI/CD

The release process begins with a commit made to version control. This starts the build process with different tests automatically executed depending on the branch. Release branches run the unit tests as well as applicable system and functional tests.

When the tests pass, the build is labeled so that there is an audit trail of the build date, dependencies, target environment, and revision number.

Configuration management

The files used by the configuration management tools are kept in version control. Versions of the configuration files are recorded with release versions as part of the audit trail so that we know which version of the configuration files is associated with which versions of the release.
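As a simple illustration of this audit trail, a release record might tie the build label to the revisions of the source and configuration files it was built from. The following sketch is hypothetical; none of the field names come from a specific CI/CD or configuration management tool:

import json
from datetime import datetime, timezone

release_record = {
    "build_label": "storefront-2.4.1",           # hypothetical build label
    "build_date": datetime.now(timezone.utc).isoformat(),
    "source_revision": "9f2c1ab",                # commit the build was produced from
    "target_environment": "production",
    "config_revisions": {                        # configuration files recorded with the release
        "app-config.yaml": "4d7e90c",
        "feature-flags.yaml": "b11f2d3",
    },
}

print(json.dumps(release_record, indent=2))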

Launch coordination engineering

Launching a new product or feature to customers may have greater expectations than iterative releases of existing products. To facilitate the release of new services, Google created a special consulting function within SRE called launch coordination engineering (LCE).

The engineers in LCE perform a number of functions, all intended to ensure a smooth launch process. These functions include the following:

  • Auditing the product or service to ensure reliability
  • Coordinating between multiple teams involved in the launch
  • Ensuring completion of tasks related to technical aspects of the launch
  • Signing off that a launch is safe
  • Training developers on integration with the new service

To aid launch coordination engineers in ensuring a smooth launch, a launch checklist is created. Depending on the product, engineers tailor the checklist, adding or removing items such as the following:

  • Shared architecture and dependencies
  • Integration
  • Capacity planning
  • Possible failure modes
  • Client behavior
  • Processes/automation
  • Development process
  • External dependencies
  • Rollout planning

We have seen techniques and processes SREs use to ensure that the product launch or code release is ready. We’ve seen the tolerance for failure through SLIs, SLOs, and error budgets. But do we know whether the SREs are ready if an outage occurs?

One way of determining this is by simulating a failure and seeing the reaction. This is another tool that SREs use, called chaos engineering. Let's take a look at what's involved.

Preparation – chaos engineering

On September 20, 2015, Amazon Web Services (AWS) experienced an outage affecting more than 20 services out of its data centers in the US-EAST-1 region. The outage affected applications from major companies such as Tinder, Airbnb, and IMDb, as well as Amazon's own services, such as Alexa.

One of AWS’s customers that was able to avoid problems during the outage and remain fully operational was Netflix, the streaming service. It was able to do so because it created a series of tools that it called the Simian Army, discussed in this blog article at https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116, which simulated potential problems with AWS so that Netflix engineers could design ways to make their system more resilient.

Over several AWS outages, the Simian Army proved its worth, allowing Netflix to continue providing service. Soon, other companies such as Google started wanting to apply the same techniques. This groundswell of support led to the creation of the discipline of chaos engineering.

Let’s take a closer look at the following aspects of chaos engineering:

  • Principles
  • Experiments

Chaos engineering principles

The key to chaos engineering is experimentation in production environments. The idea of performing reliability experiments in production does seem to be laden with risk. This risk, though, is tempered by your confidence in the resiliency of the system.

To guide confidence, chaos engineering starts with the following principles:

  • Build your hypothesis around the steady-state behavior of your production environment
  • Create variables that simulate real-world events
  • Run the experiment in your production environment
  • Automate the experiment
  • Minimize the experiment’s fallout

Let’s discuss these principles in detail.

Basing experiments around steady-state behavior

In devising our experiments, we really want to focus on the system outputs rather than the individual components of the system. These outputs form the basis of how our environment behaves in a steady state. The focus in chaos engineering is on the verification of the behavior and not on the validation of individual components.

Mature organizations that look at chaos engineering as a key part of SRE know that this steady-state behavior typically forms the basis for SLOs.

Creating variables that simulate real-world events

Given the known steady-state behavior, we consider what-if scenarios that happen in the real world. Each event you consider then becomes a variable.

One of the most famous tools in Netflix's Simian Army, Chaos Monkey, was based on the event of a virtual server node in AWS becoming unavailable, and it tested for that condition only.

Running the experiment in production

Running the experiment in a staging or production-like environment is beneficial, but at some point, you need to run the experiment with its variable in the production environment to see the effects on real-world processing of real traffic.

At Netflix, Chaos Monkey was run every day in production. It would look at every cluster in operation and randomly deactivate one of the nodes.

Automating the experiment

The benefits of learning from chaos engineering experiments are only apparent when experiments are run consistently and frequently. To achieve this, automating the experiment is necessary.

Chaos Monkey was not popular with Netflix engineers when it was initially rolled out. The idea that this program would intentionally cause errors in production every day did not sit well with them, but it consistently surfaced the problem that instances could vanish. With the problem made visible, engineers had a mandate to find solutions and make the system more resilient.

Minimizing the fallout

Because you are running your experiment in the production environment, your customers who are also using that environment may be affected. Your job is to make sure the fallout from running the experiment is minimized.

Chaos Monkey was run once per day, but only during business hours. This ensured that if any ill effects on production were discovered, they would surface while most of the engineers were present, rather than during off-hours, such as 3 A.M., when there would only be a skeleton crew.
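To make these principles concrete, here is a minimal Python sketch in the spirit of Chaos Monkey: it deactivates one random node per cluster, but only during business hours. The deactivate_node function and the cluster names are stand-ins for whatever your platform's API and inventory provide, not actual Netflix or AWS calls:

import random
from datetime import datetime

def is_business_hours(now: datetime) -> bool:
    # Minimize fallout: only run while most engineers are at their desks.
    return now.weekday() < 5 and 9 <= now.hour < 17

def deactivate_node(node: str) -> None:
    # Stand-in for your infrastructure's API call (for example, stopping a VM instance).
    print(f"Deactivating {node} to test the cluster's resiliency")

clusters = {
    "checkout": ["node-1", "node-2", "node-3"],
    "search": ["node-4", "node-5"],
}

if is_business_hours(datetime.now()):
    for cluster, nodes in clusters.items():
        deactivate_node(random.choice(nodes))   # one random node per cluster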

With these principles in place, let’s apply them and look at creating experiments.

Running experiments in chaos engineering

Experimenting in production involves planning and developing a process. The goal of the experiment is to find the weak areas that could be made more resilient so that SLOs are kept. The goal is not to break the system.

In Chaos Engineering: System Resiliency in Practice, Richard Crowley writes a chapter dealing with creating the Disasterpiece Theater process for Slack. He outlines the following steps for the process:

  1. Make sure a safety net is in place.
  2. Prepare for the exercise.
  3. Run the exercise.
  4. Debrief the results of the exercise.

Let’s examine the details of each step now.

Ensuring the environment is ready for chaos

The goal of chaos engineering is to find weaknesses in resiliency, not to disable the environment. If the existing environment has no fault tolerance, there’s no point in running experiments.

Make sure there is spare capacity for services. That spare capacity should be easy to bring online.

Once the spare capacity and resources have been identified and allocated, have a plan to allow for the removal of malfunctioning resources and automatic replacement with the spare capacity.

Preparing for the exercise

For Crowley, an exercise starts with a worry: which critical component or service could fail and hurt resiliency? This worry becomes the basis for the exercise.

Crowley then takes this basis and works on expanding this to an exercise to run in development, staging, and production environments. He sets up a checklist, making sure each of the following items is fulfilled for the exercise:

  • Describe the server or service that is to fail, including the failure mode, and how to simulate the failure.
  • Identify the server or service in the development and production environments, and confirm that the failure can be simulated in the development environment.
  • Identify how the failure should be detected. Will an alert be produced that will show up on dashboards? Will logs be produced? If you cannot imagine how it will be detected, it may still be worth running the exercise to determine a way to detect the failure.
  • Identify redundancies and mitigations that should eliminate or reduce the impact of the failure. Also, identify any runbooks that are run if the failure should occur.
  • Identify the relevant people that should be invited to contribute their knowledge to the exercise. These people may also be the first responders when the exercise happens.

Preparation culminates in a meeting with the relevant people to work out the necessary logistics of the exercise. When all the preparations are set, it’s time to run the exercise.

Running the exercise

The exercise should be well publicized to all involved people before it is executed. After all, they will be participating in the exercise, with the goal of creating a more resilient environment.

Crowley executes the exercise with the following checklist:

  1. Make sure everyone is aware that the exercise may be recorded, and make a recording if everyone allows it.
  2. Review the plan created in the preparation step. Make adjustments as necessary.
  3. Announce the beginning of the exercise in the development environment.
  4. Create a failure in the development environment. Note the timestamp.
  5. See whether alerts and logs are created for the failure. Note the timestamp.
  6. If there are automated recovery steps, give them time to execute.
  7. If runbooks are being used, follow them to resolve the failure in the development environment. Note the timestamp and whether any deviations from the runbooks occurred.
  8. Have a go/no-go decision to duplicate this failure in the production environment.
  9. Announce the beginning of the exercise in the production environment.
  10. Create a failure in the production environment. Note the timestamp.
  11. See whether alerts and logs are created for the failure. Note the timestamp.
  12. If there are automated recovery steps, give them time to execute.
  13. If runbooks are being used, follow them to resolve the failure in the production environment. Note the timestamp and whether any deviations from the runbooks occurred.
  14. Announce the all-clear and conclusion of the exercise.
  15. Perform a debrief.
  16. Distribute the recording if one was made.
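The timestamps noted throughout this checklist feed directly into the debrief that follows. Here is a small sketch of the arithmetic, using made-up times:

from datetime import datetime

failure_injected = datetime(2023, 5, 2, 10, 0)   # step 10: failure created in production
alert_fired      = datetime(2023, 5, 2, 10, 7)   # step 11: alert observed on the dashboard
service_restored = datetime(2023, 5, 2, 10, 41)  # step 13: runbook completed

print("Time until detection:", alert_fired - failure_injected)      # 0:07:00
print("Time until recovery:", service_restored - failure_injected)  # 0:41:00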

With the exercise complete and the all-clear announced, a key part of the learning begins: the debrief. Let's examine how to create a learning debrief.

Debriefing for learning

Crowley recommends performing a debrief while the experience of the exercise is still fresh in everyone’s minds. During the debrief, only the facts are presented, with a summary of how well the system did (or didn’t) perform.

Crowley offers the following starter questions, intended as a template, to help surface what was learned:

  • What was the time until detection? What was the time until recovery?
  • Did the end users notice when we ran the exercise in production? How do we know? Are there solutions to make that answer no?
  • Which recovery steps could be automated?
  • Where were our blind spots?
  • What changes to our dashboards and runbooks have to be made?
  • Where do we need more practice?
  • What would our on-call engineers do if this happened unexpectedly?

The outcome of the exercise and the answers in the debrief can form recommendations for the next steps to add resiliency to the system. The exercise can be repeated to ensure that the system correctly identifies and resolves the failure.

Disasterpiece Theater can be an effective framework for performing your chaos engineering exercises. How ambitious your exercises can be depends on how resilient your system already is.

Even with regular chaos engineering exercises, bad things can still happen in your production environments that will affect your customers. Let’s look at ways to solve these production issues with incident management.

Problem-solving – enabling recovery

For SREs, a solid incident management process is important when things go wrong in production. A good incident management process allows you to meet the following necessary goals, commonly referred to as the three Cs:

  • Coordinate the response
  • Communicate between the incident participants, others in the organization, and interested parties in the outside world
  • Maintain control over the incident response

Google identified the necessary elements of its incident command system in the Managing Incidents chapter, written by Andrew Stribblehill, of Site Reliability Engineering: How Google Runs Production Systems. These elements include the following:

  • Clearly defined incident management roles
  • A (virtual or physical) command post
  • A living incident state document
  • Clear handoffs to others

Let’s look at these elements in detail.

Incident management roles

Upon recognition that what you are facing is truly an incident, a team should be assembled to work on the problem and share information until the incident is resolved. The team will have roles so that coordination is properly maintained. Let’s look at these roles in detail.

The incident commander

The incident commander may start as the person who originally reports the incident. The incident commander embodies the three Cs by delegating the necessary roles to others. Any role not delegated is assumed to belong to the incident commander.

The incident commander will work with the other responders to resolve the incident. If there are any roadblocks, the incident commander will facilitate their removal.

Operations lead

The operations lead will work together with the incident commander. They will run any needed tools to resolve the incident. The operations lead is the only person allowed to make changes to the system.

Communications lead

The communications lead is the public face of the incident and its response. They are responsible for communication with outside groups and stakeholders. They may also ensure that the incident document is kept up to date.

Incident planning/logistics

The planning/logistics role supports the operations team by handling the longer-term issues of the incident, such as arranging handoffs between roles, ordering meals, and filing tickets in the bug tracking system. They also track how the system has diverged from the norm so that it can be returned to normal once the incident is resolved.

The incident command post

A war room is needed for all members of the incident response team to convene and collaborate on a solution. It should also be the place where outside parties can meet with the incident commander and other members of the incident response team.

Because of distributed development, these command posts are typically virtual rather than physical rooms. Internet Relay Chat (IRC) channels or Slack channels can serve as the medium for gathering in one spot.

The incident state document

The incident commander’s main responsibility is to record all activity and information related to the incident in the incident state document. This is a living document, meant to be frequently updated. A wiki may suffice, but that typically allows only one person to edit it at a time.

Suitable alternatives may be a Confluence page or a document shared in a public Google Drive or Microsoft SharePoint folder.

Google maintains a template for an incident state document that can be used as a starting point.

Setting up clear handoffs

As we saw with Atlassian’s incident earlier in this chapter, incidents can stretch over several days or even weeks. So, the handoff of roles is essential, particularly for the incident commander. Communication must be clear to everyone that a handoff has taken place to minimize any confusion.

During an incident, actions that may help move toward a solution include rolling back or rolling forward. These may work if the root cause is diagnosed as a recently made change. We'll look at these alternatives in the next section.

Perseverance – rolling back or fixing forward

If the reason for the production failure is a new change, a quick resolution may involve reverting the system to its state before the change or, if a fix is found, running the fix through the CI/CD pipeline so that it can be deployed immediately.

There are several methods for rolling back or rolling forward a fix. Let's examine them in detail.

Rolling back with blue/green deployment

A blue/green deployment makes use of two production environments: one live and the other on standby. The live environment is the one customers use, while the standby environment serves as a backup. The change is made in the standby environment, which is then made live. You can see an illustration of this type of deployment here:

Figure 6.1 – Blue/green deployment: environment switch

As the preceding diagram indicates, both environments are still present, but only one has access to customer traffic. The arrangement remains this way until changes are deployed into the standby environment or a rollback becomes necessary, as illustrated in the following diagram:

Figure 6.2 – Blue/green deployment: rollback

A blue/green deployment works well when the environment is stateless—that is, there is no need to consider the state of data. Complications arise when the data’s state has to be considered in artifacts such as databases or volatile storage.
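Here is a minimal sketch of the switch-and-rollback logic, assuming a hypothetical router or load balancer that lets us repoint customer traffic between the two environments:

environments = {
    "blue": "https://blue.internal.example.com",    # hypothetical environment endpoints
    "green": "https://green.internal.example.com",
}
live = "blue"   # the environment currently receiving customer traffic

def switch_live(current: str) -> str:
    # Point customer traffic at the other environment; a stand-in for a real router update.
    target = "green" if current == "blue" else "blue"
    print(f"Routing customer traffic to {environments[target]}")
    return target

live = switch_live(live)   # deploy: green becomes live, blue becomes the standby
live = switch_live(live)   # rollback: blue becomes live again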

Rolling back with feature flags

We saw in Chapter 3, Automation for Efficiency and Quality, that feature flags allow changes to be deployed without being visible until the flag is toggled on. In the same way, if a new feature is the root cause of a production failure, the flag can be toggled off until the feature is fixed, as illustrated in the following diagram:

Figure 6.3 – Rollback with a feature flag

Rolling back by using feature flags allows for a quick change to previous behavior without extensive changes to the source code or configuration.
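Here is a minimal sketch of the idea; the flag name and in-memory storage are our own, and real deployments typically use a dedicated feature flag service or configuration store:

feature_flags = {"new_checkout_flow": True}   # toggled on when the feature was released

def checkout(items: list) -> str:
    if feature_flags.get("new_checkout_flow", False):
        return f"New checkout flow for {len(items)} items"
    return f"Previous checkout flow for {len(items)} items"

# Incident response: toggle the flag off to restore the previous behavior without redeploying.
feature_flags["new_checkout_flow"] = False
print(checkout(["book", "pen"]))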

Rolling forward using the CI/CD pipeline

Rolling forward, or fixing forward, is the method of resolving an incident by developing a fix for the error and allowing it to go through the CI/CD pipeline so that it can be deployed into production. It can be an effective way to resolve an incident, especially if the change is small.

Fixing forward should be viewed as a last resort. If fixing forward is the only viable option, your product’s architecture may be too tightly coupled with dependent components. For example, if the new release depended on a change to the database schema and customer data was already stored in the new tables before the production failure was discovered, there may be no rollback without losing the customer data.

Changes intended to be released in a roll-forward solution should undergo the same process as normal releases. Quick fixes that skip parts of the process lack the same scrutiny and test coverage and may increase technical debt by introducing errors in other parts of the system.

Summary

In this chapter, we examined what happens when things go wrong in production. We began by looking at two incidents: the initial release of healthcare.gov in 2013 and the Atlassian cloud outage in 2022. From both incidents, we learned the importance of prevention and of planning for future incidents.

We then explored methods of preparation by looking at important parts of the discipline of SRE. SRE begins this process by setting the SLIs and SLOs so that we have an idea of the tolerance of risk through the error budget. SRE also looks at the process of releasing new changes and launching new products.

We looked at practicing for disaster through the discipline of chaos engineering. We understood the principles behind the discipline and how to create experiments through the Disasterpiece Theater process.

Ultimately, even with adequate preparation, production failures will still happen. We looked at the key parts of Google's incident management process and at techniques for resolving incidents, such as rolling back or fixing forward.

With this, we have completed Part 1: Approach – A Look at SAFe® and DevOps through CALMR. We will now look at a key activity of DevOps, value stream management, in Part 2: Implement – Moving Toward Value Streams.

Questions

Test your knowledge of the concepts in this chapter by answering these questions.

  1. What is an example of an SLI?
    1. Velocity
    2. Availability
    3. Cycle time
    4. Scalability
  2. If your organization sets up an SLO of 99% availability on a monthly basis, what is your error budget if the acceptable downtime is 7.2 hours/month?
    1. 0.072 hours/month
    2. 0.72 hours/month
    3. 7.2 hours/month
    4. 72 hours/month
  3. Which is NOT a principle of release engineering?
    1. Self-service model
    2. High velocity
    3. Dependent builds
    4. Enforcement of policy/procedures
  4. Which company created Chaos Monkey and the Simian Army?
    1. Netflix
    2. Google
    3. Amazon
    4. Apple
  5. Which of these are chaos engineering principles (choose two)?
    1. Decentralized decision making
    2. Organize around value
    3. Minimize the experiment’s fallout
    4. Run the experiment in production
    5. Apply systems thinking
  6. Which role in Google’s incident command system is the primary author of the incident state document?
    1. Operations lead
    2. Incident commander
    3. Communications lead
    4. Planning/logistics

Further reading

Here are some resources for you to explore this topic further:
