Chapter 2. Overview of Principles

In the early days of “Chaos Engineering” at Netflix, it was not obvious what the discipline actually was. There were some catchphrases about pulling out wires or breaking things or testing in production, many misconceptions about how to make services reliable, and very few examples of actual tools. The Chaos Team was formed to create a meaningful discipline, one that proactively improved reliability through tooling. We spent months researching Resilience Engineering and other disciplines in order to come up with a definition and a blueprint for how others could also participate in Chaos Engineering. That definition was put online as sort of a manifesto, referred to as The Principles. (See “Appendix A: Birth of Chaos” for the story of how Chaos Engineering came about.)

As is the case with any new concept, Chaos Engineering is sometimes misunderstood. The following sections of this chapter explore what the discipline is, and what it is not. The gold standard for the practice is captured in the section “Advanced Principles.” Finally, we take a look at what factors could change the principles going forward.

What Chaos Engineering Is

The Principles defines the discipline so that we know when we are doing Chaos Engineering, how to do it, and how to do it well. The common definition of Chaos Engineering today is the facilitation of experiments to uncover systemic weaknesses. The Principles outlines the steps of experimentation as follows:

  1. “Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
  2. Hypothesize that this steady state will continue in both the control group and the experimental group.
  3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.”1

This experimentation constitutes the basic principles of Chaos Engineering. By design, there is great latitude in how to implement these experiments.
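
To make that latitude concrete, here is one minimal sketch of the four steps in Python. Every name in it (get_metric, inject_fault, stop_fault) is a hypothetical placeholder for whatever metric source and fault-injection mechanism a team actually has, and a real platform would run the control and experimental groups concurrently rather than back to back; treat this as the shape of an experiment, not a reference implementation.

    import statistics
    import time

    def measure_steady_state(get_metric, samples=30, interval_s=1.0):
        """Step 1: capture steady state as a measurable output of the system."""
        readings = []
        for _ in range(samples):
            readings.append(get_metric())
            time.sleep(interval_s)
        return statistics.mean(readings)

    def run_experiment(get_metric, inject_fault, stop_fault, tolerance=0.05):
        """Steps 2-4: hypothesize, introduce the variable, try to disprove."""
        control = measure_steady_state(get_metric)   # baseline for the control group
        inject_fault()                               # step 3: a real-world variable
        try:
            experimental = measure_steady_state(get_metric)
        finally:
            stop_fault()                             # always clean up the injected fault
        deviation = abs(experimental - control) / control
        return deviation <= tolerance, deviation     # step 4: did the hypothesis hold?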

Experimentation vs Testing

One of the first distinctions we found necessary to make at Netflix is that Chaos Engineering is a form of experimentation, not testing. Arguably both fall under the umbrella of “Quality Assurance,” but that phrase often has negative connotations in the software industry.

Other teams at Netflix would initially ask the Chaos Team something along the lines of, “Can’t you just write a bunch of integration tests that look for the same thing?” The suggestion was pragmatic in spirit, but in practice integration tests could not produce the desired result.

Testing, strictly speaking, does not create new knowledge. Testing requires that the engineer writing the test knows specific properties about the system that they are looking for. As illustrated in the previous chapter, complex systems are opaque to that type of analysis. Humans are simply not capable of understanding all of the potential side effects from all of the potential interactions of parts in a complex system. This leads us to one of the key properties of a test.

Tests make an assertion, based on existing knowledge, and then running the test collapses the valence of that assertion, usually into either true or false. Tests are statements about known properties of the system.

Experimentation on the other hand creates new knowledge. Experiments propose a hypothesis, and as long as the hypothesis is not disproven, confidence grows in that hypothesis. If it is disproven, then we learn something new. This kicks off an inquiry to figure out why our hypothesis is wrong. In a complex system, the reason why something happens is often not obvious. Experimentation either builds confidence, or it teaches us new properties about our own system. It is an exploration of the unknown.

No amount of testing in practice can equal the insight gained from experimentation, because testing requires a human to come up with the assertions ahead of time. Experimentation formally introduces a way to discover new properties. It is entirely possible to translate newly discovered properties of a system into tests after they are discovered. It also helps to encode new assumptions about a system into new hypotheses, which creates something like a ‘regression experiment’ that explores system changes over time.
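
The contrast is easy to see in code. In the following sketch (all names hypothetical), the test collapses an assertion about a known property into pass or fail, while the experiment encodes a hypothesis whose disproof is a discovery:

    # A test: an assertion about a known property of the system.
    def test_client_timeout_is_two_seconds(service):
        assert service.timeout_seconds == 2  # collapses to true or false

    # A 'regression experiment': a hypothesis re-verified as the system changes.
    def dependency_outage_hypothesis(control_kpi, experimental_kpi, tolerance=0.05):
        """Hypothesis: steady state continues while the dependency is down."""
        deviation = abs(experimental_kpi - control_kpi) / control_kpi
        return deviation <= tolerance  # False means we just learned something new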

Because Chaos Engineering was born from complex system problems, it is essential that the discipline embody experimentation over testing. The four steps of experimentation distilled in The Principles roughly adhere to commonly accepted definitions, from the point of view of exploring availability vulnerabilities in systems at scale.

A combination of real-world experience applying the steps above to systems at scale as well as thoughtful introspection led the Chaos Team to push the practice further than just experimentation. These insights became the “Advanced Principles” which guide teams through the maturity of their Chaos Engineering programs and help set a gold standard toward which we can aspire.

Verification vs Validation

Using definitions of verification and validation inspired by operations management and logistical planning, we can say that Chaos Engineering is strongly biased toward the former over the latter.

Verification of a complex system is a process of analyzing output at a system boundary. A homeowner can verify the quality of the water (output) coming from a sink (system boundary) by testing it for contaminants without knowing anything about how plumbing or municipal water service (system parts) functions.

Validation of a complex system is a process of analyzing the parts of the system and building mental models that reflect the interaction of those parts. A homeowner can validate the quality of water by inspecting all of the pipes and infrastructure (system parts) involved in capturing, cleaning, and delivering water (mental model of functional parts) to a residential area and eventually to the house in question.

Both of these practices are potentially useful, and both build confidence in the output of the system. As software engineers we often feel a compulsion to dive into code and validate that it reflects our mental model of how it should be working. Contrary to this predilection, Chaos Engineering strongly prefers verification over validation. Chaos Engineering cares whether something works, not how.

Note that in the plumbing metaphor we could validate all of the components that go into supplying clean drinking water, and yet still end up with contaminated water for some reason we did not expect. In a complex system, there are always unpredictable interactions. But if we verify that the water is clean at the tap, then we do not necessarily have to care about how it got there. In most business cases, the output of the system is much more important than whether or not the implementation matches our mental model. Chaos Engineering cares more about the business case and output than about the implementation or mental model of interacting parts.

What Chaos Engineering Is Not

Now let’s review what Chaos Engineering is not.

Breaking Stuff

Occasionally in blog posts or conference presentations we hear Chaos Engineering described as “breaking stuff in production.” While this might sound cool, it doesn’t appeal to the enterprises running at scale and other complex system operators who stand to benefit most from the practice. A better characterization of Chaos Engineering would be: Fixing stuff in production. “Breaking stuff” is easy; the difficult parts are mitigating the blast radius, thinking critically about safety, determining whether something is worth fixing, deciding whether to invest in experimenting on it, and so on. “Breaking stuff” could be done in countless ways, with little time invested. The larger question here is: How do we reason about things that are already broken, when we don’t even know they are broken?

“Fixing stuff in production” does a much better job of capturing the value of Chaos Engineering since the point of the whole practice is to proactively improve availability and security of a complex system. Plenty of disciplines and tools already exist to reactively respond to an incident: alerting tools, incident response management, observability tools, disaster recovery planning, etc. These aim to reduce time-to-detect and time-to-resolve after the inevitable incident. An argument could be made that Site Reliability Engineering (SRE) straddles both reactive and proactive disciplines by generating knowledge from past incidents and socializing that to prevent future ones. Chaos Engineering is the only major discipline in software that focuses solely on proactively improving safety in complex systems.

Antifragility

People familiar with the concept of Antifragility, introduced by Nassim Taleb,2 often assume that Chaos Engineering is essentially the software version of the same thing. Taleb argues that words like “hormesis” are insufficient to capture the ability of complex systems to adapt, and so he invented the word “antifragile” as a way to refer to systems that get stronger when exposed to random stress. A critical distinction between Chaos Engineering and Antifragility is that Chaos Engineering educates human operators about the chaos already inherent in the system, so that they can be a more resilient team. Antifragility, by contrast, adds chaos to a system in hopes that it will grow stronger in response rather than succumb to it.

As a framework, Antifragility puts forth guidance at odds with the scholarship of Resilience Engineering, Human Factors, and Safety Systems research. For example, Antifragility proposes that the first step in improving a system’s robustness is to hunt for weaknesses and remove them. This proposal seems intuitive, but Resilience Engineering tells us that studying what goes right in safety is much more informative than studying what goes wrong. Step two in Antifragility is to add redundancy. This also seems intuitive, but adding redundancy can cause failure just as easily as it can mitigate it, and the Resilience Engineering literature is rife with examples in which redundancy actually contributes to safety failures.3

There are numerous other examples of divergence between these two schools of thought. Resilience Engineering is an ongoing area of research with decades of support, whereas Antifragility is a theory that exists largely outside of academia and peer review. It is easy to imagine how the two concepts become conflated, since both deal with chaos and complex systems, but the spirit of Antifragility does not share the empiricism and fundamental grounding of Chaos Engineering. For these reasons we should consider them to be fundamentally different pursuits.4

Advanced Principles

Chaos Engineering is grounded in empiricism, experimentation over testing, and verification over validation, but not all experimentation is equally valuable. The gold standard for the practice was first captured in the “Advanced Principles” section of The Principles. The advanced principles are:

  • Build a Hypothesis around Steady State Behavior
  • Vary Real-world Events
  • Run Experiments in Production
  • Automate Experiments to Run Continuously
  • Minimize Blast Radius

Build a Hypothesis around Steady State Behavior

Every experiment begins with a hypothesis. For availability experiments, the form of the experiment is usually:

Under ______ circumstances, customers still have a good time.

For security experiments, by contrast, the form of the experiment is usually:

Under ______ circumstances, the security team is notified.

In both cases, the blank space is filled in by the variables mentioned in the next section.

The Advanced Principles emphasize building the hypothesis around a steady state definition. This means focusing on the way the system is expected to behave, and capturing that in a measurement. In the above examples, customers presumably have a good time by default, and security usually gets notified when something violates a security control.

This focus on steady state forces engineers to step back from the code and focus on the holistic output. It captures Chaos Engineering’s bias toward verification over validation. We often have an urge to dive into a problem, find the ‘root cause’ of a behavior, and try to understand a system via reductionism. Doing a deep dive can help with exploration, but it is a distraction from the best learning that Chaos Engineering can offer. At its best, Chaos Engineering is focused on key performance indicators (KPIs) or other metrics that track with clear business priorities, and those make for the best steady state definitions.
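
As a purely hypothetical illustration, a steady-state definition anchored to a business KPI can be as simple as the following; the metric name and thresholds are invented for the example:

    STEADY_STATE = {
        "metric": "orders_per_second",  # a business KPI, not a CPU or memory stat
        "expected": 1000.0,             # the output level that indicates normal behavior
        "tolerance": 0.05,              # deviation beyond 5% disproves the hypothesis
    }

    def within_steady_state(observed, definition=STEADY_STATE):
        """True while the system behaves normally by the business's own measure."""
        deviation = abs(observed - definition["expected"]) / definition["expected"]
        return deviation <= definition["tolerance"]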

Vary Real-world Events

This Advanced Principle states that the variables in experiments should reflect real-world events. While this might seem obvious in hindsight, there are two good reasons for explicitly calling this out:

  • Variables are often chosen for what is easy to do rather than what provides the most learning value.
  • Engineers have a tendency to focus on variables that reflect their experience rather than the users’ experience.

Avoid Choosing the Easy Route

Chaos Monkey5 is actually pretty trivial for such a powerful program. It’s an open source product that randomly turns off instances (virtual machines, containers, or servers) about once a day for each service. You can use it as-is, but the same functionality can be provided by a bash script at most organizations. This is basically the low-hanging fruit of Chaos Engineering. Cloud deployment and now container deployment ensure that systems at scale will have instances spontaneously disappear on a somewhat regular basis. Chaos Monkey replicates that variable and simply accelerates the frequency of the event.
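
To illustrate how low that bar is, here is a rough Python equivalent of such a script. It assumes AWS, the boto3 client, and a tag marking instances as fair game; it does not reflect Chaos Monkey’s actual implementation.

    import random

    import boto3  # assumes AWS credentials are configured in the environment

    def terminate_one_random_instance(tag_key="chaos-eligible", region="us-east-1"):
        """Terminate one randomly chosen tagged instance, in the spirit of Chaos Monkey."""
        ec2 = boto3.client("ec2", region_name=region)
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "tag-key", "Values": [tag_key]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
        if not instances:
            return None  # nothing eligible to terminate today
        victim = random.choice(instances)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim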

This is useful and fairly easy to do if you have root-level privileges to the infrastructure so that you can make those instances disappear. Once you have root-level privileges, the temptation exists to do other things that are easy to do as root. Consider the following variables introduced on an instance:

  • Terminate an instance
  • Peg the CPU of an instance
  • Utilize all available memory on an instance
  • Fill up the disk of an instance
  • Turn off networking on the instance

These experiments all have one thing in common: They will predictably cause the instance to stop responding. From a system perspective, this looks the same as if you terminated the instance. You essentially learn nothing from the last four experiments that you have not already learned from the first. These experiments are essentially a waste of time.

From the distributed system perspective, almost all interesting availability experiments can be driven by affecting latency or response type. Terminating an instance is a special case of infinite latency. In most online systems today, response type is often synonymous with status code, like changing HTTP 200s to 500s. It follows that most availability experiments can be constructed with a mechanism to vary latency and change status codes.

Varying latency is much more difficult to do than simply pegging a CPU or filling up the RAM on an instance. It requires coordinated involvement with all relevant interprocess communication (IPC) layers. That might mean modifying sidecars, software-defined networking rules, client-side library wrappers, service meshes, or even lightweight load balancers. Any of these solutions requires a non-trivial engineering investment.
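
Wherever the mechanism lives, the knobs are the same two. The sketch below shows them in a client-side wrapper; the sampling rate, added latency, and failure response are illustrative values, not any particular mesh’s API.

    import random
    import time

    def with_chaos(call, added_latency_s=0.2, failure_rate=0.0, impacted_fraction=0.01):
        """Wrap an IPC call so a small sample of requests sees added latency
        and, optionally, a changed response type (e.g., an HTTP 500)."""
        def wrapped(*args, **kwargs):
            if random.random() < impacted_fraction:
                time.sleep(added_latency_s)    # vary latency...
                if random.random() < failure_rate:
                    return {"status": 500}     # ...or change the status code
            return call(*args, **kwargs)
        return wrapped

In practice the sampling decision also has to stay consistent across the retry path (sticky to a request or user), which is part of what makes this harder than pegging a CPU.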

Focus on the Users’ Experience

There is a lot to be said for improving the experience of developers. DevUX is an underappreciated discipline. A concerted effort to improve the experience of software engineers as they write, maintain, and deliver code to production, and roll those changes back, has a huge long-term payoff. That said, most business value from Chaos Engineering is going to come from finding vulnerabilities in the production system, not in the development process. Therefore, it makes sense to instead focus on variables that might impact the user experience.

Since software engineers are usually choosing the variables in chaos experiments and not users, this focus is sometimes lacking. An example of this misguided focus is the enthusiasm on the part of chaos engineers for introducing data corruption experiments. There are many places where data corruption experiments are warranted and highly valuable. Verifying the guarantees of databases is a clear example of this. Corruption of response payload in-transit to the client is probably not a good example.

Consider the increasingly common experiments whereby response payload is corrupted to return malformed HTML or broken JSON. This variable isn’t likely to happen in the real world, and if it does, it’s likely to happen on a per-request basis that is easily accommodated by user behavior (a retry event), by fallbacks (a different kind of retry event), or by graceful clients (like web browsers).

As engineers, we may run into contract mismatches frequently. We see that libraries interacting with our code behave in ways that we don’t want. We spend a lot of time adjusting our interaction to get the behavior we want from those libraries. Because we’ve seen a lot of behavior we didn’t want, we then assume that it’s worthwhile to build experiments that expose those cases. This assumption is false; chaos experimentation isn’t necessary in these cases. Negotiating contract mismatches is part of the development process. It is discovery for the engineer. Once the code is working, once the contract is figured out, it’s extremely unlikely that a cosmic ray or fried transistor is going to garble the library output and corrupt the data in transit. Even if that is something we want to address, because of library drift or something to that effect, it is a known property of the system. Known properties are best addressed with testing.

Chaos Engineering hasn’t been around long enough to formalize the methods used to generate variables. Some methods are obvious, like introducing latency. Some require analysis, like adding the right amount of latency to induce queuing effects without actually surpassing alert limits or an SLO. Some are highly context-sensitive, like degrading the performance of a second-tier service in such a way that it causes a different second-tier service to temporarily become a first-tier service.
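
The “right amount of latency” example can be made concrete with a little arithmetic, using invented numbers: inject only a fraction of the headroom between the currently observed tail latency and the SLO.

    def latency_to_inject_ms(slo_ms, observed_p99_ms, safety_fraction=0.5):
        """Consume part of the SLO headroom: enough to induce queueing effects,
        not enough to trip alerts or breach the SLO outright."""
        headroom = slo_ms - observed_p99_ms
        return max(0.0, headroom * safety_fraction)

    # An SLO of 300 ms with an observed p99 of 180 ms leaves 120 ms of headroom,
    # so injecting half of it adds 60 ms.
    print(latency_to_inject_ms(300, 180))  # 60.0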

As the discipline evolves, we expect these methods to either deliver formal models for generating variables, or at least have default models that capture generalized experience across the industry. In the meantime, avoid choosing the easy route, focus on the users’ experience, and vary real-world events.

Run Experiments in Production

Experimentation teaches you about the system you are studying. If you are experimenting on a Staging environment, then you are building confidence in that environment. To the extent that a Staging and Production environment differ, often in ways that a human cannot predict, you are not building confidence in the environment that you really care about. For this reason, the most advanced Chaos Engineering takes place in Production.

This principle is not without controversy. Certainly in some fields there are regulatory requirements that preclude the possibility of affecting the Production systems. In some situations there are insurmountable technical barriers to running experiments in Production. It is important to remember that the point of Chaos Engineering is to uncover the chaos inherent in complex systems, not to cause it. If we know that an experiment is going to generate an undesirable outcome, then we should not run that experiment. This is especially important guidance in a Production environment where the repercussions of a disproved hypothesis can be high.

As an advanced principle, there is no all-or-nothing value proposition to running experiments in Production. In most situations, it makes sense to start experimenting on a Staging system, and gradually move over to Production once the kinks of the tooling are worked out. In many cases, critical insights into Production are discovered first by Chaos Engineering on Staging.

Automate Experiments to Run Continuously

This principle recognizes a practical implication of working on complex systems. Automation has to be brought in for two reasons:

To cover a larger set of experiments than humans can cover manually. In complex systems, the conditions that could possibly contribute to an incident are so numerous that they can’t be planned for. In fact, they can’t even be counted, because they are unknowable in advance. This means that humans can’t reliably search the solution space of possible contributing factors in a reasonable amount of time. Automation provides a means to scale out the search for vulnerabilities that could contribute to undesirable systemic outcomes.

To empirically verify our assumptions over time, as unknown parts of the system are changed. Imagine a system where the functionality of a given component relies on other components outside of the scope of the primary operators. This is the case in almost all complex systems. Without tight coupling between the given functionality and all the dependencies, it is entirely possible that one of the dependencies will change in such a way that it creates a vulnerability. Continuous experimentation provided by automation can catch these issues and teach the primary operators how the operation of their own system is changing over time. This could be a change in performance (for example, the network is becoming saturated by noisy neighbors), a change in functionality (the response bodies of downstream services include extra information that could impact how they are parsed), or a change in human expectations (the original engineers leave the team, and the new operators are not as familiar with the code).
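
In sketch form, “continuously” can be as plain as a scheduler that re-runs every hypothesis and records the outcome; the experiment harness and result sink here are hypothetical:

    import time

    def run_continuously(experiments, interval_s=24 * 3600, record=print):
        """Re-verify each hypothesis on a schedule, because the system underneath
        (dependencies, traffic, even the team) changes whether we look or not."""
        while True:
            for name, experiment in experiments.items():
                held, deviation = experiment()  # returns (hypothesis_held, deviation)
                record({"experiment": name, "held": held, "deviation": deviation})
            time.sleep(interval_s)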

Automation itself can have unintended consequences. The chapters “People in the Loop” and “Experiment Selection Problem (and a Solution)” in Part Three of this book explore some of the pros and cons of automation. The Advanced Principles maintain that automation is an advanced mechanism to explore the solution space of potential vulnerabilities, and to reify institutional knowledge about vulnerabilities by verifying hypotheses over time, knowing that complex systems will change.

Minimize Blast Radius

The final advanced principle, “Minimize Blast Radius,” was added to The Principles after the Chaos Team at Netflix found that they could significantly reduce the risk to Production traffic by engineering safer ways to run experiments. By using a tightly orchestrated control group to compare with a variable group, experiments can be constructed in such a way that the impact of a disproved hypothesis on customer traffic in Production is minimal.

How a team goes about achieving this is highly context-sensitive to the complex system at hand. In some systems it may mean using shadow traffic; or excluding requests that have high business impact, like transactions over $100; or implementing automated retry logic for requests in the experiment that fail. In the case of the Chaos Team’s work at Netflix, sampling of requests, sticky sessions, and similar functions were added into the Chaos Automation Platform6 (ChAP), which is discussed more in the chapter “Continuous Verification” in Part Five of this book. These techniques not only limited the blast radius; they had the added benefit of strengthening signal detection, since the metrics of a small variable group can often stand out starkly in contrast to a small control group. However it is achieved, this advanced principle emphasizes that in truly sophisticated implementations of Chaos Engineering, the potential impact of an experiment can be limited by design.
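
One common shape for this, sketched below with invented percentages, is to hash each user deterministically into a tiny matched pair of cohorts and leave everyone else untouched. The hashing makes assignment “sticky” per user, and requests with high business impact can be filtered out before assignment ever happens.

    import hashlib

    def assign_cohort(user_id, control_pct=0.5, variable_pct=0.5):
        """Place roughly 1% of users into matched control/variable cohorts;
        the other ~99% never see the experiment, limiting the blast radius."""
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10000
        if bucket < control_pct * 100:                   # buckets 0-49: 0.5% control
            return "control"
        if bucket < (control_pct + variable_pct) * 100:  # buckets 50-99: 0.5% variable
            return "variable"
        return "untouched"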

All of these advanced principles are presented to guide and inspire, not to dictate. They are born of pragmatism and should be adopted (or not) with that aim in mind.

The Future of The Principles

In the five years since The Principles was published, we have seen Chaos Engineering evolve to meet new challenges in new industries. The principles and foundation of the practice are sure to continue to evolve as adoption expands through the software industry and into new verticals.

When Netflix first started evangelizing Chaos Engineering in earnest at Chaos Community Day in 2015,7 they received a lot of pushback from financial institutions in particular. The common concern was, “Sure, maybe this works for an entertainment service or online advertising, but we have real money on the line.” To which the Chaos Team responded, “Do you have outages?”

Of course, the answer is yes; even the best engineering teams suffer outages at high-stakes financial institutions. This left two options, according to the Chaos Team: either (1) continue having outages at some unpredictable rate and severity, or (2) adopt a proactive strategy like Chaos Engineering to understand risks in order to prevent large, uncontrolled outcomes. Financial institutions agreed, and many of the world’s largest banks now have dedicated Chaos Engineering programs.

The next industry to voice concerns with the concept was healthcare. The concern was expressed as, “Sure, maybe this works for online entertainment or financial services, but we have human lives on the line.” Again, the Chaos Team responded, “Do you have outages?”

But in this case, even more direct appeal can be made to the basis of healthcare as a system. When empirical experimentation was chosen as the basis of Chaos Engineering, it was a direct appeal to Karl Popper’s concept of falsifiability8, which provides the foundation for Western notions of science and the scientific method. The pinnacle of Popperian notions in practice is the clinical trial.

In this sense, the phenomenal success of the Western healthcare system is built on Chaos Engineering. Modern medicine depends on double-blind experiments with human lives on the line. They just call it by a different name: the clinical trial.

Forms of Chaos Engineering have implicitly existed in many other industries for a long time. Bringing experimentation to the forefront, particularly in the software practices within other industries, gives power to the practice. Calling these practices out and explicitly naming them Chaos Engineering allows us to strategize about the discipline’s purpose and application, and to take lessons learned from other fields and apply them to our own.

In that spirit, we can explore Chaos Engineering in industries and companies that look very different from the prototypical microservice-at-scale examples commonly associated with the practice. FinTech, Autonomous Vehicles (AV), and Adversarial Machine Learning can teach us about the potential and the limitations of Chaos Engineering. Mechanical Engineering and Aviation expand our understanding even further, taking us outside the realm of software into hardware and physical prototyping. Chaos Engineering has even expanded beyond Availability into Security, which is the other side of the coin from a system safety perspective. All of these new industries, use cases, and environments will continue to evolve the foundation and principles of Chaos Engineering.

1 https://principlesofchaos.org/

2 “Antifragile: Things that Gain from Disorder,” by Nassim Taleb, ISBN 9780812979688

3 Perhaps the most famous example is the Challenger explosion in 1986. The redundancy of o-rings was one of three reasons that NASA approved the continuation of the launches, even though damage to the primary o-ring was well known internally for over fifty of the prior launch missions over the span of five years. See Diane Vaughan’s book The Challenger Launch Decision.

4 https://www.linkedin.com/pulse/antifragility-fragile-concept-casey-rosenthal/

5 Chaos Monkey was the germinal Chaos Engineering tool. See Appendix A for more history on how Chaos Engineering came about.

6 https://medium.com/netflix-techblog/chap-chaos-automation-platform-53e6d528371f

7 See the Appendix chapter “Birth of Chaos” for more about Chaos Community Day and early evangelism.

8 https://en.wikipedia.org/wiki/Falsifiability
