Part I. Introduction

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Principles of Chaos

If you’ve ever run a distributed system in production, you know that unpredictable events are bound to happen. Distributed systems contain so many interacting components that the number of things that can go wrong is enormous. Hard disks can fail, the network can go down, a sudden surge in customer traffic can overload a functional component—the list goes on. All too often, these events trigger outages, poor performance, and other undesirable behaviors.

We’ll never be able to prevent all possible failure modes, but we can identify many of the weaknesses in our system before they are triggered by these events. When we do, we can fix them, preventing those future outages from ever happening. We can make the system more resilient and build confidence in it.

Chaos Engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light. This empirical process of verification leads to more resilient systems, and builds confidence in the operational behavior of those systems.

Using Chaos Engineering may be as simple as manually running kill -9 on a box inside your staging environment to simulate failure of a service. Or, it can be as sophisticated as automatically designing and carrying out experiments in a production environment against a small but statistically significant fraction of live traffic.
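To make the simple end of that spectrum concrete, here is a minimal sketch in Python of the kill -9 style of experiment. It is an illustration under stated assumptions, not a tool described in this book: the "service" is a hypothetical stand-in (a child process that only sleeps), the names kill_and_observe and SERVICE_CMD are invented for this example, and the code assumes a POSIX system where SIGKILL is available.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a real service: a child process that just sleeps.
SERVICE_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def kill_and_observe(cmd):
    """Start cmd, send it SIGKILL (like `kill -9 <pid>`), and return its exit code."""
    proc = subprocess.Popen(cmd)
    try:
        # SIGKILL cannot be caught or ignored, so the process gets no chance
        # to shut down cleanly. That abrupt loss is the failure being simulated.
        proc.send_signal(signal.SIGKILL)
        return proc.wait(timeout=5)
    finally:
        # Safety net: never leave the stand-in process running.
        if proc.poll() is None:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    code = kill_and_observe(SERVICE_CMD)
    # On POSIX, a negative return code means the process died from that signal.
    print(f"service exited with return code {code}")
```

In a real experiment, the interesting part is not the kill itself but what you observe afterward: whether the rest of the system keeps serving requests while the killed service is down. This sketch stops at confirming that the process was terminated by the signal.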

The History of Chaos Engineering at Netflix

Ever since Netflix began moving out of a datacenter into the cloud in 2008, we have been practicing some form of resiliency testing in production. Only later did our take on it become known as Chaos Engineering. Chaos Monkey started the ball rolling, gaining notoriety for turning off services in the production environment. Chaos Kong transferred those benefits from the small scale to the very large. A tool called Failure Injection Testing (FIT) laid the foundation for tackling the space in between. Principles of Chaos helped formalize the discipline, and our Chaos Automation Platform is fulfilling the potential of running chaos experimentation across the microservice architecture 24/7.

As we developed these tools and experience, we realized that Chaos Engineering isn’t about causing disruptions in a service. Sure, breaking stuff is easy, but it’s not always productive. Chaos Engineering is about surfacing the chaos already inherent in a complex system. Better comprehension of systemic effects leads to better engineering in distributed systems, which improves resiliency.

This book explains the main concepts of Chaos Engineering, and how you can apply these concepts in your organization. While the tools that we have written may be specific to Netflix’s environment, we believe the principles are widely applicable to other contexts.
