Chapter 6. Chaos Engineering from Beginning to End

In Chapter 5 you worked through an entire cycle, from discovery of system weaknesses to overcoming them using a preprepared automated chaos experiment. The Chaos Toolkit’s experiment definition format is designed for this sort of sharing and reuse (see Chapter 7), but in this chapter you’re going to build an experiment from first principles so that you can really experience the whole journey.

To make the challenge just a little more real, the weakness you’re going to explore and discover in this chapter’s target system is multilevel in nature.

In Chapter 1 I introduced the many different areas of attack on resiliency, namely:

  • People, practices, and processes

  • Applications

  • Platform

  • Infrastructure

The experiment you’re going to create and run in this chapter will look for weaknesses in both the platform and infrastructure areas, and even in the people area.

The Target System

As this experiment is going to examine weaknesses at the people level, you’ll need more than a simple technical description of the target system. You’ll still start, though, by enumerating the technical aspects of the system; then you’ll consider the people, processes, and practices that will also be explored for weaknesses as part of the whole sociotechnical system.

The Platform: A Three-Node Kubernetes Cluster

Technically, the target system is made up of a Kubernetes cluster that is once again running nothing more than a simple service.

Kubernetes provides the platform, with Kubernetes nodes providing the lower-level infrastructure resources that support and run containers and services in the cluster. In this case, the target system has a three-node cluster topology.
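If you don’t already have a cluster handy, the node names that appear later in this chapter suggest the examples were run against a Google Kubernetes Engine cluster named disruption-demo, but any three-node Kubernetes cluster will do. Purely as a sketch, with the cluster name, zone, and machine type being assumptions rather than requirements, you could provision something equivalent with:

# Hypothetical three-node GKE cluster; adjust the name, zone, and machine type,
# or use any other Kubernetes provider that gives you three worker nodes.
$ gcloud container clusters create disruption-demo \
    --num-nodes 3 \
    --zone europe-west2-a \
    --machine-type n1-standard-1

# Point kubectl at the new cluster
$ gcloud container clusters get-credentials disruption-demo --zone europe-west2-a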

Want More on Kubernetes?

A full description of all the Kubernetes concepts in play in a typical cluster is beyond the scope of this book. Check out the excellent, forthcoming Kubernetes: Up and Running 2nd Edition by Kelsey Hightower et al. (O’Reilly) for a deeper dive into the platform.

So, for the purposes of your chaos experimentation in this chapter, there is going to be a service running on one or more replicated containers across a cluster made up of three nodes. Now it’s time to describe and deploy that service to the cluster.

The Application: A Single Service, Replicated Three Times

Once again, you don’t need a supercomplicated application made up of hundreds of services to have a system that is sufficiently complicated for weaknesses to be found. The application deployed onto your target system is going to be made up of a single service, defined as follows:

import platform

import cherrypy


class Root:
    @cherrypy.expose
    def index(self) -> str:
        return "Hello world from {}".format(platform.node())


if __name__ == "__main__":
    cherrypy.config.update({
        "server.socket_host": "0.0.0.0",
        "server.socket_port": 8080
    })
    cherrypy.quickstart(Root())

This code is available in the chapter6/before directory in the sample code for this book that you grabbed earlier (see “Setting Up the Sample Target System”). Accompanying this service description is a description of the deployment itself:

{
    "apiVersion" : "apps/v1beta1",
    "kind" : "Deployment",
    "metadata" : {
      "name" : "my-service"
    },
    "spec" : {
      "replicas" : 3,
      "selector" : {
        "matchLabels" : {
          "service" : "my-service"
        }
      },
      "template" : {
        "metadata" : {
          "name" : "my-app",
          "labels" : {
            "name" : "my-app",
            "service" : "my-service",
            "biz-app-id" : "retail"
          }
        },
        "spec" : {
          "containers" : [ {
            "name" : "my-app",
            "ports" : [ {
              "name" : "http",
              "containerPort" : 8080,
              "protocol" : "TCP"
            } ],
            "imagePullPolicy" : "Always",
            "image" : "docker.io/chaosiq/sampleservice:0.1.0",
            "resources" : {
              "limits" : {
                "cpu" : 0.1,
                "memory" : "64Mi"
              },
              "requests" : {
                "cpu" : 0.1,
                "memory" : "64Mi"
              }
            }
          } ]
        }
      },
      "strategy" : {
        "type" : "RollingUpdate",
        "rollingUpdate" : {
          "maxUnavailable" : 1,
          "maxSurge" : 1
        }
      }
    }
  }

The main thing to notice is the replicas directive in the deployment.json file. This directive specifies that the team that developed this service expects it to be run with three instances to provide a minimal fallback strategy should one or more of the replicas have a problem.

When you deploy these specifications, using a command such as kubectl apply -f before/, Kubernetes will establish a cluster where my-service will be replicated across the available nodes.
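The deployment is accompanied in the sample code by a service.json specification that exposes those pods to traffic. The exact file in the repository may differ, but a minimal sketch of such a Service, selecting pods via the service label used in the deployment above and forwarding port 80 to the container’s port 8080, could look like this:

{
    "apiVersion" : "v1",
    "kind" : "Service",
    "metadata" : {
        "name" : "my-service"
    },
    "spec" : {
        "type" : "LoadBalancer",
        "ports" : [ {
            "name" : "http",
            "port" : 80,
            "targetPort" : 8080,
            "protocol" : "TCP"
        } ],
        "selector" : {
            "service" : "my-service"
        }
    }
}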

All good so far, but often there is more than one group of people involved in a system such as this. The service and deployment specifications represent the intent of the team responsible for the application itself, but there’s typically another group responsible for the cluster itself, and that can lead to complications and turbulent conditions.

The People: Application Team and Cluster Administrators

In a system like this, there is often a dividing line between those who are responsible for managing a Kubernetes cluster, including all the real costs associated with these resources (let’s call them the Cluster Administrators), and those trying to deploy and manage applications upon the cluster (the Application Team). The Cluster Administrators will be interested in tasks such as:

  • Managing nodes

  • Managing persistent disks

  • Keeping an eye on the costs if these resources are hosted on a cloud provider

  • Ensuring the cluster is not the problem, so they can go home for the weekend

The Application Team will be more focused on:

  • Ensuring their application and its constituent services and third-party dependencies are healthy

  • Ensuring they have enough redundancy and capacity across the Kubernetes pods being used to run their containers

  • Ensuring they claim the persistent storage they need for their application to maintain its state

  • Ensuring they can go home at 5:30 p.m. (other times are available)

The Application Team is responsible for making sure the deployment.json and service.json specifications ask for what they need from the platform. The Cluster Administrators will have their own dashboards and tools to ensure the underlying resources of the cluster are able to meet the needs expressed by the Application Team (see Figure 6-1).

Figure 6-1. Application Team and Cluster Administrators working on a Kubernetes cluster

Hunting for a Weakness

At this point everything looks fine. The Application Team has defined what it needs, and the Cluster Administrators have provisioned enough resources to meet those needs. But you’re a chaos engineer, so you’re going to gather some empirical evidence to back up the trust and confidence everyone would like to have in this system.

You get the teams together to think about potential weaknesses in this system that could affect the reliability of the system as a whole. During this brainstorming session, you collaboratively notice a scenario that seems common enough, but no one is sure how it will play out on the production cluster.

The scenario is related to the split in responsibilities between the Cluster Administrators and the Application Team. During a regular working day, a Cluster Administrator may execute operations that cause issues with the goals of the Application Team. The situation that could possibly cause problems is one in which a Cluster Administrator takes action to remove a node from the cluster, for routine maintenance perhaps—this common action may leave the application in a state in which it cannot get access to the resources it needs (in this case, the required number of pods).

Through no fault of their own, the individual goals of the Cluster Administrators and the Application Team could end up in conflict, which could lead to system outages.

The challenge is that this scenario could be a problem, but you simply don’t know for sure, and you can’t know for sure until you test this case against the real system. It’s time to kick off the explore phase of the Chaos Engineering Learning Loop so that you can try and discover whether there really is a weakness present.

For this automated chaos experiment, you will need to:

  • Name your experiment

  • Declare how you will know the system has not deviated from “normal” using a steady-state hypothesis

  • Declare the turbulent conditions you want to apply to attempt to surface a weakness

  • Declare any remediating actions you want to execute as rollbacks when your experiment is finished

Let’s get started!

Naming Your Experiment

Create a new experiment.json file in your chapter6 directory and start your experiment definition by considering how it should be named (I name chaos experiments according to the high-level question being explored). Then complete your title and description section with some tags:

{
    "version": "1.0.0",
    "title": "My application is resilient to admin-instigated node drainage", 1
    "description": "Can my application maintain its minimum resources?", 2
    "tags": [
        "service",
        "kubernetes" 3
    ],
1. This is the statement of belief. You believe the system will be OK under the turbulent conditions explored in this experiment.

2. The description gives you an opportunity to describe things in a little more detail, sometimes raising the doubt that’s causing the experiment in the first place.

3. Since we know we are targeting a Kubernetes-based system, it’s helpful to others who might read and use our experiment to tag it as being applicable to that platform.

Defining Your Steady-State Hypothesis

What does “normal” look like? How should the system look so that you know it’s operating within particular bounds? Everything may not be working, but the steady-state hypothesis isn’t concerned with the health of the individual parts of a system; its main concern is that the system is still able to operate within declared tolerances, regardless of the turbulent conditions being injected through your experiment’s method.

As we saw in Chapter 5, a steady-state hypothesis is made up of one or more probes and associated tolerances. Each probe will look for a property within your target system and judge whether that property’s value is within the tolerance specified.

If all the probes declared in your steady-state hypothesis are within tolerance, then your system can be recognized as not in a “deviated” state. Define your steady-state hypothesis as the next section in your experiment.json file, as shown here:

"steady-state-hypothesis": {
    "title": "Services are all available and healthy", 1
    "probes": [
        {
            "type": "probe",
            "name": "application-must-respond-normally",
            "tolerance": 200, 2
            "provider": {
                "type": "http", 3
                "url": "http://http://35.189.85.252/", 4
                "timeout": 3 5
            }
        },
        {
            "type": "probe",
            "name": "pods_in_phase",
            "tolerance": true,
            "provider": {
                "type": "python", 6
                "module": "chaosk8s.pod.probes", 7
                "func": "pods_in_phase", 8
                "arguments": { 9
                    "label_selector": "biz-app-id=retail",
                    "phase": "Running",
                    "ns": "default"
                }
            }
        }
    ]
},
1. The title of your hypothesis should describe your belief in the normal condition of the system.

2. The tolerance in this case expects the probe to return with a 200 status code.

3. The first probe uses HTTP to assess the return value of an endpoint.

4. You will need to change this value to the endpoint of the service as it is exposed from your Kubernetes cluster.

5. A reasonable timeout is a good idea to ensure your experiment doesn’t wait too long to decide that the system is normal.

6. The second probe uses a call to a Python module.

7. This is the name of the Python module to be used by this probe.

8. This is the name of the Python function to be used by this probe.

9. This is a key/value list of the arguments to be supplied to the probe’s Python function.

This steady-state hypothesis shows you how a simple HTTP call can be used as a probe, or, if more complicated processing is necessary, how a Python module’s function can be used instead. A third option that is not shown here is to use a call to a local process as a probe.
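Purely as a sketch, if you preferred to shell out to a local curl process instead of using the built-in HTTP provider, the first probe above could be expressed with a process provider. Here the tolerance of 0 is intended to match the process’s exit code, and the URL is the same placeholder that you would replace with your own service endpoint:

{
    "type": "probe",
    "name": "application-must-respond-normally",
    "tolerance": 0,
    "provider": {
        "type": "process",
        "path": "curl",
        "arguments": "-s -o /dev/null http://35.189.85.252/",
        "timeout": 3
    }
}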

The Steady-State Is Used Twice

Don’t forget: your steady state is used twice when an experiment is executed (see Figure 5-6): once at the beginning, to judge whether the target system is in a known and recognizably “normal” state in which to conduct your experiment; and once at the end, to detect whether there has been any observable deviation from what is known and recognizably “normal,” which might indicate a system weakness.

In summary, your steady-state hypothesis for this experiment probes the target system to detect (a) that it is responding with an HTTP 200 status code within three seconds on the main entry point to the system and (b) that Kubernetes is reporting that all the pods within the system are in the healthy Running phase.

Now it’s time to disrupt that with some turbulent conditions, declared in your experiment’s method.

Injecting Turbulent Conditions in an Experiment’s Method

A chaos experiment’s method defines what conditions you want to cause in the effort to surface a weakness in the target system. In this case, your target system is a Kubernetes cluster. The challenge is that, out of the box, the Chaos Toolkit supports only a few limited and basic ways of causing chaos and probing a system’s properties.

Installing the Chaos Toolkit Kubernetes Driver

By default, the Chaos Toolkit knows nothing about Kubernetes, but you can solve that by installing an extension called a driver. The Chaos Toolkit itself orchestrates your chaos experiments over a host of different drivers, which are where the real smarts are for probing and causing chaos in specific systems (see Chapter 8 for more on the different ways to extend the toolkit with your own drivers and CLI plug-ins).

Use the following pip command to install the Chaos Toolkit Kubernetes driver:

(chaostk) $ pip install chaostoolkit-kubernetes

That’s all you need to add the capability to work with Kubernetes using the Chaos Toolkit. There is a growing list of other open source and even commercially maintained drivers that are available as well, and later in this book you’ll learn how to create your own driver; for now, though, all you need is the Kubernetes driver, and then you can use the toolkit to cause chaos from within your experiment’s method.
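If you want to see what the new driver gives you before writing the experiment, the toolkit can introspect an installed extension. This step is optional, but it is a handy way to confirm the installation worked:

(chaostk) $ chaos discover chaostoolkit-kubernetes

This writes a discovery.json file listing the probes and actions the driver makes available.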

Using the Kubernetes Driver from Your Method

A Chaos Toolkit driver provides a collection of probes and actions. Probes allow you to inspect the properties of your target system, either in support of a condition in your steady-state hypothesis or simply as a way of harvesting some useful information while your experiment is running. Actions are the turbulent condition–inducing parts of the driver.

For your experiment, you are taking on the role of a business-as-usual Kubernetes Cluster Administrator who is blissfully attempting to drain a node for maintenance reasons, unaware of the impact of this action on the needs of the application the cluster is running. Specifically, you’re going to create an entry in your experiment’s method that uses the drain_nodes action from the Kubernetes driver.

First you create a method block under the steady-state-hypothesis declaration in your experiment:

"steady-state-hypothesis": {
    // Contents omitted...
},
"method": [] 1
1. Your experiment’s method may contain many actions and probes, so it is initialized with a JSON array.

Now you can add your action. Each entry in the method starts with a type that indicates whether it is an action or a probe. If it is a probe, the result will be added to the experiment’s journal; otherwise, as an action, it is simply executed and any return value is ignored:

"steady-state-hypothesis": {
    // Contents omitted...
},
"method": [ {
    "type": "action",
}]

You can now name your action. You should aim to make this name as meaningful as possible, as it will be captured in the experiment’s journal when it records that this action has taken place:

"steady-state-hypothesis": {
    // Contents omitted...
},
"method": [ {
    "type": "action",
    "name": "drain_node",
}]

Now it’s time for the interesting part. You need to tell the Chaos Toolkit to use the drain_nodes action from your newly installed Chaos Toolkit Kubernetes driver:

"steady-state-hypothesis": {
    // Contents omitted...
},
"method": [ {
    "type": "action",
    "name": "drain_node",
    "provider": { 1
        "type": "python", 2
        "module": "chaosk8s.node.actions", 3
        "func": "drain_nodes", 4
        "arguments": { 5
            "name": "gke-disruption-demo-default-pool-9fa7a856-jrvm", 6
            "delete_pods_with_local_storage": true 7
        }
    }
}]
1. The provider block captures how the action is to be executed.

2. The provider in this case is a module written in Python.

3. This is the Python module that the Chaos Toolkit is supposed to use.

4. This is the name of the actual Python function to be called by this action.

5. The drain_nodes function expects some parameters, so you can specify those here.

6. You must specify the name of a current node in your cluster here. You can select a node from the list reported by kubectl get nodes (see the example after this list).

7. Node drainage can be blocked if a node is attached to local storage; this flag tells the action to drain the node even if this is the case.
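For reference, a quick way to list just the node names you can use in that name argument is to run the following, which prints one node/<name> line per node in your cluster:

$ kubectl get nodes -o name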

And that’s it—your experiment’s method is complete, as the only action we’re concerned with that may cause a weakness to surface is a Kubernetes Cluster Administrator doing their job by draining a node. If no weakness is present in the system, the conditions of the steady-state hypothesis should be within tolerance both before and after this method has been executed.

Do All Extensions Have to Be Written in Python?

In short, no, extensions to the Chaos Toolkit do not need to be written in Python. The Chaos Toolkit was designed for extension, and so there are a few different ways you can implement your own drivers and plug-ins to integrate the toolkit into your tools and workflow.

Right about now you might be keen to run your experiment, but there is one more thing to consider. You will have caused turbulent conditions through your experiment’s method that you may want to consider reversing. That’s what an experiment’s rollbacks section is for.

Being a Good Citizen with Rollbacks

The term “rollback” has different meanings depending on your background. For databases, it can mean anything from resetting a database and its data to a prior state to manually manipulating the data as it is reverted to ensure no data is lost.

In a chaos experiment, a rollback is simply a remediating action. It’s something you know you can do to make your experiment a better citizen by re-establishing some system property that you manipulated in your experiment’s method.

During your experiment so far, you have deliberately attempted to remove a Kubernetes node from active duty, just as Cluster Administrators may do in the course of their regular work. As you know you’ve made this change, it makes sense to try to revert it so that you can at least attempt to leave the system in a state similar to the one it was in before your experiment’s method. You do this by defining a rollbacks section such as:

"rollbacks": [
    {
        "type": "action",
        "name": "uncordon_node",
        "provider": {
            "type": "python",
            "module": "chaosk8s.node.actions",
            "func": "uncordon_node", 1
            "arguments": {
                "name": "gke-disruption-demo-default-pool-9fa7a856-jrvm" 2
            }
        }
    }
]
1. The uncordon_node function from the Chaos Toolkit’s Kubernetes driver can be used to put back the node that was drained out of the cluster.

2. Make sure you change the name of the node to match the one that was drained in your experiment’s method. Once again, you can get the name of the node that you want to uncordon using kubectl get nodes.

To Roll Back or Not to Roll Back

While it’s a good idea to at least attempt to revert changes you might have made in your experiment’s method, doing so is by no means mandatory. You might be interested in observing how the turbulent conditions of an experiment affect the system over an extended period of time, in which case you might not have any remediating actions in your rollbacks section.

The same thinking applies to the idea of automatic rollbacks. We considered whether an experiment should make an attempt to automatically roll back any action by default. However, not only did this complicate the toolkit and the implementation of your drivers, but it also gave the wrong impression that a rollback is able to do its job entirely automatically; we wanted the chaos engineer to explicitly decide which actions are important and worthy of rolling back.

With your full experiment now complete, it’s time to run it so that you can explore and discover what weaknesses may be present.

Bringing It All Together and Running Your Experiment

Before you run your experiment, it’s a good idea to check that your target Kubernetes cluster is all set. You can use the kubectl apply command to set up the Application Team’s services on your cluster:

$ kubectl apply -f ./before

This command will set up a single service with three pod instances that can serve the traffic for that service.
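It’s worth confirming that all three replicas are up before going any further. You can query the deployment and its pods using the name and labels from the deployment specification:

$ kubectl get deployment my-service
$ kubectl get pods -l service=my-service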

Make Sure You Have Configured Kubectl to Talk to the Right Cluster

The Chaos Toolkit Kubernetes driver uses whatever cluster is selected for your kubectl command as the cluster that it targets for your experiment’s probes and actions. It’s good practice to check that you’re pointing at the right cluster target using kubectl before you run your experiment.
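Assuming a standard kubeconfig setup, a quick way to check which cluster you are about to target is:

$ kubectl config current-context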

Now it’s time to run your experiment that explores whether “My application is resilient to admin-instigated node drainage.” You can do this by executing the chaos run command (make sure you specify your experiment file’s name if it’s not experiment.json):

(chaostk) $ chaos run experiment.json

As the Chaos Toolkit explores your system using your experiment, you’ll see the following output:

[2019-04-25 15:12:14 INFO] Validating the experiment's syntax
[2019-04-25 15:12:14 INFO] Experiment looks valid
[2019-04-25 15:12:14 INFO] Running experiment: My application is resilient to
admin-instigated node drainage
[2019-04-25 15:12:14 INFO] Steady state hypothesis: Services are all available
and healthy
[2019-04-25 15:12:14 INFO] Probe: application-must-respond-normally
[2019-04-25 15:12:14 INFO] Probe: pods_in_phase
[2019-04-25 15:12:14 INFO] Steady state hypothesis is met!
[2019-04-25 15:12:14 INFO] Action: drain_node
[2019-04-25 15:13:55 INFO] Action: drain_node
[2019-04-25 15:14:56 INFO] Steady state hypothesis: Services are all available
and healthy
[2019-04-25 15:14:56 INFO] Probe: application-must-respond-normally
[2019-04-25 15:14:56 INFO] Probe: pods_in_phase
[2019-04-25 15:14:56 ERROR]   => failed: chaoslib.exceptions.ActivityFailed: pod
'biz-app-id=retail' is in phase 'Pending' but should be 'Running'
[2019-04-25 15:14:57 WARNING] Probe terminated unexpectedly, so its tolerance
could not be validated
[2019-04-25 15:14:57 CRITICAL] Steady state probe 'pods_in_phase' is not in the
given tolerance so failing this experiment
[2019-04-25 15:14:57 INFO] Let's rollback...
[2019-04-25 15:14:57 INFO] Rollback: uncordon_node
[2019-04-25 15:14:57 INFO] Action: uncordon_node
[2019-04-25 15:14:57 INFO] Rollback: uncordon_node
[2019-04-25 15:14:57 INFO] Action: uncordon_node
[2019-04-25 15:14:57 INFO] Experiment ended with status: deviated
[2019-04-25 15:14:57 INFO] The steady-state has deviated, a weakness may have
been discovered

Success! Your experiment has potentially found a weakness: when you, as a Cluster Administrator, drained two nodes from the Kubernetes cluster, you left the system unable to honor the need for three instances (pods) of your service. Note that in this case you’ve rolled back the changes, but you could instead choose to leave the system degraded to observe what happens next.

One way of overcoming this weakness is to simply prevent a Cluster Administrator from being able to drain a node from the cluster when it would leave the requirements of the Application Team’s services unsatisfied. You could do this by setting some rules for Cluster Administrators to follow, but there would always be the chance of someone making a mistake (humans, like all complex systems, are fallible!). Thus it might be useful to apply a policy to the system that protects it. That’s what a Kubernetes Disruption Budget is for!

Stopping a Running Automated Chaos Experiment!

At some point you might want to stop an automated chaos experiment mid-execution if things start to go seriously wrong. You can control this sort of behavior using the Control concept in the Chaos Toolkit—see Part III for more on how to do this.

Overcoming a Weakness: Applying a Disruption Budget

A Kubernetes Disruption Budget limits the number of pods of a replicated application that are down simultaneously from voluntary disruption. Your experiment relies on the voluntary eviction of pods from the nodes that it is draining, so the Application Team can safeguard their pods and stop the Cluster Administrators from draining too many nodes at once by enforcing a Disruption Budget resource in the Kubernetes cluster:

{
    "apiVersion": "policy/v1beta1",
    "kind": "PodDisruptionBudget",
    "metadata": {
        "name": "my-app-pdb"
    },
    "spec": {
        "minAvailable": 3, 1
        "selector": {
            "matchLabels": {
                "name": "my-app" 2
            }
        }
    }
}
1. Specifies the minimum number of pods that must be in the READY state at a given moment in time.

2. Selects the pods to be protected by this Disruption Budget.

Next, you can apply this Disruption Budget to your cluster by executing the following command:

$ kubectl apply -f ./after
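You can confirm that the budget is in place, and that no voluntary disruptions are currently allowed, with a quick query (the output here is illustrative; the important column is ALLOWED DISRUPTIONS, which should read 0 because all three replicas must stay available):

$ kubectl get poddisruptionbudget my-app-pdb
NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
my-app-pdb   3               N/A               0                     10s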

Now you can use your very own chaos experiment as a test to ensure that the weakness detected before is no longer present.
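Run the experiment again, exactly as before:

(chaostk) $ chaos run experiment.json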

Success (as a tester)! Your chaos-experiment-turned-test passes, as there is no deviation from the steady-state hypothesis, and thus the original weakness is not present.

Summary

In this chapter you’ve walked through the entire flow of creating an experiment, learning from it, and then using the experiment to test that the weakness that was found was overcome. But chaos experiments rarely get created and applied in a vacuum. It’s now time to see how you can collaborate with others on your automated chaos experiments.
