Chapter 3. Incident Response

Incident response is the next level in the hierarchy we introduced in Chapter 1, Introduction. While introducing monitoring into your organization, or project, is mostly a technical exercise, incident response is almost all process and people. Incident response builds upon the world of data that we built using monitoring and starts our feedback loop, helping us to find and tighten our monitoring for our services. This shows us what is important, because if we do not get an alert and someone tells us that our service was not working, then we were not monitoring the right things.

The inverse of this is also true, although rarer—when we do not launch or deliver new features because we spend all of our time trying to reach a level of reliability that our service has trouble reaching.

Incident Response

Figure 1: Our current position in the hierarchy

In this chapter, we will define incidents and explain how to respond to them. The chapter will help you to establish processes of alerting and communicating during an incident. This will set you and your team up for success when responding to the fact that everything is broken and no one is happy.

What is an incident?

Let us start with the basics—incidents are scary. They cause our bodies to produce adrenaline and make our hearts race. They force us to stop what we are doing and reevaluate. Commonly, when responding to incidents, we try to figure out what is going on and what is not working while in a panicked state.

An incident is when something significant happens and it requires you to change your path and normal actions. It could be a cup of coffee spilling on you and you need to change your clothes. An accident could happen during your commute to work, causing you to have to take a different route.

You could fall and break your arm and have to spend the next three months wearing a cast. All of these incidents require you to do something immediately and then, often, make longer-term changes to your plans.

In software, incidents can be similar. They tend to be a failure due to a change in the system (either in the system itself or the input the system is receiving) or something changing about the environment the system runs in. The system itself changing is usually due to the deployment of code or a change in scale that the system operates in.

What is an incident?

Figure 2: In this example diagram, we have a load balancer sending traffic to four identical pieces of software. The load balancer is sending equivalent traffic to all four. However, because of a bug, some requests are returning invalid results. To all automated systems this service looks healthy, but it has bugs that are causing real damage to the business by giving users undesirable results.

The input to a web server changing could be due to many different types of changes. It could be a malicious user sending millions of corrupted packets to your server. It could be an article in The New York Times about your service that means you get a million new users. It could also just be the launch of a new feature that adds new types of connections to your servers.

What is an incident?

Figure 3: In this diagram, increased traffic has caused three out of four of our pieces of software to die. The fourth one is probably about to fail. The software likely did not have enough memory or CPU to handle the requests, so stopped being able to respond to health checks and the load balancer marked them as unhealthy or dead. The dead tasks are no longer receiving traffic.

The environment changing could be a database upgrade, new hardware with faulty RAM, a Linux kernel upgrade that accidentally changes the permissions of some files, or even a classic network partition. An environment change could even be your database no longer being available.

What is an incident?

Figure 4: In our third example diagram, there is some sort of network instability, so two of the pieces of software cannot talk to the database on the right. The load balancer in this case is not including database connectivity as part of its health check, so the task is still in service. That being said, 50% of requests are probably incorrect, as the tasks cannot talk to the database.

The fact that the world is made up of constant chaos means that change can come from any direction. The above are some possible examples and are just a subset of the infinite number of things that could go wrong. Incidents are often proof that we, as humans, are not omnipotent. We write the software and do not know everything, so events happen that we have not planned for, which can often cause software to act in ways that are not to our liking.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.138.174.174