Incident response is the next level in the hierarchy we introduced in Chapter 1, Introduction. While introducing monitoring into your organization, or project, is mostly a technical exercise, incident response is almost all process and people. Incident response builds upon the world of data that we built using monitoring and starts our feedback loop, helping us to find and tighten our monitoring for our services. This shows us what is important, because if we do not get an alert and someone tells us that our service was not working, then we were not monitoring the right things.
The inverse of this is also true, although rarer—when we do not launch or deliver new features because we spend all of our time trying to reach a level of reliability that our service has trouble reaching.
In this chapter, we will define incidents and explain how to respond to them. The chapter will help you to establish processes of alerting and communicating during an incident. This will set you and your team up for success when responding to the fact that everything is broken and no one is happy.
Let us start with the basics—incidents are scary. They cause our bodies to produce adrenaline and make our hearts race. They force us to stop what we are doing and reevaluate. Commonly, when responding to incidents, we try to figure out what is going on and what is not working while in a panicked state.
An incident is when something significant happens and it requires you to change your path and normal actions. It could be a cup of coffee spilling on you and you need to change your clothes. An accident could happen during your commute to work, causing you to have to take a different route.
You could fall and break your arm and have to spend the next three months wearing a cast. All of these incidents require you to do something immediately and then, often, make longer-term changes to your plans.
In software, incidents can be similar. They tend to be a failure due to a change in the system (either in the system itself or the input the system is receiving) or something changing about the environment the system runs in. The system itself changing is usually due to the deployment of code or a change in scale that the system operates in.
The input to a web server changing could be due to many different types of changes. It could be a malicious user sending millions of corrupted packets to your server. It could be an article in The New York Times about your service that means you get a million new users. It could also just be the launch of a new feature that adds new types of connections to your servers.
The environment changing could be a database upgrade, new hardware with faulty RAM, a Linux kernel upgrade that accidentally changes the permissions of some files, or even a classic network partition. An environment change could even be your database no longer being available.
The fact that the world is made up of constant chaos means that change can come from any direction. The above are some possible examples and are just a subset of the infinite number of things that could go wrong. Incidents are often proof that we, as humans, are not omnipotent. We write the software and do not know everything, so events happen that we have not planned for, which can often cause software to act in ways that are not to our liking.
3.138.174.174