Chapter 40. Test Your Infrastructure with Game Days

Fernando Duran

“You don’t have a backup until you have performed a restore” is a good aphorism; in the same spirit, your service or infrastructure is not fully resilient until you have tried breaking it and recovering from the damage.

A game day is a planned rehearsal exercise in which a team tries to recover from an incident. It tests your readiness and reliability in the face of an emergency in a production environment.

The motivation is for the teams and the code to be ready when incidents occur; therefore, you want the test incident to resemble a real-life incident. When you run these experiments in production environments and in an automated way, this is called chaos engineering.

There are several items to consider when preparing for a game day. Most importantly, you need to decide whether you are running the exercise in production. Production is the ideal, since no staging or test environment is ever really the same as production; on the other hand, you have to comply with your SLAs, obtain approval, and warn customers if needed. If you have never done a game day, or if the target system has never been tested for disruption, start with a test environment.

Another decision is whether you want the procedure to be planned and triggered by an adversarial red team—in this case, one team or person very familiar with the system will create the failure without warning.

You’ll want to run these game days periodically (every four or six months, for example), as well as after a new service or new infrastructure has been added. The cadence may also depend on how recent responses to real incidents have gone. An incident exercise should run for a few hours at most; you don’t want long-lived lingering effects.

Different types of failure can be introduced at different layers:

  • Server resources (such as high CPU and memory usage)

  • Application (for example, processes being killed)

  • Network (unreliable networking or network traffic degradation: adding latency, packet loss, blocked communication, DNS failures)

Gray failures (degradation of service) are often worse than complete crashes, since crashes have a short feedback loop: they are detected quickly, while a partial degradation can go unnoticed for much longer. Degradation can also be harder to produce in an exercise than an outright crash.
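
For illustration, here is a minimal fault-injection sketch covering the resource and network layers. It assumes a Linux host with root access and tc/netem installed; the interface name, delays, and durations are placeholder values, not recommendations. Adding latency with netem is one simple way to produce degradation rather than an outright crash.

    # Minimal game-day fault injectors (illustrative sketch; assumes a
    # Linux host, root privileges, and tc/netem; "eth0" is a placeholder).
    import subprocess
    import time

    def burn_cpu(seconds: int, workers: int = 2) -> None:
        """Resource-layer failure: saturate CPU cores with busy loops."""
        procs = [subprocess.Popen(["python3", "-c", "while True: pass"])
                 for _ in range(workers)]
        try:
            time.sleep(seconds)
        finally:
            for p in procs:
                p.kill()  # clean up so the exercise leaves no lingering effects

    def add_latency(interface: str, delay_ms: int) -> None:
        """Network-layer degradation: add latency with tc/netem."""
        subprocess.run(["tc", "qdisc", "add", "dev", interface,
                        "root", "netem", "delay", f"{delay_ms}ms"], check=True)

    def clear_latency(interface: str) -> None:
        """Remove the netem rule once the exercise window closes."""
        subprocess.run(["tc", "qdisc", "del", "dev", interface,
                        "root", "netem"], check=True)

    if __name__ == "__main__":
        burn_cpu(seconds=30)        # server-resource failure
        add_latency("eth0", 200)    # gray failure: slow, not down
        try:
            time.sleep(120)         # keep the disruption time-bounded
        finally:
            clear_latency("eth0")

Purpose-built tools such as stress-ng (resource pressure) or Chaos Monkey (instance termination) cover similar ground with more safeguards, but even a small script like this is enough to start exercising the resource and network layers.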

Before running a game day, you need to determine the following (a sketch of a plan record that captures these decisions follows the list):

  • The failure scenario or scenarios

  • The scope of systems affected and what can go wrong (having a contained “blast radius”)

  • The condition of victory, or acceptance criteria for the system to be considered “fixed”

  • The time window for recovery (estimate the duration and pad it by a factor of two or three, just in case)

  • The date and time

  • Whether you will give advance warning

  • The team or people on call at the time who will handle the incident
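
One lightweight way to capture these decisions ahead of time is a small, reviewable plan record. The sketch below is only an assumption about what such a record might look like; the field names and example values are invented for illustration.

    # Hypothetical game-day plan record; all field names and values are
    # illustrative, not a prescribed format.
    from dataclasses import dataclass, field
    from datetime import datetime, timedelta

    @dataclass
    class GameDayPlan:
        scenario: str                   # the failure to introduce
        blast_radius: list[str]         # systems that may be affected
        victory_condition: str          # acceptance criteria for "fixed"
        estimated_recovery: timedelta   # initial estimate of recovery time
        scheduled_at: datetime          # date and time of the exercise
        warn_beforehand: bool           # are we announcing it in advance?
        on_call: list[str] = field(default_factory=list)  # responders

        @property
        def time_window(self) -> timedelta:
            # Pad the estimate by a factor of three, per the checklist above.
            return self.estimated_recovery * 3

    plan = GameDayPlan(
        scenario="Add 200 ms latency between the app tier and the primary database",
        blast_radius=["checkout-service", "orders-db"],
        victory_condition="p99 checkout latency below 500 ms for 15 minutes",
        estimated_recovery=timedelta(minutes=30),
        scheduled_at=datetime(2024, 5, 14, 10, 0),
        warn_beforehand=True,
        on_call=["alice", "bob"],
    )

Writing the plan down this way makes the blast radius and the victory condition explicit, and gives the postmortem something concrete to compare against.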

You also need to prepare the response team(s) and decide how they are going to work. A common approach to incident management is to have one person focused on solving the problem, surrounded by supporting people or teams, so that person doesn’t have to worry about anything else. For communications, a chat channel is better than email or phone since it works in real time, allows multiple people to collaborate, and leaves a written log.

The incident manager (the person leading the response team) doesn’t have to be a manager or the person most familiar with the system. Indeed, it’s better for knowledge dissemination that it be someone else; you also want to make sure you don’t have a “bus factor” of one. If there are runbooks for recovering from an incident, a game day is also a way to test that documentation, by having someone other than the author go through it.

You may also want to have a coordinator to answer to business units and executives; you don’t want them asking for updates and distracting the incident manager. Other teams can be observers; a game day should be a learning opportunity for all.

During the game day, document the incident while it is ongoing, with timestamps, observations, and actions taken; a minimal sketch of such a log follows the questions below. After the game day, perform a postmortem to answer questions like these:

  • What did we do and how can we do better?

  • Did the monitoring tools alert correctly in the first place, and were those alerts routed by the pager system to the person or teams on call?

  • Did the incident team have enough information from the monitoring, logging, and metrics systems?

  • Did the incident team make use of documentation like playbooks and checklists?

  • Did the members of the incident team collaborate well?
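
A simple append-only, timestamped log, kept in the chat channel or a shared file, supports both the in-incident record and the postmortem. The helper below is a minimal sketch; the file name and entry format are assumptions.

    # Minimal timestamped incident log; the file path and entry format are
    # illustrative assumptions.
    from datetime import datetime, timezone

    LOG_FILE = "gameday-incident-log.md"

    def log_entry(author: str, text: str) -> None:
        """Append a UTC-timestamped observation or action to the log."""
        timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        with open(LOG_FILE, "a", encoding="utf-8") as f:
            f.write(f"- {timestamp} [{author}] {text}\n")

    # Example usage during the exercise:
    log_entry("alice", "Pager alert received: checkout p99 latency above 2 s")
    log_entry("bob", "Removed the latency rule on app-server-3; latency recovering")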

If needed, update the technical documentation and procedures, and disseminate the lessons learned.
