Chapter 4. Improving Your Availability When It Slips

Your application is operational and online. Your systems are in place, and your team is operating efficiently. Everything seems to be going well. Your traffic is steadily increasing and your sales organization is very happy. All is well.

Then there’s a bit of a slip. Your system suffers an unanticipated outage. But that’s OK; your availability has been fantastic until now. A little outage is no big deal. Your traffic is still increasing. Everyone shrugs it off—it was just “one of those things.”

Then it happens again—another outage. Oops. Well, OK. Overall, we’re still doing well. No need to panic; it was just another “one of those things.”

Then another outage…

Now your CEO is a bit concerned. Customers are beginning to ask what’s going on. Your sales team is starting to worry.

Then another outage…

Suddenly, your once stable and operational system is becoming less and less stable; your outages are getting more and more attention.

Now you’ve got real problems.

What happened? Keeping your system highly available is a daunting task. What do you do if availability begins to slip? What do you do if your application availability has fallen or begins to fall, and you need to improve it to keep your customers satisfied?

Knowing what you can do when your availability begins to slip will help you to avoid falling into a vicious cycle of problems. The following sections outline some steps you can take when your availability begins to falter.

Measure and Track Your Current Availability

To understand what is happening to your availability, you must first measure your current availability. Tracking when your application is available and when it isn’t gives you an availability percentage that shows how you are performing over a specific period of time. You can use this to determine whether your availability is improving or faltering.
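As a rough illustration of the availability percentage described above, the following sketch computes it from a list of recorded outages. The function name, the outage record format, and the sample dates are assumptions made for this example; Chapter 3 remains the reference for how to measure availability.

```python
from datetime import datetime, timedelta


def availability_percent(period_start, period_end, outages):
    """Return the percentage of the period during which the application was up.

    `outages` is a list of (start, end) datetime pairs. This sketch assumes
    the outages do not overlap one another.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")

    down = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()

    return 100.0 * (total - down) / total


# Hypothetical data: one 432-minute outage in a 30-day reporting period.
period_start = datetime(2024, 1, 1)
period_end = period_start + timedelta(days=30)
outages = [(datetime(2024, 1, 10, 12, 0), datetime(2024, 1, 10, 19, 12))]
print(f"{availability_percent(period_start, period_end, outages):.2f}%")  # 99.00%
```

Running the same calculation for each reporting period gives a series you can chart to see whether availability is trending up or down.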

You should continuously monitor your availability percentage and report the results on a regular basis. On top of this, overlay key events in your application, such as when you performed system changes and improvements. This way you can see whether there is a correlation over time between system events and availability issues. This can help you to identify risks to your availability.


Refer back to Chapter 3 if you need a refresher on how to measure availability.

Next, you must understand how your application can be expected to perform from an availability standpoint. A tool that you can use to help manage your application availability is service tiers. These are simply labels associated with services that indicate how critical a service is to the operation of your business. This allows you and your teams to distinguish between mission-critical services and those that are valuable but not essential. We’ll discuss service tiers in more depth in Chapter 16.

Finally, create and maintain a risk matrix. With this tool, you can gain visibility into the technical debt and associated risk present in your application. Risk matrices are covered more fully in Chapter 7 and risk in general is discussed in Chapters 5 and 6.

Now that you have a way to track your availability and a way of identifying and managing your risk, you will want to review your risk management plans on a regular basis.

Additionally, you should create and implement mitigation plans to reduce your application risks. This will give you a concrete set of tasks you and your development teams can implement to tackle the riskiest parts of your application. This is discussed in detail in Chapter 8.

Automate Your Manual Processes

To maintain high availability, you need to remove unknowns and variables. Performing manual operations is a common way to introduce variable or unknown results into your system.

You should never perform a manual operation on a production system.

When you make a change to your system, the change might improve your system or it might compromise it. Using only repeatable tasks gives you the following:

  • The ability to test a task before implementing it. Testing what happens when you make a specific change is critical to avoiding mistakes that cause outages.

  • The ability to tweak the task to perform exactly what you want the task to do. This lets you implement improvements to the change you are about to make, before you make it.

  • The ability to have the task reviewed by a third party. This increases the likelihood that the task will not have unexpected side effects.

  • The ability to put the task under version control. Version control systems allow you to determine when the task is changed, by whom, and for what reasons.

  • The ability to apply the task to related resources. Making a change to a single server that improves how that server works is great. Being able to apply the same change to every affected server in a consistent way makes the task even more useful.

  • The ability to have all related resources act consistently. If you continuously make “one off” changes to resources such as servers, the servers will begin to drift and act differently from one another. This makes it difficult to diagnose problematic servers because there will be no baseline of operational expectation that you can use for comparison.

  • The ability to implement repeatable tasks. Repeatable tasks are auditable tasks. Auditable tasks are tasks that you can analyze later for their impact, positive or negative, on the system as a whole.
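The drift problem mentioned in the list above can be made concrete with a small sketch. Assuming each server's settings can be collected into a dictionary, the function below compares every server against a baseline and reports the differences. The function name, server names, and setting names are hypothetical and chosen only for illustration.

```python
def find_drift(baseline, servers):
    """Return {server_name: {setting: (expected, actual)}} for drifted servers.

    `baseline` maps setting names to expected values. `servers` maps server
    names to their observed settings. A setting that is absent on one side
    is reported with None for that side.
    """
    drift = {}
    for name, config in servers.items():
        diffs = {}
        # Check settings that appear in either the baseline or the server.
        for key in sorted(set(baseline) | set(config)):
            expected = baseline.get(key)
            actual = config.get(key)
            if expected != actual:
                diffs[key] = (expected, actual)
        if diffs:
            drift[name] = diffs
    return drift


# Hypothetical fleet: web2 received a one-off manual change at some point.
baseline = {"max_open_files": 65536, "swappiness": 10}
servers = {
    "web1": {"max_open_files": 65536, "swappiness": 10},
    "web2": {"max_open_files": 1024, "swappiness": 10},
}
print(find_drift(baseline, servers))
```

A report like this is only possible when a baseline exists, which is the point of making every change through a repeatable task.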

There are many systems for which no one has access to the production environment. Period. The only access to production is through automated processes and procedures. The owners of these systems lock down their environments like this specifically for the aforementioned reasons.

In summary, if you can’t repeat a task, it isn’t a useful task. There are many places where adding repeatability to changes will help keep your system and application stable. This includes server configuration changes, performance tuning tweaks and adjustments, restarting servers, restarting jobs and tasks, changing routing rules, and upgrading and deploying software packages.

Automated Deploys

By automating deploys, you guarantee changes are applied consistently throughout your system, and that you can apply similar changes later with known results. Additionally, rollbacks to known good states become more reliable with automated deployment systems.

Configuration Management

Rather than “tweaking a configuration variable” in the kernel of a server, use a process to apply the change in an automated manner. For example, write a script that will make the change, and then check that script into your software change management system. This enables you to make the same change to all servers in your system uniformly. Additionally, when you need to add new servers to your system or replace old ones, having a known configuration that can be applied improves the likelihood that you can add the new server to your system safely, with minimal impact. Tools like Puppet and Chef can help make this process easier to manage.
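A minimal sketch of such a script follows. It assumes a simple "key = value" configuration file format and uses an illustrative setting name; neither is taken from a particular system described here. The important property is that the change is idempotent: applying it once or many times yields the same file, so it can be run safely against every server.

```python
def set_config_value(text, key, value):
    """Return `text` with `key = value` set.

    An existing entry for `key` is replaced; otherwise a new line is
    appended. Comment lines (starting with '#') are left untouched.
    """
    new_line = f"{key} = {value}"
    lines = text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue
        existing_key = stripped.split("=", 1)[0].strip()
        if existing_key == key:
            lines[i] = new_line
            found = True
    if not found:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Hypothetical file contents and setting name, for illustration only.
original = "# tuning\nfs.file-max = 1024\n"
once = set_config_value(original, "fs.file-max", 65536)
twice = set_config_value(once, "fs.file-max", 65536)
print(once == twice)  # True: re-applying the change has no further effect
```

In practice a tool such as Puppet or Chef would take the place of a hand-written function like this, but the principle of a checked-in, repeatable change is the same.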

The same applies to all infrastructure components, not just servers. This includes switches, routers, network components, and monitoring applications and systems.

For configuration management to be useful, it must be used for all system changes, all the time. It is never acceptable to bypass the configuration management system to make a change under any circumstances. Ever.

Change Experiments and High Frequency Changes

Another advantage of having a highly repeatable, highly automated process for making changes and upgrades to your system is that it allows you to experiment with changes. Suppose that you have a configuration change you want to make to your servers that you believe will improve their performance in your application (such as the maximum number of open files change described in “Don’t Worry, I Fixed It”). By using an automated change management process, you can do the following:

  • Document your proposed change.

  • Review the change with people who are knowledgeable and might be able to provide suggestions and improvements.

  • Test the change on servers in a test or staging environment.

  • Deploy your change quickly and easily.

  • Examine the results quickly. If the change didn’t have the desired results, you can quickly roll back to a known good state.

The keys to implementing this process are to have an automated change process with rollback capabilities, and to have the ability to make small changes to your system easily and often.1 The former lets you make changes consistently; the latter lets you experiment and roll back failed experiments with little to no impact on your customers.
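The experiment-and-rollback loop described above might be sketched as follows. This is an assumption-laden illustration rather than a real change management system: the configuration is modeled as a dictionary, and `change` and `is_healthy` are hypothetical callables supplied by the caller.

```python
import copy


def apply_with_rollback(config, change, is_healthy):
    """Apply `change` to a copy of `config` and keep it only if it is healthy.

    Returns (resulting_config, kept). When the health check fails, the
    returned configuration is the untouched known good state.
    """
    known_good = copy.deepcopy(config)   # snapshot to roll back to
    candidate = copy.deepcopy(config)
    change(candidate)                    # apply the experimental change
    if is_healthy(candidate):
        return candidate, True
    return known_good, False             # roll back


def raise_file_limit(cfg):
    # Hypothetical experiment: raise the open file limit.
    cfg["max_open_files"] = 65536


def limit_is_sane(cfg):
    # Hypothetical health check standing in for real monitoring.
    return cfg["max_open_files"] <= 100000


current = {"max_open_files": 1024}
current, kept = apply_with_rollback(current, raise_file_limit, limit_is_sane)
print(current, kept)
```

Because each experiment is small and the rollback path is automatic, a failed experiment leaves the system exactly where it started.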

Automated Change Sanity Testing

By having an automated change and deploy process,2 you can implement an automated sanity test of all changes. You can use a browser testing application for web applications or use something such as New Relic Synthetics to simulate customer interaction with your application.

When you are ready to deploy a change to production, you can have your deployment system first automatically deploy the change to a test or staging environment. You can then have these automated tests run and validate that the changes did not break your application.

If and when those tests pass, you can automatically deploy the change in a consistent manner to your production environment. Depending on how your tests are constructed, you should be able to run the tests regularly on your production environment, as well, to validate that no changes break anything there.

By making the entire process automated, you can increase your confidence that a change will not have a negative impact on your production systems.
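The staged flow described in this section can be sketched as a small function. The `deploy` callable and the sanity tests are hypothetical stand-ins for whatever deployment system and testing tool you use; nothing here reflects the API of a specific product.

```python
def deploy_pipeline(change, deploy, sanity_tests):
    """Deploy to staging, run sanity tests, and promote only if all pass.

    `deploy(environment, change)` performs a deployment. `sanity_tests`
    maps a test name to a callable that takes an environment name and
    returns True when the test passes.
    """
    deploy("staging", change)
    failures = [name for name, test in sanity_tests.items()
                if not test("staging")]
    if failures:
        # Stop here: the change never reaches production.
        return {"promoted": False, "failed_tests": failures}

    deploy("production", change)
    # Run the same tests against production to confirm nothing broke there.
    failures = [name for name, test in sanity_tests.items()
                if not test("production")]
    return {"promoted": True, "failed_tests": failures}


# Hypothetical usage with a fake deploy function that records its calls.
calls = []


def fake_deploy(environment, change):
    calls.append((environment, change))


result = deploy_pipeline("v2", fake_deploy, {"home_page": lambda env: True})
print(result, calls)
```

Keeping the promotion decision in code means the same checks run on every change, with no manual step that can be skipped.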

Improve Your Systems

Now that you have a system to monitor availability, a way to track risk and mitigations in your system, and a way to easily and safely apply consistent changes to your system, you can focus your efforts on improving the availability of your application itself.

Regularly review your risk matrix (discussed earlier in this chapter and in Chapter 7) and your recovery plans. Make reviewing them part of your postmortem process. Execute projects that are designed to mitigate the risks identified in your matrix. Roll out those changes in an automated and safe way, using the sanity tests discussed earlier. Examine how the mitigation has improved your availability. Continue the process until your availability reaches the level you want and need it to be.

You can learn about how to recover from failing services in Chapter 13.

Publish availability metrics to your management chain. This visibility will help with justifying projects such as these to improve your system availability.

Your Changing and Growing Application

As your system grows, you’ll need to handle larger and larger traffic and data demands. This increase in traffic and data can cause availability issues to compound. Part IV provides extensive coverage of application scaling, and many of the topics discussed in that part will help in improving an application that is experiencing availability issues. In particular, managing mistakes and errors at scale is discussed in Chapter 14. Service-level agreement (SLA) management is discussed in Chapter 18. Service tiers, which you can use to identify key availability-impacting services, are discussed in Chapters 16 and 17.

Implement Game Day testing, which measures how your application performs in various failure modes. This is discussed further in Chapter 9.

Keeping on Top of Availability

Typically, your application will change continuously. As such, your risks, mitigations, contingencies, and recovery plans need to constantly change.

Knowing what you can do when your availability begins to slip will help you to avoid falling into a vicious cycle of problems. The ideas in this chapter will help you manage your application and your team to avoid this cycle and keep your availability high.

1 According to Werner Vogels, CTO of Amazon, in 2014 Amazon did 50 million deploys to individual hosts. That’s more than one every second, on average.

2 This could be, but does not need to be, a modern continuous integration and continuous deployment (CI/CD) process.

