Chapter 5: Improving Your Availability When it Slips

Your application is operational and online. Your systems are in place, and your team is operating efficiently. Everything seems to be going well. Your traffic is steadily increasing and your sales organization is very happy. All is well.

Then there’s a bit of a slip. Your system suffers an unanticipated outage. But that’s OK; your availability has been fantastic until now. A little outage is no big deal. Your traffic is still increasing. Everyone shrugs it off—it was just “one of those things.”

Then it happens again—another outage. Oops. Well, OK. Overall, we’re still doing well. No need to panic; it was just another “one of those things.”

Then another outage...

Now your CEO is a bit concerned. Customers are beginning to ask what’s going on. Your sales team is starting to worry.

Then another outage...

Suddenly, your once stable and operational system is becoming less and less stable; your outages are getting more and more attention.

Now you’ve got real problems.

What happened? Keeping your system highly available is a daunting task. What do you do if your application availability begins to slip, or has already fallen, and you need to improve it to keep your customers satisfied?

Knowing what you can do when your availability begins to slip will help you to avoid falling into a vicious cycle of problems. The following sections outline some steps you can take when your availability begins to falter.

Measure and Track Your Current Availability

To understand what is happening to your availability, you must first measure your current availability. Tracking when your application is available and when it isn't gives you an availability percentage that shows how you are performing over a specific period of time. You can use this to determine whether your availability is improving or faltering.

You should continuously monitor your availability percentage and report the results on a regular basis. On top of this, overlay key events in your application, such as when you performed system changes and improvements. This way you can see whether there is a correlation over time between system events and availability issues. This can help you to identify risks to your availability.
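As a simple illustration of the calculation, here is a minimal sketch in Python, using made-up outage and event data, that computes an availability percentage for a reporting window and flags recent changes near each outage so possible correlations stand out:

    from datetime import datetime, timedelta

    # Hypothetical outage records for one reporting window: (start time, duration).
    outages = [
        (datetime(2024, 3, 4, 2, 15), timedelta(minutes=18)),
        (datetime(2024, 3, 19, 14, 40), timedelta(minutes=7)),
    ]

    # Hypothetical key events (deploys, configuration changes) to overlay on the report.
    events = [
        (datetime(2024, 3, 4, 1, 50), "deployed search service v2.3"),
        (datetime(2024, 3, 19, 14, 30), "changed load balancer routing rules"),
    ]

    window_start = datetime(2024, 3, 1)
    window_end = datetime(2024, 4, 1)
    window = window_end - window_start

    downtime = sum((duration for _, duration in outages), timedelta())
    availability = 100.0 * (window - downtime) / window
    print(f"Availability for window: {availability:.3f}%")

    for start, duration in outages:
        # Flag key events that happened within the hour before each outage;
        # these are candidates for correlation, not proof of causation.
        suspects = [desc for when, desc in events
                    if timedelta() <= start - when <= timedelta(hours=1)]
        print(f"Outage at {start} lasting {duration}; recent changes: {suspects or 'none'}")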

Refer back to Measuring Availability if you need a refresher on how to measure availability.

Next, you must understand how your application can be expected to perform from an availability standpoint. One tool you can use to help manage your application availability is service tiers: labels associated with services that indicate how critical each service is to the operation of your business. Service tiers allow you and your teams to distinguish between mission-critical services and those that are valuable but not essential. We'll discuss service tiers in more depth in Service Tiers.
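A service tier registry can start as nothing more than a mapping from service name to tier that dashboards and alerting rules consult. The service names and tier labels in this sketch are hypothetical:

    # Hypothetical tier labels: tier 1 is mission critical, tier 4 is non-essential.
    SERVICE_TIERS = {
        "checkout": 1,            # mission critical: customers cannot buy without it
        "search": 2,              # important, but the site degrades gracefully without it
        "recommendations": 3,     # valuable, not essential
        "nightly-reporting": 4,   # internal only
    }

    def page_on_call(service: str) -> bool:
        """Only the two most critical tiers wake someone up in the middle of the night."""
        return SERVICE_TIERS.get(service, 4) <= 2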

Finally, create and maintain a risk matrix. With this tool, you can gain visibility into the technical debt and associated risk present in your application. Risk matrices are covered more fully in The Risk Matrix and risk in general is discussed in What Is Risk Management? and Likelihood Versus Severity.
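A risk matrix can likewise start as a structured list scored by likelihood and severity. The entries and the 1-to-5 scoring scale in this sketch are illustrative only:

    from dataclasses import dataclass

    @dataclass
    class Risk:
        description: str
        likelihood: int   # 1 (rare) to 5 (almost certain)
        severity: int     # 1 (minor) to 5 (catastrophic)
        mitigation: str

        @property
        def score(self) -> int:
            # A simple likelihood-times-severity score for ranking risks.
            return self.likelihood * self.severity

    risks = [
        Risk("Primary database runs out of disk", 3, 5, "Alert on disk usage; automate expansion"),
        Risk("Image resize job fails silently", 4, 2, "Add a dead-letter queue and retries"),
    ]

    # Review the riskiest items first.
    for risk in sorted(risks, key=lambda r: r.score, reverse=True):
        print(f"[{risk.score:2d}] {risk.description} -> {risk.mitigation}")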

Now that you have a way to track your availability and a way of identifying and managing your risk, you will want to review your risk management plans on a regular basis.

Additionally, you should create and implement mitigation plans to reduce your application risks. This will give you a concrete set of tasks you and your development teams can implement to tackle the riskiest parts of your application. This is discussed in detail in Risk Mitigation.

Automate Your Manual Processes

To maintain high availability, you need to remove unknowns and variables. Performing operations manually is a common way to introduce variability and unknown results into your system.

You should never perform a manual operation on a production system.

When you make a change to your system, the change might improve your system, or it might compromise it. Using only repeatable tasks gives you the following:

The ability to test a task before implementing it. Testing what happens when you make a specific change is critical to avoiding mistakes that cause outages.

The ability to tweak the task to perform exactly what you want it to do. This lets you refine the change you are about to make before you make it.

The ability to have the task reviewed by a third party. This increases the likelihood that the task will not have unexpected side effects.

The ability to put the task under version control. Version control systems allow you to determine when the task was changed, by whom, and for what reasons.

The ability to apply the task to related resources. Making a change to a single server that improves how that server works is great. Being able to apply the same change to every affected server in a consistent way makes the task even more useful.

The ability to have all related resources act consistently. If you continuously make "one-off" changes to resources such as servers, the servers will begin to drift and act differently from one another. This makes it difficult to diagnose problematic servers because there will be no baseline of operational expectation that you can use for comparison.

The ability to implement repeatable tasks. Repeatable tasks are auditable tasks. Auditable tasks are tasks that you can analyze later for their impact, positive or negative, on the system as a whole.

There are many systems for which no one has access to the production environment. Period. The only access to production is through automated processes and procedures. The owners of these systems lock down their environments like this specifically for the aforementioned reasons.

In summary, if you can’t repeat a task, it isn’t a useful task. There are many places where adding repeatability to changes will help keep your system and application stable. This includes server configuration changes, performance tuning tweaks and adjustments, restarting servers, restarting jobs and tasks, changing routing rules, and upgrading and deploying software packages.

Automated Deploys

By automating deploys, you guarantee that changes are applied consistently throughout your system and that you can apply similar changes later with known results. Additionally, rollbacks to known good states become more reliable with automated deployment systems.
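The exact mechanism depends on your deployment tooling, but the basic shape is the same: record the currently deployed version, apply the new one, and restore the recorded version if anything fails. Here is a hedged sketch; the deploy-tool commands are stand-ins for whatever your deployment system actually provides:

    import subprocess

    def run(cmd: list[str]) -> None:
        # Raise if the command fails so every step is all-or-nothing.
        subprocess.run(cmd, check=True)

    def deploy(new_version: str, current_version: str) -> None:
        """Deploy new_version; roll back to current_version if anything fails.

        "deploy-tool" is a stand-in for whatever your deployment system provides.
        """
        try:
            run(["deploy-tool", "release", new_version])
            run(["deploy-tool", "smoke-test", new_version])
        except subprocess.CalledProcessError:
            # Restore the known good state rather than leaving the system half-changed.
            run(["deploy-tool", "release", current_version])
            raise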

Configuration Management

Rather than “tweaking a configuration variable” in the kernel of a server, use a process to apply the change in an automated manner.

At the very least, write a script that will make the change, and then check that script into your software change management system. This enables you to make the same change to all servers in your system uniformly. Additionally, when you need to add new servers to your system or replace old ones, having a known configuration that can be applied improves the likelihood that you can add the new server to your system safely, with minimal impact.
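For example, a kernel setting such as the maximum number of open files can live in a small script that is reviewed, checked into version control, and then run on every affected server. A sketch, assuming a hypothetical agreed-upon value and root access:

    #!/usr/bin/env python3
    """Set fs.file-max to an agreed-upon value on any server this script is run on.

    Checked into version control so the change is reviewable, repeatable, and auditable.
    Must be run as root.
    """
    import subprocess

    DESIRED_FILE_MAX = "500000"   # hypothetical value agreed on in review

    def main() -> None:
        current = subprocess.run(
            ["sysctl", "-n", "fs.file-max"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if current == DESIRED_FILE_MAX:
            print("fs.file-max is already correct; nothing to do")
            return
        # Apply the change now, and persist it so it survives a reboot.
        subprocess.run(["sysctl", "-w", f"fs.file-max={DESIRED_FILE_MAX}"], check=True)
        with open("/etc/sysctl.d/99-file-max.conf", "w") as conf:
            conf.write(f"fs.file-max = {DESIRED_FILE_MAX}\n")

    if __name__ == "__main__":
        main()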

Better still, and consistent with modern, state-of-the-art configuration management best practices, is to employ a concept called Infrastructure as Code. Infrastructure as Code involves describing your infrastructure in a standard, machine-readable specification. You then pass that specification through an infrastructure tool that creates and/or updates your infrastructure and its configuration to match the specification. Tools such as Puppet and Chef can help make this process easier to manage.

Then, take this specification and check it into your revision control system, so that changes to the specification can be tracked just like code changes. Running the specification through the infrastructure tool any time the specification changes will update your live infrastructure to match it.
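Conceptually, the infrastructure tool's job is to compare the declarative specification against the live environment and change only what differs. The specification format and apply function below are purely illustrative and are not the syntax of any particular tool:

    # Illustrative declarative specification, checked into revision control.
    desired_state = {
        "web-server-count": 6,
        "web-server-type": "m5.large",
        "open-files-limit": 500000,
    }

    def apply(desired: dict, current: dict) -> None:
        """Bring the live infrastructure in line with the specification."""
        for key, want in desired.items():
            have = current.get(key)
            if have != want:
                print(f"updating {key}: {have} -> {want}")
                # A real tool (Puppet, Chef, and the like) would make the change here.

    # Example: the live environment has drifted to five web servers.
    apply(desired_state, {"web-server-count": 5, "web-server-type": "m5.large"})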

If anyone needs to make a change to the infrastructure or its configuration, they must make the change to the specification, check the change into revision control, then “deploy” the change via the infrastructure tool to update your live infrastructure to match. In this manner, you can:

Ensure all components of the infrastructure have a consistent, known, and stable configuration.

Track all changes to the infrastructure so they can be rolled back if needed, or used to assist in correlation with system events and outages.

Allow a peer review process, similar to a code review process, to ensure changes to your infrastructure are correct and appropriate.

Allow creating duplicate environments to assist in testing, staging, and development with an environment identical to production.

This same sort of process applies to all infrastructure components. This includes servers and their operating system configuration, but also other cloud components, VPCs, load balancers, switches, routers, network components, and monitoring applications and systems.

For Infrastructure as Code management to be useful, it must be used for all system changes, all the time. It is never acceptable to bypass the infrastructure management system to make a change under any circumstances. Ever.

Don’t Worry, I Fixed It

You would be surprised at the number of times I have received an operational update email that said something like: "We had a problem with one of our servers last night. We hit a limit on the maximum number of open files the server could handle. So I tweaked the kernel variable to increase the maximum number of open files, and the server is operational again."

That is, it is operating correctly until someone accidentally overwrites the change, because the change was never documented. Or until one of the other servers running the application hits the same problem but does not have the change applied.

Or until someone makes another change that breaks because it is inconsistent with the undocumented change you just made.

Consistency, repeatability, and unfaltering attention to detail are critical to making a configuration management process work. And a standard, repeatable configuration management process such as the one described here is critical to keeping your scaled system highly available.

Change Experiments and High Frequency Changes

Another advantage of having a highly repeatable, highly automated process for making changes and upgrades to your system is that it allows you to experiment with changes. Suppose that you have a configuration change you want to make to your servers that you believe will improve their performance in your application (such as the maximum number of open files change described in “Don’t Worry, I Fixed It” in “Configuration Management”). By using an automated change management process, you can do the following:

Document your proposed change.

Review the change with people who are knowledgeable and might be able to provide suggestions and improvements.

Test the change on servers in a test or staging environment.

Deploy your change quickly and easily.

Examine the results quickly. If the change didn’t have the desired results, you can quickly roll back to a known good state.

The keys to implementing this process are to have an automated change process with rollback capabilities, and to have the ability to make small changes to your system easily and often.1 The former lets you make changes consistently; the latter lets you experiment and roll back failed experiments with little to no impact on your customers.
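One way to put this into practice is to treat each change as a small experiment: apply it, watch a key metric, and roll back automatically if the metric degrades. The metric source and helper callables in this sketch are hypothetical placeholders for your own deployment and monitoring systems:

    import time

    def run_experiment(apply_change, roll_back, error_rate,
                       threshold=0.01, watch_seconds=600):
        """Apply a small change, watch the error rate, and roll back if it degrades.

        apply_change, roll_back, and error_rate are callables supplied by your own
        deployment and monitoring systems; they are placeholders in this sketch.
        """
        baseline = error_rate()
        apply_change()
        deadline = time.time() + watch_seconds
        while time.time() < deadline:
            if error_rate() > baseline + threshold:
                roll_back()
                return False   # experiment failed; system restored to known good state
            time.sleep(30)
        return True            # change held up under observation; keep it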

Automated Change Sanity Testing

By having an automated change and deploy process,2 you can implement an automated sanity test of all changes. You can use a browser testing application for web applications or use something such as New Relic Synthetics to simulate customer interaction with your application.

When you are ready to deploy a change to production, you can have your deployment system first automatically deploy the change to a test or staging environment. You can then have these automated tests run and validate that the changes did not break your application.

If and when those tests pass, you can automatically deploy the change in a consistent manner to your production environment. Depending on how your tests are constructed, you should be able to run the tests regularly on your production environment, as well, to validate that no changes break anything there.
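A sanity suite does not have to be elaborate to be useful. The following sketch, which uses the Python requests library against hypothetical staging URLs, exercises a few critical customer paths and returns a nonzero exit code so the pipeline can stop the rollout if anything fails:

    import sys
    import requests

    # Hypothetical critical customer paths to exercise after each staging deploy.
    CHECKS = [
        ("home page loads", "https://staging.example.com/", 200),
        ("search responds", "https://staging.example.com/search?q=widget", 200),
        ("checkout is reachable", "https://staging.example.com/checkout", 200),
    ]

    def sanity_check() -> bool:
        all_passed = True
        for name, url, expected_status in CHECKS:
            response = requests.get(url, timeout=10)
            if response.status_code != expected_status:
                print(f"FAIL: {name} returned {response.status_code}")
                all_passed = False
            else:
                print(f"ok: {name}")
        return all_passed

    if __name__ == "__main__":
        # A nonzero exit code tells the deployment pipeline to stop the rollout.
        sys.exit(0 if sanity_check() else 1)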

By making the entire process automated, you can increase your confidence that a change will not have a negative impact on your production systems.

Improve Your Systems

Now that you have a system to monitor availability, a way to track risk and mitigations in your system, and a way to easily and safely apply consistent changes to your system, you can focus your efforts on improving the availability of your application itself.

Regularly review your risk matrix (discussed earlier in this chapter and in The Risk Matrix) and your recovery plans. Make reviewing them part of your postmortem process. Execute projects that are designed to mitigate the risks identified in your matrix. Roll out those changes in an automated and safe way, using the sanity tests discussed earlier. Examine how the mitigation has improved your availability. Continue the process until your availability reaches the level you want and need it to be.

You can learn about how to recover from failing services in Dealing with Service Failures.

Publish availability metrics to your management chain. This visibility will help justify projects such as these to improve your system availability.

Your Changing and Growing Application

As your system grows, you'll need to handle larger and larger traffic and data demands. This increase in traffic and data can cause availability issues to compound. A later part of this book provides extensive coverage of application scaling, and many of the topics discussed there will help in improving an application that is experiencing availability issues. In particular, managing mistakes and errors at scale is discussed in Two Mistakes High. Service-level agreement (SLA) management is discussed in Service-Level Agreements. Service tiers, which you can use to identify key availability-impacting services, are discussed in Service Tiers and Using Service Tiers.

Implement Game Day testing, which measures how your application performs in various failure modes. This is discussed further in Game Days.

Keeping on Top of Availability

Typically, your application will change continuously. As such, your risks, mitigations, contingencies, and recovery plans need to constantly change.

Knowing what you can do when your availability begins to slip will help you to avoid falling into a vicious cycle of problems. The ideas in this chapter will help you manage your application and your team to avoid this cycle and keep your availability high.

1 According to Werner Vogels, CTO of Amazon, in 2014 Amazon did 50 million deploys to individual hosts. That’s about one every second.

2 This could be, but does not need to be, a modern continuous integration and continuous deployment (CI/CD) process.
