Chapter 3: Two Mistakes High

Surely, this is OK...

Consider the following anecdote I once overheard:

We were wondering how changing a setting on our MySQL database might impact our performance, but we were worried that the change might cause our production database to fail. Because we didn’t want to bring down production, we decided to make the change to our backup (replica) database, instead. After all, it wasn’t being used for anything at the moment.

Makes sense, right? Have you ever heard this rationale before?

Well, the problem here is that the database was being used for something. It was being used to provide a backup for production. Except, it couldn’t be used that way anymore.

You see, the backup database was essentially being used as an experimental playground for trying different types of settings. The net result was that the backup database began to drift away from the primary production database as settings began to change over time.

Then, one day, the inevitable happened.

The production database failed.

The backup database initially did what it was supposed to do. It took over the job of the primary database. Except, it really couldn’t. The settings on the backup database had wandered so far away from those required by the primary database that it could no longer reliably handle the same traffic load that the primary database handled.

The backup database slowly failed, and the site went down.

This is a true story. It’s a story about best intentions. You have a backup, replicated database on standby, ready to take over when the primary database fails. Except that the backup database wasn’t treated with the same respect as the primary database, and so it lost the ability to perform its main purpose: being the backup database.

Two wrongs don’t make a right, two mistakes don’t negate each other, and two problems don’t self-correct. A primary database failure along with a poorly managed backup server does not create a good day.

What Is “Two Mistakes High”?

If you’ve ever flown radio control (R/C) airplanes before, you might have heard the expression “keep your plane two mistakes high.”

When you learn to fly R/C planes, especially when you begin learning how to do acrobatics, you learn this quickly. You see, mistakes equate to altitude. You make a mistake, you lose altitude. Lose too much altitude, and, well, badness happens. Keeping your plane “two mistakes high” means keeping it high enough that you have enough altitude to recover from two independent mistakes.

Why two mistakes? Simple. You always want to be operating your plane high enough so that you can recover if (when) you make a mistake. Now, suppose that you make a mistake and lose a bunch of altitude. During your recovery from that mistake, you also want to be high enough that you can recover from a second mistake. Think about it: during your recovery process, you are typically stressed and perhaps in an awkward situation, doing potentially abnormal things—just the type of situation that can cause you to make another mistake. If you aren’t high enough, you crash.

Put another way, if you normally fly two mistakes high, you can always have a backup plan for recovering from a mistake, even if you are currently recovering from a mistake.

This same philosophy is important to understand when building highly available, high-scale applications.

How do we “keep two mistakes high” in an application? For starters, when we identify the failure scenarios that we anticipate our application facing, we walk through the ramifications of those scenarios and our recovery plan for them. We make sure the recovery plan itself does not have mistakes or other shortcomings built into it—in short, we check that the recovery plan is able to work. If we find that it doesn’t work, then it’s not a recovery plan.

“Two Mistakes High” in Practice

Validating your recovery plan is just one scenario in which “two mistakes high” applies. There are many more. Let’s take a look at some example scenarios to see how this philosophy plays out in our applications.

Losing a Node

Let’s look at an example scenario involving traffic to a web service.

Suppose that you’re building a service that is designed to handle 1,000 requests per second (req/sec). Further, let’s assume that a single node in your service can handle 300 req/sec.

Question: How many nodes do you need to handle your traffic demands?

Some basic math should come up with a good answer:

number_of_nodes_needed = ⌈number_of_requests / requests_per_node⌉

where:

number_of_nodes_needed

The number of nodes needed to handle the specified number of requests.

number_of_requests

The design limit for the number of requests the service is expected to handle.

requests_per_node

The expected average number of requests each node in the service can handle.

Putting in our numbers:

number_of_nodes_needed = ⌈1,000 req/sec / 300 req/sec⌉ = ⌈3.33⌉ = 4 nodes

number_of_nodes_needed = 4 nodes

You need four nodes in your service to handle the 1,000 req/sec expected service load. Switching this around, using four nodes, each node will handle:

requests_per_node = number_of_requests / number_of_nodes

requests_per_node = 1,000 req/sec / 4 nodes = 250 req/sec/node

Each node will handle 250 req/sec, which is well below your 300 req/sec per node limit.

Figure 3-1. Four Nodes, 250 req/sec each
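If you want to run these numbers yourself, the arithmetic above translates directly into a few lines of Python. This is a minimal sketch; the function name and the hard-coded figures are illustrative, not taken from any particular system.

```python
import math

def nodes_needed(expected_req_per_sec: float, req_per_sec_per_node: float) -> int:
    """Smallest node count whose combined capacity covers the expected load."""
    return math.ceil(expected_req_per_sec / req_per_sec_per_node)

nodes = nodes_needed(1_000, 300)   # ceil(3.33) -> 4 nodes
per_node_load = 1_000 / nodes      # 250 req/sec per node, below the 300 limit
print(nodes, per_node_load)
```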

You have four nodes in your system. You can handle the expected traffic, and because you have four nodes, you can handle the loss of a node. You have built in the ability to handle a node failure. Right? Right???

Well, no, not really. If you lose a node at peak traffic, your service will begin to fail. Why? Because if you lose a node, all of your traffic must be spread among the remaining three nodes. So:

requests_per_node = number_of_requests / number_of_nodes

requests_per_node = 1,000 req/sec / 3 nodes = 333 req/sec/node

That’s 333 req/sec per node, which is well above your 300 req/sec node limit.

Because each node can handle only 300 req/sec, you have overloaded your servers. Either you will give poor performance to all your customers, or you will drop some requests, or you will begin to fail in other ways. In any case, you will begin to lose availability.

Figure 3-2.  

Figure 3-2. Four Nodes, one failure causes overflow

As you can see from Figure 3-2, if you lose a node in your system, you cannot continue to operate at full capacity. So, even though you think you can recover from a node failure, you really can’t. You are vulnerable.

To handle a node failure, you need more than four nodes. If you want to be able to handle a single node failure, you need five nodes. That way, if one of the five nodes fails, you still have four remaining nodes to handle the load:

requests_per_node = number_of_requests / number_of_nodes

requests_per_node = 1,000 req/sec / 4 nodes = 250 req/sec/node

This is illustrated in Figure 3-3. Because this value is below the node limit of 300 req/sec, there is enough capacity to continue handling all of your traffic, even with a single node failure.

Figure 3-3. Five Nodes, one failure can still be handled

Problems During Upgrades

Upgrades and routine maintenance can cause availability problems beyond just the obvious.

Suppose that you have a service whose average traffic is 1,000 req/sec. Further, let’s assume that a single node in your service can handle 300 req/sec. As discussed in the above example, four nodes is the required minimum to run your service. To handle the expected traffic and to be able to handle a single node failure, you give your service five nodes with which to handle the load.

Now, suppose that you want to do a software upgrade to the service running on the nodes. To keep your service operating at full capacity during the upgrade, you decide to do a rolling deploy.

Put simply, a rolling deploy means that you upgrade one node at a time (temporarily taking it offline to perform the upgrade). After the first node has been upgraded successfully and is handling traffic again, you move on to upgrade the second node (temporarily taking it offline). You continue until all five nodes are upgraded.

Because only one node is offline to be upgraded at any point in time, there are always at least four nodes handling traffic. Because four nodes is enough to handle all of your traffic, your service stays up and operational during the upgrade.

This is a great plan. You’ve built a system that not only can handle a single node failure but can also be upgraded via a rolling deploy without any downtime.

But what happens if a single node failure occurs during an upgrade? In that case, you have one node unavailable for the upgrade, and one node failed. That leaves only three nodes to handle all your traffic, which is not enough. You are experiencing a service degradation or outage.

But, what’s the likelihood of a node failure occurring during an upgrade?

How many times have you had an upgrade fail? In fact, an argument can be made that you are more prone to node failures around the time of an upgrade than at any other point in time. The upgrade and the node failure do not have to be independent.

The lesson is this: even if you think you have redundancy to handle different failure modes, if it is likely that two or more problems can occur at the same time (because the problems are correlated), you essentially do not have redundancy at all. You are prone to an availability issue.

So, in summary, to handle the 1,000 req/sec of expected traffic using nodes that can handle 300 req/sec each, we need the following (the sketch after this list generalizes the calculation):

Four nodes

Which can handle the traffic but will not handle a node failure.

Five nodes

Which handles a single node failure, or makes it possible for a node to be unavailable for maintenance or upgrade.

Six nodes

Which can handle a two-node failure, or makes it possible to survive a single node failure while another node is down for maintenance or upgrade.
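The following sketch generalizes this summary under the deliberately pessimistic assumption that node failures and maintenance outages can overlap, which is the whole point of this chapter. The function and parameter names are hypothetical.

```python
import math

def total_nodes(expected_req_per_sec: float,
                req_per_sec_per_node: float,
                tolerated_node_failures: int = 1,
                nodes_in_maintenance: int = 0) -> int:
    # Size the fleet for the expected load, then add one node per simultaneous
    # failure or maintenance outage you want to survive.
    base = math.ceil(expected_req_per_sec / req_per_sec_per_node)
    return base + tolerated_node_failures + nodes_in_maintenance

print(total_nodes(1_000, 300, 0, 0))  # 4: handles the traffic, no headroom
print(total_nodes(1_000, 300, 1, 0))  # 5: survives one failure OR one upgrade
print(total_nodes(1_000, 300, 1, 1))  # 6: survives a failure DURING an upgrade
```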

Data Center Resiliency

Let’s scale the problem up a bit and take a look at data center redundancy and resilience.

Suppose that your service is now handling 10,000 req/sec. With single nodes handling 300 req/sec, that means you need 34 nodes, without considering redundancy for failures and upgrades.

Let’s add a bunch of resiliency and use a total of 40 nodes (each handling 250 req/sec), which allows for plenty of extra capacity. We could lose up to six nodes, and still handle our full capacity.

Let’s do an even better job: let’s split those 40 nodes evenly across four data centers so that we have even more redundancy.

So, now we are resilient to data center outages as well as node failures. This is illustrated in Figure 3-4.

Right?

Figure 3-4. Four data centers, 40 nodes, sufficient capacity to handle load

Well, good question. Obviously, we can handle individual node outages, because we have given ourselves 6 (40 − 34) extra nodes. But what if a data center goes offline?

If a single data center fails, we lose one quarter of our servers. In this example, we would go from 40 nodes to 30 nodes. Each node no longer handles 250 req/sec; instead, each must handle 334 req/sec. This is illustrated in Figure 3-5. Because this is more than the capacity of our fictitious nodes, we have an availability issue.

Figure 3-5. Four data centers, one failed, 30 nodes, insufficient capacity to handle load

Although we used multiple data centers, a failure of just one of those data centers would leave us unable to handle our full traffic load. We think we are resilient to a data center loss, but we are not.

Then, how many servers do you need?

How many servers do we need to have the ability to lose a data center? Let’s find out.

Using the same assumptions, we know that we need a minimum of 34 working servers to handle all of our traffic. If we are using four data centers, how many servers do we need to have true data center redundancy?

Well, we need to make sure we always have 34 working servers, even if one of the four data centers goes down. This means that we need to have 34 servers spread across three data centers:

nodes_per_data_center = ⌈minimum_number_of_servers / (number_of_data_centers − 1)⌉

nodes_per_data_center = ⌈34 / (4 − 1)⌉

nodes_per_data_center = ⌈11.333⌉ = 12 servers per data center

Because any one of the four data centers could go offline, we need 12 servers in each of the four data centers:

total_nodes = nodes_per_data_center × 4 = 12 × 4 = 48 nodes

We need 48 nodes to guarantee that we have 34 working servers in the case of a data center outage.
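Here is a minimal Python sketch of that calculation. It assumes nodes are spread evenly across data centers and that at most one data center is lost at a time; the function name is hypothetical.

```python
import math

def nodes_for_dc_redundancy(min_working_nodes: int, data_centers: int) -> int:
    # The surviving (data_centers - 1) centers must still hold the minimum
    # number of working nodes, so size each center for that worst case.
    per_dc = math.ceil(min_working_nodes / (data_centers - 1))
    return per_dc * data_centers

print(nodes_for_dc_redundancy(34, 4))  # 12 per data center -> 48 nodes total
```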

How does changing the number of data centers change our calculation? What if we have two data centers? As before:

nodes_per_data_center = ⌈minimum_number_of_servers / (number_of_data_centers − 1)⌉

nodes_per_data_center = ⌈34 / (2 − 1)⌉ = 34 servers per data center

total_nodes = nodes_per_data_center × 2 = 34 × 2 = 68 nodes

If we have two data centers, we need 68 nodes. How about some other configurations? If we have:

Four data centers

We need 48 nodes to maintain data center redundancy.

Six data centers

We need 42 nodes to maintain data center redundancy.

This demonstrates the seemingly odd conclusion:

To provide the ability to recover from an entire data center outage, the more data centers you have, the fewer total nodes you need spread across those data centers.

So much for natural intuition. There is a lesson we can take from this. Although the details of this demonstration might not directly apply to your real-world situation, the point still applies. Be careful when you devise your resiliency plans. Your intuition might not match reality, and if your intuition is wrong, you are prone to an availability issue.

Hidden Shared Failure Types

Sometimes, multiple problem scenarios that seem to be independent and unlikely to occur together are, in fact, dependent scenarios. This means that they could, and in some situations reasonably should, be expected to occur together.

Suppose that your service needs four nodes to handle its traffic. You are trying to think ahead, so you use a total of six nodes—enough to handle both a single node failure and an upgrade in progress.

You’re all set. Your system is safe.

Then, it happens: in your data center, a power supply in a rack goes bad, and the rack goes dark.

It’s usually about this time that you realize that all six of your servers are in the same rack. How do you discover this? Because all six servers go down, and your service is completely down.

There goes redundancy...

Even when you think you are safe, you might not be. We know that not all problems are independent of one another. But this is a case where a potentially unseen, or at least unnoticed, commonality exists between all your servers: they all share the same rack and the same power supply for that rack.

Make sure to check for the hidden shared failure modes that can cause your carefully laid plans to be wrong, thus making you prone to an availability issue.
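One way to make that check concrete is sketched below. It assumes you can export a node-to-rack mapping from whatever inventory or placement system you use; all of the names and numbers here are hypothetical.

```python
REQ_PER_NODE = 300      # illustrative per-node capacity (req/sec)
EXPECTED_LOAD = 1_000   # illustrative expected traffic (req/sec)

# Hypothetical placement data: every node happens to share the same rack.
node_racks = {
    "node1": "rack-A", "node2": "rack-A", "node3": "rack-A",
    "node4": "rack-A", "node5": "rack-A", "node6": "rack-A",
}

def survives_any_single_rack_loss(placements: dict[str, str]) -> bool:
    # For each rack, check whether the nodes outside it can still carry the load.
    for failed_rack in set(placements.values()):
        survivors = [n for n, r in placements.items() if r != failed_rack]
        if len(survivors) * REQ_PER_NODE < EXPECTED_LOAD:
            return False
    return True

print(survives_any_single_rack_loss(node_racks))  # False: rack-A is a shared failure point
```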

Failure Loops

A failure loop occurs when a specific problem causes your system to fail in a way that makes it difficult or impossible for you to fix the problem without causing a worse problem to occur.

The best way to explain this is with a non-server-based example. Suppose you live in a great apartment that even provides an enclosed garage for you to store things! Wow, you are set.

But the power in the place goes out a lot, so you decide to buy a generator that you can use when the power does go out. You take the generator, and the gas it uses, and store them in the garage. Life is good.

Then, when the power goes out, you go to get your generator.

That’s when you realize for the first time that the only way to access your garage is through the electric-powered garage door—the one that doesn’t work because the power is out.

Oops.

Just because you have a backup plan does not mean you can implement the backup plan when needed.

The same issues can apply to our service world. Can a service failure make it difficult to repair that same service because it caused some other seemingly unrelated issue to occur?

For example, if your service fails, how easy is it to deploy an updated version of your service? What happens if your service-deployment service fails? What about if the service you use to monitor the performance of other services fails?

Make sure the plans you have for recovering from a problem can be implemented even when the problem is occurring. Dependent relationships between the problem and the solution to the problem can make you prone to an availability issue.

Managing Your Applications

“Fly two mistakes high” in our context means don’t just look for the surface failure modes. Look the next level down. Make sure that you do not have dependent failure modes and that the recovery mechanisms you have put in place will, in fact, recover your system while a failure is going on.

Additionally, don’t ignore problems. They don’t go away, and they can interfere with your predicted availability plans. Just because the database that fails is only the backup database doesn’t mean it isn’t mission-critical to fix. Treat your backup and redundant systems just as preciously as you treat your primary systems. After all, they are just as important.

As a friend of mine is often heard saying, “if it touches production, it is production.” Don’t take anything in production for granted.

This stuff is difficult. It isn’t at all obvious when you have these types of layered or dependent failures. Take the time to look for these situations and resolve them.

The Space Shuttle

Let’s end this chapter with a great example of an independent, redundant, multilevel error-recoverable system. In fact, it was one of the very first large-scale software applications that utilized extreme principles of redundancy and failure management. It had to—the astronauts’ lives depended on it.

I’m referring to the United States Space Shuttle program.

The Space Shuttle program had some significant and serious mechanical problems, which we won’t fully address here. But the software system built into the Space Shuttle utilized state-of-the-art techniques for redundancy and independent error recovery.

The primary computer system of the Space Shuttle consisted of five computers. Four of them were identical computers with identical software running on them, but the fifth was different. We’ll discuss that later.

The four main computers all ran the exact same program during critical parts of the mission (such as launch and landing). The four computers were all given the same data and had the same software, and were expected to generate the same results. All four performed the same calculations, and they constantly compared the results. If, at any point in time, any of the computers generated a different result, the four computers voted on which result was correct. The winning result was used, and the computer(s) that generated the losing result were turned off for the duration of the flight. The shuttle could successfully fly with only three computers turned on, and it could safely land with only two operational computers.

Talk about the ultimate in democratic systems. The winners rule, and the losers are terminated.

But what would happen if the four computers couldn’t agree? This could happen if there were multiple failures and multiple computers had been shut down. Or, it could happen if a serious software glitch in the main software affected all four computers at the same time (the four computers were running the exact same software, after all).

This is where the fifth computer came into play. It normally sat idle, but if needed, it could perform the exact same calculations as the other four. The key was the software it ran. The software for the fifth system was a much simpler version, built by a completely independent group of programmers. In theory, it could not have the same software errors as the main software.

So, if the main software and the four main computers could not agree on a result, it left the final result to the fifth, completely independent computer.
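As a toy illustration only (and emphatically not the Shuttle’s actual flight software), the shape of the idea, voting among identical primaries and deferring to an independently written backup when there is no majority, might look something like this; every name here is hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

def redundant_result(primaries: Sequence[Callable[[], int]],
                     backup: Callable[[], int]) -> int:
    # Run every primary computation and tally the results.
    results = [compute() for compute in primaries]
    winner, votes = Counter(results).most_common(1)[0]
    if votes > len(results) // 2:   # clear majority among the primaries
        return winner
    return backup()                 # no agreement: defer to the independent backup

# Three primaries agree, one is faulty, so the majority result wins.
primaries = [lambda: 42, lambda: 42, lambda: 42, lambda: 41]
print(redundant_result(primaries, backup=lambda: 42))  # -> 42
```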

This is a highly redundant, high availability system with a high level of separation between potential problems.

During its 30 years of operation, the Space Shuttle program never experienced a serious life-threatening problem during any of its missions that was a result of the failure of the software or the computers they ran—even though the software was, at the time, the most complex software system ever built for a space program to use.
