Chain of Failure

Underneath every system outage is a chain of events like this. One small issue leads to another, which leads to another. Looking at the entire chain of failure after the fact, the failure seems inevitable. If you tried to estimate the probability of that exact chain of events occurring, it would look incredibly improbable. But it looks improbable only if you consider the probability of each event independently. A coin has no memory; each toss has the same probability, independent of previous tosses. The combination of events that caused the failure is not independent. A failure in one point or layer actually increases the probability of other failures. If the database gets slow, then the application servers are more likely to run out of memory. Because the layers are coupled, the events are not independent.
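To make the independence point concrete, here is a minimal sketch with made-up numbers. The three events and every probability in it are illustrative assumptions, not measurements from any real system; only the shape of the comparison matters.

```python
# Hypothetical chain: database gets slow -> app servers run out of memory -> site goes down.
# Every number here is invented for illustration.

# Marginal (unconditional) probability of each event on a given day.
p_db_slow = 0.01
p_out_of_memory = 0.01
p_site_down = 0.01

# Naive estimate: treat the events like coin tosses and multiply the marginals.
p_chain_if_independent = p_db_slow * p_out_of_memory * p_site_down

# Coupled estimate: each event becomes far more likely once the previous one has occurred.
p_oom_given_db_slow = 0.5    # assumed conditional probability
p_down_given_oom = 0.9       # assumed conditional probability
p_chain_coupled = p_db_slow * p_oom_given_db_slow * p_down_given_oom

print(f"treated as independent: {p_chain_if_independent:.6f}")
print(f"with coupling:          {p_chain_coupled:.6f}")
```

Under these assumed numbers the coupled estimate is thousands of times larger than the naive one, which is why the chain only looks improbable when each link is judged in isolation.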

Here’s some common terminology we can use to be precise about these chains of events:

Fault

A condition that creates an incorrect internal state in your software. A fault may be due to a latent bug that gets triggered, or it may be due to an unchecked condition at a boundary or external interface.

Error

Visibly incorrect behavior. When your trading system suddenly buys ten billion dollars of Pokémon futures, that is an error.

Failure

An unresponsive system. When a system doesn’t respond, we say it has failed. Failure is in the eye of the beholder: a computer may have the power on but still not respond to any requests.

Triggering a fault opens the crack. Faults become errors, and errors provoke failures. That’s how the cracks propagate.

At each step in the chain of failure, the crack from a fault may accelerate, slow, or stop. A highly complex system with many degrees of coupling offers more pathways for cracks to propagate along, more opportunities for errors.

Tight coupling accelerates cracks. For instance, the tight coupling of EJB calls allowed a resource exhaustion problem in CF to create larger problems in its callers. Coupling the request-handling threads to the external integration calls in those systems caused a remote problem to turn into downtime.

One way to prepare for every possible failure is to look at every external call, every I/O, every use of resources, and every expected outcome and ask, “What are all the ways this can go wrong?” Think about the different types of impulse and stress that can be applied:

  • What if it can’t make the initial connection?

  • What if it takes ten minutes to make the connection?

  • What if it can make the connection and then gets disconnected?

  • What if it can make the connection but doesn’t get a response from the other end?

  • What if it takes two minutes to respond to my query?

  • What if 10,000 requests come in at the same time?

  • What if the disk is full when the application tries to log the error message about the SQLException that happened because the network was bogged down with a worm?
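Several of the questions above (cannot connect, slow to connect, disconnected, connected but silent, slow to respond) can be turned into explicit outcomes of a single call. The following is a minimal sketch using Python’s standard socket module. The function name `probe`, the outcome strings, and the default timeout values are all hypothetical choices for illustration, not a recommended policy.

```python
import socket

def probe(host, port, request, connect_timeout=3.0, read_timeout=5.0):
    """Send one request and classify how the call went (hypothetical helper)."""
    try:
        # Bounded connect: covers "can't connect" and "takes ten minutes to connect".
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except socket.timeout:
        return "connect timed out"
    except OSError:
        return "connect failed"

    with sock:
        # Bounded read: covers "connected but no response" and "slow to respond".
        sock.settimeout(read_timeout)
        try:
            sock.sendall(request)
            reply = sock.recv(4096)
        except socket.timeout:
            return "no response in time"
        except OSError:
            return "disconnected"

    # An empty read means the peer closed the connection without answering.
    return "ok" if reply else "disconnected"
```

The point of the sketch is that every branch returns within a bounded time, so a remote problem surfaces as a value the caller can act on instead of a blocked thread. It does not address the load or full-disk questions in the list.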

That’s just the beginning of everything that can go wrong. The exhaustive brute-force approach is clearly impractical for anything but life-critical systems or Mars rovers. What if you actually have to deliver in this decade?

Our community is divided about how to handle faults. One camp says we need to make systems fault-tolerant. We should catch exceptions, check error codes, and generally keep faults from turning into errors. The other camp says it’s futile to aim for fault tolerance. It’s like trying to make a fool-proof device: the universe will always deliver a better fool. No matter what faults you try to catch and recover from, something unexpected will always occur. This camp says “let it crash” so you can restart from a known good state.
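The “let it crash” position can be sketched in a few lines. This is only an in-process illustration under stated assumptions: `run_supervised`, `make_state`, and `step` are hypothetical names, and real implementations of this idea usually restart whole processes or actors, not a loop variable.

```python
def run_supervised(make_state, step, inputs, max_restarts=3):
    """On any exception, discard the current state and restart from a known good one."""
    restarts = 0
    state = make_state()
    results = []
    for item in inputs:
        try:
            state, out = step(state, item)
            results.append(out)
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # too many crashes: escalate to whoever supervises this loop
            state = make_state()  # known good state; the offending item is dropped
    return results, restarts


def step(total, x):
    """Toy worker: keeps a running total and crashes on input it did not expect."""
    if x < 0:
        raise ValueError("negative input")  # stands in for an unanticipated fault
    total += x
    return total, total


results, restarts = run_supervised(lambda: 0, step, [1, 2, -1, 5])
print(results, restarts)
```

A fault-tolerant version of the same worker would instead catch the `ValueError` inside `step` and try to carry on with its existing state. The supervised version makes no attempt to understand the fault; it only guarantees that whatever runs next starts clean.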

Both camps agree on two things, though. Faults will happen; they can never be completely prevented. And we must keep faults from becoming errors. You have to decide for your system whether it’s better to risk failure or errors—even while you try to prevent failures and errors. We’ll look at some patterns that let you create shock absorbers to relieve those stresses.
