Let It Crash

Sometimes the best thing you can do to create system-level stability is to abandon component-level stability. In the Erlang world, this is called the “let it crash” philosophy. We know from Chapter 2, Case Study: The Exception That Grounded an Airline, that there is no hope of preventing every possible error. Dimensions proliferate and the state space exponentiates. There’s just no way to test everything or predict all the ways a system can break. We must assume that errors will happen.

The key question is, “What do we do with the error?” Most of the time, we try to recover from it. That means getting the system back into a known good state using things like exception handlers to fix the execution stack and try-finally blocks or block-scoped resources to clean up memory leaks. Is that sufficient?
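
As a sketch of what that conventional recovery looks like in Elixir, consider the following; the Database module and its functions are hypothetical stand-ins for whatever resource you’re guarding:

  defmodule Recovery do
    # A sketch of in-place recovery. Database.connect/1, Database.query/2,
    # and Database.close/1 are hypothetical helpers.
    def query_with_recovery(conn_args, sql) do
      conn = Database.connect(conn_args)

      try do
        {:ok, Database.query(conn, sql)}
      rescue
        # The exception handler unwinds the stack and turns the failure
        # into a value the caller has to interpret.
        e -> {:error, Exception.message(e)}
      after
        # The cleanup block releases the resource so it doesn't leak.
        Database.close(conn)
      end
    end
  end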

The cleanest state your program can ever have is right after startup. The “let it crash” approach says that error recovery is difficult and unreliable, so our goal should be to get back to that clean startup as rapidly as possible.

For “let it crash” to work, a few things have to be true in our system.

Limited Granularity

There must be a boundary for the crashiness. We want to crash a component in isolation. The rest of the system must protect itself from a cascading failure. In Erlang or Elixir, the natural boundary is the actor. The runtime system allows an actor to terminate without taking down the entire operating system process. Other languages have actor libraries, such as Akka for Java and Scala.[14] These overlay the actor model on a runtime that has no idea what an actor is. If you follow the library’s rules for resource management and state isolation, you can still get the benefits of “let it crash.” You should plan on more code reviews to make sure every developer follows those rules, though!
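
As a minimal sketch of that isolation, an unlinked Elixir process can crash without disturbing anything around it:

  # A crash stays inside the process that raised it. spawn/1 creates an
  # unlinked process, so the failure never propagates to the caller.
  pid = spawn(fn -> raise "boom" end)

  # The runtime logs the error and reclaims the process; the rest of the
  # VM keeps serving requests.
  Process.sleep(100)
  IO.puts("crashed process alive? #{Process.alive?(pid)}")
  IO.puts("still running")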

In a microservices architecture, a whole instance of the service might be the right granularity. This depends largely on how quickly it can be replaced with a clean instance, which brings us to the next key consideration.

Fast Replacement

We must be able to get back into that clean state and resume normal operation as quickly as possible. Otherwise, we’ll see performance degrade when too many of our instances are restarting at the same time. In the limit, we could have loss of service because all of our instances are busy restarting.

With in-process components like actors, the restart time is measured in microseconds. Callers are unlikely to really notice that kind of disruption. You’d have to set up a special test case just to measure it.
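
If you did set one up, a few lines of Elixir are enough to see the scale; :timer.tc/1 reports elapsed time in microseconds:

  # Rough measurement of how long it takes to start a fresh process.
  # :timer.tc/1 returns {elapsed_microseconds, result}.
  {micros, {:ok, pid}} = :timer.tc(fn -> Agent.start(fn -> %{} end) end)
  IO.puts("started #{inspect(pid)} in #{micros} microseconds")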

Service instances are trickier. It depends on how much of the “stack” has to be started up. A few examples will help illustrate that:

  • We’re running Go binaries in a container. Startup time for a new container and a process in it is measured in milliseconds. Crash the whole container.

  • It’s a NodeJS service running on a long-running virtual machine in AWS. Starting the NodeJS process takes milliseconds, but starting a new VM takes minutes. In this case, just crash the NodeJS process.

  • An aging JavaEE application with an API grafted onto the front end runs on virtual machines in a data center. Startup time is measured in minutes. “Let it crash” is not the right strategy.

Supervision

When we crash an actor or a process, how does a new one get started? You could write a bash script with a while loop in it. But what happens when the problem persists across restarts? The script basically fork-bombs the server.

Actor systems use a hierarchical tree of supervisors to manage the restarts. Whenever an actor terminates, the runtime notifies the supervisor. The supervisor can then decide to restart the child actor, restart all of its children, or crash itself. If the supervisor crashes, the runtime will terminate all its children and notify the supervisor’s supervisor. Ultimately you can get whole branches of the supervision tree to restart with a clean state. The design of the supervision tree is integral to the system design.
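
In Elixir, that tree is declared directly in code. This sketch uses made-up module names; :one_for_all restarts every child when any one of them terminates, so the whole branch comes back with clean state:

  defmodule Billing.Supervisor do
    use Supervisor

    def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

    @impl true
    def init(_arg) do
      children = [
        Billing.RateLookup,
        Billing.InvoiceWriter,
        # A nested supervisor: if it crashes, this supervisor is notified.
        Billing.Retry.Supervisor
      ]

      Supervisor.init(children, strategy: :one_for_all)
    end
  end

The strategy option is exactly the decision described above: :one_for_one restarts only the failed child, :one_for_all restarts all of them, and :rest_for_one restarts the failed child plus every child started after it.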

It’s important to note that the supervisor is not the service consumer. Managing the worker is different from requesting work. Systems suffer when they conflate the two.

Supervisors need to keep close track of how often they restart child processes. It may be necessary for the supervisor to crash itself if child restarts happen too densely. This would indicate that either the state isn’t sufficiently cleaned up or the whole system is in jeopardy and the supervisor is just masking the underlying problem.
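
In Elixir that restart budget is part of the supervisor’s configuration. Revisiting the init/1 callback from the earlier sketch, exceeding the budget makes the supervisor terminate its remaining children and crash itself, escalating the problem to its parent:

  @impl true
  def init(_arg) do
    children = [Billing.RateLookup, Billing.InvoiceWriter]

    # If children crash more than 3 times within 5 seconds, this supervisor
    # gives up and crashes itself so its own supervisor can decide what to do.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end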

With service instances in a PaaS environment, the platform itself decides to launch a replacement. In a virtualized environment with autoscaling, the autoscaler decides whether and where to launch a replacement. Still, these are not the same as a supervisor because they lack discretion. They will always restart the crashed instance, even if it is just going to crash again immediately. There’s also no notion of hierarchical supervision.

Reintegration

The final element of a “let it crash” strategy is reintegration. After an actor or instance crashes and the supervisor restarts it, the system must resume calling the newly restored provider. If the instance was called directly, then callers should have circuit breakers to automatically reintegrate the instance. If the instance is part of a load-balanced pool, then the instance must be able to join the pool to accept work. A PaaS will take care of this for containers. With statically allocated virtual machines in a data center, the instance should be reintegrated when health checks from the load balancer begin to pass.
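
For the load-balanced case, reintegration usually hinges on a health-check endpoint. The sketch below assumes the Plug library and a hypothetical MyApp.ready?/0 that reports whether startup has finished; the load balancer polls /health and returns the instance to the pool once it answers 200:

  defmodule MyApp.HealthCheck do
    use Plug.Router

    plug :match
    plug :dispatch

    get "/health" do
      if MyApp.ready?() do
        send_resp(conn, 200, "OK")
      else
        # Not ready yet: keep this instance out of rotation.
        send_resp(conn, 503, "starting up")
      end
    end

    match _ do
      send_resp(conn, 404, "not found")
    end
  end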

Remember This

Crash components to save systems.

It may seem counterintuitive to create system-level stability through component-level instability. Even so, it may be the best way to get back to a known good state.

Restart fast and reintegrate.

The key to crashing well is getting back up quickly. Otherwise you risk loss of service when too many components are bouncing. Once a component is back up, it should be reintegrated automatically.

Isolate components to crash independently.

Use Circuit Breakers to isolate callers from components that crash. Use supervisors to determine what the span of restarts should be. Design your supervision tree so that crashes are isolated and don’t affect unrelated functionality.

Don’t crash monoliths.

Large processes with heavy runtimes or long startups are not the right place to apply this pattern. Applications that couple many features into a single process are also a poor choice.
