Fail Fast

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Fail Fast

If slow responses are worse than no response, the worst must surely be a slow failure response. It’s like waiting through the interminable line at the DMV, only to be told you need to fill out a different form and go back to the end of the line. Can there be any bigger waste of system resources than burning cycles and clock time only to throw away the result?

If the system can determine in advance that it will fail at an operation, it’s always better to fail fast. That way, the caller doesn’t have to tie up any of its capacity waiting and can get on with other work.

How can the system tell whether it will fail? Do we need Deep Learning? Don’t worry, you won’t need to hire a cadre of data scientists.

It’s actually much more mundane than that. There’s a large class of “resource unavailable” failures. For example, when a load balancer gets a connection request but not one of the servers in its service pool is functioning, it should immediately refuse the connection. Some configurations have the load balancer queue the connection request for a while in the hopes that a server will become available in a short period of time. This violates the Fail Fast pattern.

The application or service can tell from the incoming request or message roughly what database connections and external integration points will be needed. The service can quickly check out the connections it will need and verify the state of the circuit breakers around the integration points. This is sort of the software equivalent of the chef’s mise en place—gathering all the ingredients needed to perform the request before it begins. If any of the resources are not available, the service can fail immediately, rather than getting partway through the work.

One of my more interesting projects was for a studio photography company. Part of the project involved working on the software that rendered images for high-resolution printing. The previous generation of this software had a problem that generated more work for humans downstream: if color profiles, images, backgrounds, or alpha masks weren’t available, it “rendered” a black image full of zero-valued pixels. This black image went into the printing pipeline and was printed, wasting paper, chemicals, and time. Quality checkers would pull the black image and send it back to the people at the beginning of the process for diagnosis, debugging, and correction. Ultimately, they would fix the problem (usually by calling developers to the printing facility) and remake the bad print. Since the order was already late getting out the door, they would expedite the remake—meaning it interrupted the pipeline of work and went to the head of the line.

When my team started on the rendering software, we applied the Fail Fast pattern. As soon as the print job arrived, the renderer checked for the presence of every font (missing fonts caused a similar remake, but not because of black images), image, background, and alpha mask. It preallocated memory, so it couldn’t fail an allocation later. The renderer reported any such failure to the job control system immediately, before it wasted several minutes of compute time. Best of all, “broken” orders would be pulled from the pipeline, avoiding the case of having partial orders waiting at the end of the process. Once we launched the new renderer, the software-induced remake rate dropped to zero. Orders could still be remade because of other quality problems—dust in the camera, poor exposure, or bad cropping—but at least our software wasn’t the cause.

The only thing we didn’t preallocate was disk space for the final image. We violated “steady state” under the direction of the customer, who indicated that he had his own rock-solid purging process. Turns out the “purging process” was one guy who occasionally deleted a bunch of files by hand. Less than one year after we launched, the drives filled up. Sure enough, the one place we broke the Fail Fast principle was the one place our renderer failed to report errors before wasting effort. It would render images—several minutes of compute time—and then throw an exception.

Another way to fail fast in a web application is to perform basic parameter-checking in the servlet or controller that receives the request, before talking to the database. This would be a good reason to move some parameter checking out of domain objects into something like a “Query object.”

Even when failing fast, be sure to report a system failure (resources not available) differently than an application failure (parameter violations or invalid state). Reporting a generic “error” message may cause an upstream system to trip a circuit breaker just because some user entered bad data and hit Reload three or four times.

The Fail Fast pattern improves overall system stability by avoiding slow responses. Together with timeouts, failing fast can help avert impending cascading failures. It also helps maintain capacity when the system is under stress because of partial failures.

Remember This

Avoid Slow Responses and Fail Fast.: If your system cannot meet its SLA, inform callers quickly. Don’t make them wait for an error message, and don’t make them wait until they time out. That just makes your problem into their problem.
Reserve resources, verify Integration Points early.: In the theme of “don’t do useless work,” make sure you’ll be able to complete the transaction before you start. If critical resources aren’t available—for example, a popped Circuit Breaker on a required callout—then don’t waste work by getting to that point. The odds of it changing between the beginning and the middle of the transaction are slim.
Use for input validation.: Do basic user input validation even before you reserve resources. Don’t bother checking out a database connection, fetching domain objects, populating them, and calling validate just to find out that a required parameter wasn’t entered.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Fail Fast

Create new playlist

Sign In

Sign Up

Fail Fast

Remember This

Table of Contents for
Fail Fast