The Recovery Challenge

Proper error handling and recovery is the Achilles’ heel of many applications. When an application fails to perform a particular operation, you should recover from it and restore the system (that is, the collection of interacting services and clients) to a consistent state, usually the state the system was in before the operation that caused the error took place. Operations that can fail typically consist of multiple, potentially concurrent, smaller steps. Some of those steps can fail while others succeed. The problem with recovery is the sheer number of partial-success and partial-failure permutations that you have to code against. For example, an operation comprising 10 smaller concurrent steps has more than three and a half million recovery scenarios, because for the recovery logic the order in which the suboperations fail matters as well, and the factorial of 10 is 3,628,800.
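The factorial figure above is easy to check. The following is a minimal Python sketch, not part of the original text; the function name failure_orderings and the three-step sample are hypothetical and serve only to illustrate the count. It counts only the orders in which the steps can fail, as the paragraph does, and ignores which steps succeed.

```python
import math
from itertools import permutations


def failure_orderings(step_count):
    """Return the number of distinct orders in which step_count steps can fail."""
    return math.factorial(step_count)


# For a small case the orderings can be listed explicitly.
small_case = list(permutations(["step1", "step2", "step3"]))
print(len(small_case))        # 6 orderings for 3 steps
print(failure_orderings(10))  # 3628800 orderings for 10 steps
```

Even at three steps there are already six orderings to reason about, and the count grows factorially from there.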

Trying to handcraft recovery code in a decent-sized application is often futile. It results in fragile code that is very susceptible to any change in the application's execution or the business use case, and it incurs both productivity and performance penalties. The productivity penalty results from all the effort required to handcraft the recovery logic. The performance penalty is inherent in such an approach because you need to execute huge amounts of code after every operation to verify that all is well. In reality, developers tend to deal only with the easy recovery cases; that is, the cases that they are both aware of and know how to handle. More insidious error scenarios, such as intermittent network failures or disk crashes, go unaddressed. In addition, because recovery is all about restoring the system to a consistent state (typically the state before the operation), the real problem has to do with the steps that succeeded, rather than those that failed. The failed steps failed to affect the system; the challenge is actually the need to undo the successful steps, such as deleting a row from a table, removing a node from a linked list, or calling a remote service. The scenarios involved can be very complex, and your manual recovery logic is almost certain to miss a few successful suboperations.
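To make the point about undoing successful steps concrete, here is a minimal Python sketch of handcrafted recovery. It is an illustration only, not a recommended design: the names run_with_manual_recovery and demo, the in-memory list standing in for a table, and the simulated network failure are all assumptions made for this example.

```python
def run_with_manual_recovery(steps):
    """Run (do, undo) pairs in order; on failure, undo the steps that succeeded.

    steps is a list of (do, undo) callables. This is handwritten
    compensation logic, so it is only as complete as the undo
    functions supplied by the caller.
    """
    completed_undos = []
    try:
        for do, undo in steps:
            do()
            completed_undos.append(undo)
    except Exception:
        # Only the steps that succeeded need undoing; the failed step
        # never changed the system.
        for undo in reversed(completed_undos):
            undo()  # If an undo raises here, the state stays inconsistent.
        raise


def demo():
    rows = ["a", "b", "c"]  # hypothetical stand-in for a database table

    def delete_row():
        rows.remove("b")

    def restore_row():
        rows.insert(1, "b")

    def call_remote_service():
        raise ConnectionError("simulated network failure")

    def nothing_to_undo():
        pass

    try:
        run_with_manual_recovery([
            (delete_row, restore_row),
            (call_remote_service, nothing_to_undo),
        ])
    except ConnectionError:
        pass
    return rows


print(demo())  # ['a', 'b', 'c'] once the successful delete has been undone
```

Even this two-step case depends on every undo function being correct and never failing itself, and it does nothing about other parties that read the table between the delete and the restore.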

The more complex the recovery logic becomes, the more error-prone the recovery itself becomes. If you have an error in the recovery, how would you recover the recovery? How do developers go about designing, testing, and debugging complex recovery logic? How do they simulate the endless number of errors and failures that are possible? Not only that, but what if before the operation failed, as it was progressing along executing its suboperations successfully, some other party accessed your application and acted upon the state of the system—the state that you are going to roll back during the recovery? That other party is now acting on inconsistent information and, by definition, is in error too. Moreover, your operation may be just one step in some other, much wider operation that spans multiple services from multiple vendors on multiple machines. How would you recover the system as a whole in such a case? Even if you have a miraculous way of recovering your service, how would that recovery logic plug into the cross-service recovery? As you can see, it is practically impossible to write error-recovery code by hand.
