Stopping Crack Propagation

Let’s see how the design of failure modes applies to the grounded airline from before. The airline’s Core Facilities project had not planned out its failure modes. The crack started at the improper handling of the SQLException, but it could have been stopped at many other points. Let’s look at some examples, from low-level detail to high-level architecture.

Because the pool was configured to block requesting threads when no resources were available, it eventually tied up all request-handling threads. (This happened independently in each application server instance.) The pool could have been configured to create more connections if it was exhausted. It also could have been configured to block callers for a limited time, instead of blocking forever when all connections were checked out. Either of these would have stopped the crack from propagating.
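The bounded-wait option can be sketched with a minimal, hypothetical pool wrapper. This is not the airline's actual pool; it uses a `BlockingQueue` from the standard library as a stand-in, and the `BoundedPool` class name and timeout value are illustrative assumptions. The key point is that `poll` with a timeout converts "block forever" into a failure the caller can handle.

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical resource pool with a bounded checkout wait.
// A caller that cannot get a resource in time fails fast
// instead of tying up its thread indefinitely.
class BoundedPool<T> {
    private final BlockingQueue<T> available;

    BoundedPool(Collection<T> resources) {
        this.available = new ArrayBlockingQueue<>(resources.size(), true, resources);
    }

    /** Returns a resource, or throws if none frees up within the limit. */
    T checkOut(long timeout, TimeUnit unit) throws InterruptedException {
        T resource = available.poll(timeout, unit);  // null on timeout, not a hang
        if (resource == null) {
            throw new IllegalStateException(
                "pool exhausted; timed out after " + timeout + " " + unit);
        }
        return resource;
    }

    void checkIn(T resource) {
        available.offer(resource);
    }
}
```

A request-handling thread using this pool still fails when the pool is exhausted, but it fails quickly and visibly, so the crack stops at the caller instead of consuming every thread in the server.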

At the next level up, a problem with one call in CF caused the calling applications on other hosts to fail. Because CF exposed its services as Enterprise JavaBeans (EJBs), it used RMI. By default, RMI calls will never time out. In other words, the callers blocked waiting to read their responses from CF’s EJBs. The first twenty callers to each instance received exceptions: a SQLException wrapped in an InvocationTargetException wrapped in a RemoteException, to be precise. After that, the calls started blocking.

The client could have been written to set a timeout on the RMI sockets. For example, it could have installed a socket factory that calls Socket.setSoTimeout on all new sockets it creates. CF could also have been built as an HTTP-based web service instead of EJBs; then the client could have set a timeout on its HTTP requests. The clients could also have been written so that blocked threads could be jettisoned, instead of making the external integration call on the request-handling thread itself. None of these were done, so the crack propagated from CF to all systems that used CF.
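The socket-factory remedy looks roughly like the following sketch. The class name and the ten-second timeout are assumptions for illustration; the mechanism, subclassing `java.rmi.server.RMISocketFactory` so that every client socket gets a read timeout, is the standard way to bound RMI calls.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

// Hypothetical factory that bounds how long RMI clients will
// block reading a response. With the default factory, reads
// can block forever; with this one, they fail with a
// SocketTimeoutException the caller can catch and handle.
public class TimeoutSocketFactory extends RMISocketFactory {
    private final int timeoutMillis;

    public TimeoutSocketFactory(int timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        socket.setSoTimeout(timeoutMillis);  // bound every read on this socket
        return socket;
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        return new ServerSocket(port);
    }

    // Install once, early in startup, before any RMI calls are made:
    //   RMISocketFactory.setSocketFactory(new TimeoutSocketFactory(10_000));
}
```

With this in place, a hung CF instance costs each caller at most one timeout interval per request rather than a thread held hostage indefinitely.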

At a still larger scale, the CF servers themselves could have been partitioned into more than one service group. That would have kept a problem within one of the service groups from taking down all users of CF. (In this case, all the service groups would have cracked in the same way, but that would not always be the case.) This is another way of stopping cracks from propagating into the rest of the enterprise.

Looking at even larger architecture issues, CF could’ve been built using request/reply message queues. In that case, the caller would know that a reply might never arrive. It would have to deal with that case as part of handling the protocol itself. Even more radically, the callers could have been searching for flights by looking for entries in a tuple space that matched the search criteria. CF would have to have kept the tuple space populated with flight records. The more tightly coupled the architecture, the greater the chance this coding error can propagate. Conversely, the less-coupled architectures act as shock absorbers, diminishing the effects of this error instead of amplifying them.
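What the request/reply style buys you can be shown in miniature. This sketch is hypothetical: it uses an in-memory `BlockingQueue` as a stand-in for a real message queue, and the `awaitReply` helper is an invented name. The point is that the protocol itself hands the caller a null when no reply arrives, so "the reply may never come" is a case the code must handle, not a silent hang.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical client-side reply handling for a request/reply
// protocol. The queue stands in for a real messaging system.
class ReplyClient {
    /** Waits a bounded time for a reply; returns null when none arrives. */
    static String awaitReply(BlockingQueue<String> replyQueue, long timeoutMillis)
            throws InterruptedException {
        String reply = replyQueue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        // A null reply is a normal protocol outcome, not an error the
        // transport hides: the caller degrades gracefully -- retry,
        // fall back, or report partial results -- instead of blocking.
        return reply;
    }
}
```

Because the missing-reply case is part of the interface, a failure inside the service shows up at the caller as an ordinary, handleable condition; the loose coupling absorbs the shock rather than transmitting it.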

Any of these approaches could have stopped the SQLException problem from spreading to the rest of the airline. Sadly, the designers had not considered the possibility of “cracks” when they created the shared services.
