Isolating the failure

If you can take only one thing from this chapter, take this: always design your system so that a failure in one service or business area does not propagate to other areas. In short, isolate the failure.

Years ago, plugging a faulty electronic device in at home could cause a house-wide power outage, or even a short circuit or fire. Then came Miniature Circuit Breakers (MCBs): if something goes wrong, only the MCB that protects the circuit where the faulty device is plugged in trips. This is a good example of isolating the failure; a problem in one area cannot impact the other areas. Not surprisingly, there is a failure-handling pattern for services called the circuit breaker pattern, which we will discuss later in this chapter in the Handling the failure section.

The point I want to emphasize here is that there should be no single point of failure in the system: a failure in one service should not impact other services. A simple example of a single point of failure is depending on a single database server; if that server is down, the whole system is down. The idea behind Microservices is that each service should be treated as an independent unit.

The following scenario depicts a case where, due to a failed common database node, all the services dependent on it have failed too:

Clearly, this design does not implement the Microservices architecture properly and forfeits the benefits of using Microservices. Microservices are supposed to be built independently, so that a failure in one does not cause failures in the others.

Let's revisit the problem with an updated design:

We can see that, when services are properly isolated, a failure impacts only a single service. In the preceding diagram, all four services are isolated, each with its own database, so a failure in the database for service three impacts only service three; the other services keep working smoothly.
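The difference between the two designs can be simulated in a few lines. The following is a minimal Python sketch, not a real database client: the Database and Service classes are hypothetical stand-ins, and a database "failure" is modeled simply by setting its up flag to False. It builds both the shared-database design and the database-per-service design, fails one node in each, and reports which services still work.

```python
class Database:
    """Hypothetical stand-in for a database node that can be marked down."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def query(self):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return "ok"


class Service:
    """Hypothetical service whose health depends only on its own data store."""

    def __init__(self, name, db):
        self.name = name
        self.db = db

    def handle(self):
        try:
            self.db.query()
            return "up"
        except ConnectionError:
            return "down"


def status(services):
    return {s.name: s.handle() for s in services}


# Design 1: all four services share one database node (single point of failure).
shared = Database("shared-db")
shared_design = [Service(f"service-{i}", shared) for i in range(1, 5)]

# Design 2: every service owns its database.
isolated_design = [Service(f"service-{i}", Database(f"db-{i}")) for i in range(1, 5)]

shared.up = False                 # the common database node fails
isolated_design[2].db.up = False  # only service three's database fails

print(status(shared_design))
# {'service-1': 'down', 'service-2': 'down', 'service-3': 'down', 'service-4': 'down'}
print(status(isolated_design))
# {'service-1': 'up', 'service-2': 'up', 'service-3': 'down', 'service-4': 'up'}
```

The only structural change between the two designs is which Database object each Service holds; that single dependency decides whether one failure stays local or takes down everything.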

Failure isolation needs to be handled at multiple levels. First, it must be handled at the service level: if one service is buggy, say the search service of an e-commerce site, it should not impact checkout or the catalog view. Second, it must be handled at the service instance level: if we run five instances of the search service and one of them goes down, the other four should keep working unaffected.
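Both levels can be illustrated with one small Python sketch. It assumes a hypothetical setup in which the catalog page embeds search results; Instance, RoundRobinPool, and catalog_page are illustrative names, and the healthy flag stands in for a real health check. At the instance level, the pool skips an instance that fails and tries the next one. At the service level, the catalog page catches a total search outage and renders without search results instead of failing.

```python
import itertools


class Instance:
    """One running copy of the search service (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def search(self, query):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return [f"{query} result from {self.name}"]


class RoundRobinPool:
    """Instance-level isolation: spread calls and skip failing instances."""

    def __init__(self, instances):
        self.instances = instances
        self._cycle = itertools.cycle(instances)

    def call(self, query):
        # Try each instance at most once per call before giving up.
        for _ in range(len(self.instances)):
            instance = next(self._cycle)
            try:
                return instance.search(query)
            except ConnectionError:
                continue  # this instance is down; try the next one
        raise ConnectionError("all search instances are down")


def catalog_page(pool, query):
    """Service-level isolation: the catalog renders even if search fails entirely."""
    page = {"catalog": ["item-1", "item-2"]}  # hypothetical catalog data
    try:
        page["search"] = pool.call(query)
    except ConnectionError:
        page["search"] = []  # degrade gracefully instead of failing the page
    return page


instances = [Instance(f"search-{i}") for i in range(1, 6)]
pool = RoundRobinPool(instances)

instances[0].healthy = False  # one of the five instances goes down
print(catalog_page(pool, "shoes")["search"])
# ['shoes result from search-2']

for inst in instances:        # now the whole search service is down
    inst.healthy = False
print(catalog_page(pool, "shoes"))
# {'catalog': ['item-1', 'item-2'], 'search': []}
```

A production system would typically delegate the instance-level part to a load balancer or service discovery with health checks; the sketch only shows the principle that neither a dead instance nor a dead service is allowed to break its callers.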

When you start building an application, you need to design for isolation. Make sure that even when a problem occurs, it stays contained and does not impact the system as a whole; in short, avoid cascading failures. The age-old principle of Low Coupling and High Cohesion helps here: only features that are closely related and belong together should be grouped together; everything else should be kept separate.

We will now look at concrete patterns and techniques for isolating a failure.
