Recovering from failure

So far, we have discussed isolating a failure and handling it gracefully. But what about fixing it? We cannot let the system remain in an error state forever. We need to bring our failed services back and make sure the system is in a healthy state again.

How exactly we recover from a failure depends on the system and on how much manual effort the fix requires. For example, we cannot, of course, fix a code issue on the fly. But we can take some steps to help speed up the recovery process.

The most important tool we have for recovering from a failure is monitoring. Proper monitoring tells us which services are facing issues. We need to monitor whether a service is responding correctly or returning error codes. We need to monitor logs for exceptions and errors. We need to monitor the hardware health of the nodes, such as memory and CPU usage. If a service is throwing more errors than an acceptable threshold allows, or if other parameters such as memory or CPU usage go beyond acceptable levels, the monitoring script can trigger an action. An action could be as simple as alerting the concerned teams by email, message, or escalation so that nodes can be restarted. Automated scripts can also decide whether we need to scale up the service by adding nodes. We will talk more about monitoring and scaling in the chapters dedicated to these topics: Chapter 8, Monitoring Microservices, and Chapter 6, Scaling Microservices.
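
The threshold check described above can be sketched in a few lines of Python. This is only an illustration, not the monitoring setup covered later in the book: the names ServiceMetrics, Thresholds, Action, and decide_actions are hypothetical, the threshold values are arbitrary, and the mapping from each breached threshold to an action is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Action(Enum):
    """Hypothetical actions a monitoring script might trigger."""
    NOTIFY_TEAM = "notify_team"            # email or message to the concerned team
    ESCALATE_RESTART = "escalate_restart"  # escalation asking for a node restart
    SCALE_UP = "scale_up"                  # add nodes to the service


@dataclass
class ServiceMetrics:
    """A snapshot of one service, as a monitoring script might collect it."""
    total_requests: int
    error_responses: int
    cpu_percent: float
    memory_percent: float


@dataclass
class Thresholds:
    """Acceptable limits; the default values are arbitrary assumptions."""
    max_error_rate: float = 0.05
    max_cpu_percent: float = 85.0
    max_memory_percent: float = 90.0


def error_rate(metrics: ServiceMetrics) -> float:
    """Fraction of requests that ended in an error (0.0 when idle)."""
    if metrics.total_requests == 0:
        return 0.0
    return metrics.error_responses / metrics.total_requests


def decide_actions(metrics: ServiceMetrics,
                   thresholds: Optional[Thresholds] = None) -> List[Action]:
    """Compare a snapshot against the thresholds and pick actions.

    Assumed policy: too many errors leads to a notification plus a
    restart escalation; exhausted CPU or memory leads to a notification
    plus a scale-up.
    """
    limits = thresholds if thresholds is not None else Thresholds()
    actions: List[Action] = []

    if error_rate(metrics) > limits.max_error_rate:
        actions.append(Action.NOTIFY_TEAM)
        actions.append(Action.ESCALATE_RESTART)

    overloaded = (metrics.cpu_percent > limits.max_cpu_percent
                  or metrics.memory_percent > limits.max_memory_percent)
    if overloaded:
        if Action.NOTIFY_TEAM not in actions:
            actions.append(Action.NOTIFY_TEAM)
        actions.append(Action.SCALE_UP)

    return actions
```

A scheduler could call decide_actions on every collected snapshot and hand the returned list to whatever alerting or orchestration tooling is in place. A healthy snapshot yields an empty list, so nothing is triggered.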
