Design patterns for resiliency

Resiliency is the ability of a system to gracefully handle and recover from failures. It is one of the most important factors to consider when designing services, so that they can recover from high load or from the failure of internal or external components under any condition. The following patterns help build resilient services:

  • Building services: Build the application as microservices, following the microservice design principles discussed in this book.
  • Retry logic: Retry logic helps an application handle anticipated, temporary failures. If an endpoint call or a transaction fails, the application retries the same operation so that it can recover from a transient issue.
  • Supervisor agent: Install a supervisor-like service that continuously monitors your applications, daemons, and services and restarts them if they fail. Supervisor (supervisord) is a free tool that companies use to recover their services, because it lets you configure monitoring of multiple services in a simple configuration file.
  • Health monitoring: Each application or service should expose its health status so that an external tool can monitor it and take action, both to notify operators and to recover automatically. Most web applications expose a health page such as /health, /health.php, or /health.aspx that is polled continuously by systems like AWS ELB, which use these checks to confirm the instant status of each registered node before forwarding traffic to it. External monitoring systems such as Runscope, AlertSite, and Webmetric look for specific strings or an HTTP 200 status code to determine the health of your application.
  • Circuit breaker: To protect your application from cascading failures, implement logic that either blocks traffic or sends a delay message to the connecting application, shielding your downstream components from load they cannot handle. For example, if you know that you are hitting your peak scalability limit and cannot handle more traffic, you can send a delay message back to your agents when you provide those agents yourself; when the agents are not yours, you simply block such traffic.
  • Queue-based system: To handle temporary failures or asynchronous load, use a queue-based mechanism to buffer incoming traffic. This keeps your service from failing under bursts and enables parallel processing without spiking your microservices' CPU or memory. RabbitMQ, an implementation of AMQP, is widely used in the industry for such queuing systems, with multiple microservices consuming messages from queues bound to exchanges.
  • Compensating transaction: In many situations your logic can fail partway through execution, so you should implement functionality that gracefully rolls the transaction back to its original state by undoing the steps that already completed.
  • Master or leader election: In many distributed or microservices systems, one infrastructure component assumes a master or leader role so that it can coordinate and manage the rest of the cluster. For example, Docker Swarm uses manager nodes, and Kubernetes uses a master node, to manage the worker nodes.
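The retry logic described above can be sketched in Python as follows. This is a minimal illustration, not a specific library's API: `TransientError` is a hypothetical exception type standing in for whatever your client raises on timeouts or temporary errors, and the backoff parameters are arbitrary defaults. Delays grow exponentially between attempts, with random jitter so that many clients do not retry in lockstep.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 503, ...)."""


def retry(operation: Callable[[], T], max_attempts: int = 4,
          base_delay: float = 0.1, max_delay: float = 2.0) -> T:
    """Call `operation`, retrying transient failures with capped exponential backoff.

    Any exception other than TransientError propagates immediately: retrying a
    permanent failure (bad input, auth error) only adds load.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure to the caller
            # The delay doubles each attempt (0.1s, 0.2s, 0.4s, ...) up to max_delay;
            # "full jitter" sleeps a random time in [0, delay] so clients spread out.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
    raise RuntimeError("unreachable")
```

A call such as `retry(lambda: fetch_order(42))`, where `fetch_order` is a hypothetical client function, would be attempted up to four times before the error reaches the caller. Retries are only safe for idempotent operations, or for operations the downstream service can deduplicate.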
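The core idea behind a supervisor agent is a loop that restarts a process when it exits unexpectedly. The Python sketch below illustrates that loop only; it is not how supervisord is implemented. In supervisord itself, this behavior is configured per program (for example with the `autorestart` option) rather than written as code. The function name `supervise` and its parameters are hypothetical.

```python
import subprocess
import sys
import time
from typing import List


def supervise(command: List[str], max_restarts: int = 3, backoff: float = 0.5) -> int:
    """Run `command` and restart it whenever it exits with a non-zero code.

    Returns how many times the process was started. A clean exit (code 0) ends
    supervision; repeated crashes stop after `max_restarts` restarts.
    """
    starts = 0
    while True:
        starts += 1
        exit_code = subprocess.call(command)  # blocks until the child exits
        if exit_code == 0:
            return starts  # normal termination: nothing to recover
        if starts > max_restarts:
            print(f"giving up after {starts} starts", file=sys.stderr)
            return starts
        print(f"exited with code {exit_code}; restarting", file=sys.stderr)
        time.sleep(backoff)  # pause so a crash-looping program cannot spin the CPU
```

For instance, supervising `[sys.executable, "-c", "import sys; sys.exit(1)"]` with `max_restarts=2` starts the crashing program three times and then gives up. The cap on restarts matters: without it, a program that fails at startup would be restarted forever.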
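Both sides of health monitoring can be sketched with the Python standard library: a service that exposes `/health`, and an external probe that treats the service as healthy only if it returns HTTP 200 and the body contains an expected string, which is the kind of check the monitoring tools above perform. `dependencies_ok`, `start_health_server`, `check_health`, and the JSON body format are assumptions made for this illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def dependencies_ok() -> bool:
    """Hypothetical hook: check the database, caches, downstream APIs, etc."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /health: 200 with {"status": "UP"} when healthy, 503 otherwise."""

    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        healthy = dependencies_ok()
        body = json.dumps({"status": "UP" if healthy else "DOWN"}).encode()
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        pass  # keep the example's console output quiet


def start_health_server(host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    """Serve the health endpoint on a background thread (port 0 picks a free port)."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def check_health(url: str, expected: str = "UP", timeout: float = 2.0) -> bool:
    """External probe: healthy only on HTTP 200 with `expected` in the body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and expected in body
    except OSError:  # URLError, HTTPError (non-2xx), and timeouts derive from OSError
        return False
```

Returning 503 from the endpoint when a dependency is down lets a load balancer stop routing traffic to that node without any extra integration.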
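A circuit breaker is commonly modeled with three states: closed (calls pass through), open (calls are rejected immediately so the struggling downstream component gets no extra load), and half-open (after a cool-down, trial calls test whether the downstream has recovered). The Python sketch below is a minimal, single-threaded illustration under those assumptions; the class names, thresholds, and injectable `clock` parameter are hypothetical. The point where `CircuitOpenError` is raised is where a service would either send a delay ("retry later") message or simply block the request, as described above.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream service."""


class CircuitBreaker:
    """Minimal closed/open/half-open circuit breaker (not thread-safe)."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF_OPEN"  # cool-down elapsed: let trial calls through
        return "OPEN"

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "OPEN":
            # Fail fast: the caller can turn this into a delay message or a block.
            raise CircuitOpenError("downstream unavailable, try again later")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Trip on too many consecutive failures, or on a failed half-open trial.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each downstream client in its own breaker keeps one failing dependency from tying up threads and connections that other requests need.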
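The queue-based approach can be illustrated in-process with Python's standard `queue` module. This is a stand-in for the idea, not for RabbitMQ or AMQP: a bounded queue absorbs bursts and applies backpressure, while a fixed pool of workers processes jobs in parallel, so resource usage stays capped no matter how much work arrives. The function `process_with_workers` and its parameters are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, List, Tuple

_STOP = object()  # sentinel telling a worker to exit


def process_with_workers(jobs: Iterable[int], handler: Callable[[int], int],
                         workers: int = 4,
                         max_pending: int = 100) -> Tuple[List[int], List[int]]:
    """Feed jobs through a bounded queue drained by a fixed pool of worker threads.

    Returns (results, failed_jobs); results arrive in completion order.
    """
    # put() blocks while max_pending jobs are waiting: backpressure instead of
    # unbounded memory growth when producers outpace consumers.
    tasks: queue.Queue = queue.Queue(maxsize=max_pending)
    results: List[int] = []
    failed: List[int] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = tasks.get()
            if item is _STOP:
                return
            try:
                value = handler(item)
            except Exception:
                with lock:
                    failed.append(item)  # a broker-backed consumer might requeue instead
            else:
                with lock:
                    results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for job in jobs:
        tasks.put(job)
    for _ in threads:
        tasks.put(_STOP)  # one sentinel per worker
    for t in threads:
        t.join()
    return results, failed
```

With a message broker such as RabbitMQ, the same roles are played by producers publishing to an exchange and by several consumer services reading from the bound queues; the broker also persists messages, which an in-memory queue cannot.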
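A compensating transaction pairs each step with an action that semantically undoes it. When a later step fails, the compensations of the completed steps run in reverse order. The sketch below assumes steps are plain callables; `Step`, `run_with_compensation`, and the order-processing step names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]  # semantically undoes `action`


def run_with_compensation(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo the completed steps in reverse order.

    Returns True if every step succeeded, False if the work was rolled back.
    """
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Undo newest-first so each compensation sees the state its action left.
            # Compensations should be idempotent so they can be retried safely.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```

For example, an order flow might reserve stock, then charge a card: if the charge fails, the stock reservation is released. Unlike a database rollback, compensation cannot hide intermediate states from other services, so each step must tolerate being undone.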
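A toy form of leader election treats nodes that have sent a recent heartbeat as live and picks the live node with the lowest ID as leader. This sketch assumes every node sees the same heartbeat table, which real clusters cannot take for granted; production systems agree on a leader through a consensus protocol such as Raft (used, for example, by Docker Swarm managers and by etcd). `LeaderElector` and its timeout are hypothetical.

```python
from typing import Dict, Optional


class LeaderElector:
    """Toy leader selection: the live node with the lowest ID is the leader."""

    def __init__(self, timeout: float = 5.0) -> None:
        self.timeout = timeout  # seconds without a heartbeat before a node is presumed dead
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node_id: str, at: float) -> None:
        """Record that `node_id` was alive at time `at`."""
        self.last_seen[node_id] = at

    def leader(self, now: float) -> Optional[str]:
        """Return the current leader, or None if no node is live."""
        live = [n for n, seen in self.last_seen.items() if now - seen <= self.timeout]
        return min(live) if live else None
```

When the current leader stops sending heartbeats, the next-lowest live node takes over once the timeout passes; the timeout trades failover speed against the risk of deposing a leader that is merely slow.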