Testing disaster recovery strategy

It is imperative to thoroughly test your recovery strategy end to end to ensure that it works and to iron out any kinks in the processes. However, testing for failures is hard, especially for sites or applications with very high request loads or complex functionality. Such web-scale applications usually comprise massive and rapidly changing datasets, complex interactions and request patterns, asynchronous requests, and a high degree of concurrency. Simulating complete and partial failures in such scenarios is a big challenge. At the same time, it is even more critical to regularly inject failures into such applications, under well-controlled circumstances, to ensure high availability for your end customers.

It is also important to be able to simulate increased latency or failed service calls. One of the biggest challenges in testing service-related failures is that service owners often do not know all of a service's dependencies, or the overall impact of a particular service failure. In addition, such dependencies are in constant flux. Under these circumstances, it is extremely challenging to test service failures in a live environment at scale. However, a well-thought-out approach can help you execute service failure test cases: identify the critical services in your application, take prior service outages into consideration, and build a good understanding of dependency interactions (or implement dynamic tracing of dependencies).
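One way to exercise such test cases is to wrap a dependency call so that a configurable fraction of requests fails or is delayed. The sketch below illustrates the idea; call_inventory_service, the failure rate, and the added latency are illustrative placeholders, not part of any specific application.

```python
import random
import time

def inject_faults(func, failure_rate=0.1, added_latency_s=2.0):
    """Wrap a service call so a fraction of requests fail or slow down.

    failure_rate and added_latency_s are illustrative values; tune them to
    match the failure modes you want to exercise.
    """
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            # Simulate a failed downstream call.
            raise RuntimeError("injected service failure")
        if random.random() < failure_rate:
            # Simulate a slow downstream call.
            time.sleep(added_latency_s)
        return func(*args, **kwargs)
    return wrapper

# Hypothetical dependency call, used only for illustration.
def call_inventory_service(item_id):
    return {"item_id": item_id, "in_stock": True}

# Replace the real call with the fault-injecting version during a test run.
call_inventory_service = inject_faults(call_inventory_service)
```

Because the wrapper is applied only during a test run, the same code path can be exercised with and without injected faults, which makes it easier to compare the application's behavior under both conditions.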

Simulating availability zone and/or region failures must be done with care, as you cannot shut down an entire Availability Zone or region. However, you can shut down the components in an AZ via the console, or use CloudFormation to shut down your resources at a region level. After shutting down the resources in the AZs of your primary region, you can launch instances in the secondary region (from the AMIs) to test the DR site's ability to take over. Another way to simulate region-level failures is to change the load balancer's security group settings to block traffic. In this case, the Route 53 health checks will start failing and the failover strategy can be exercised.
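The following boto3 sketch shows both approaches under stated assumptions: the region, Availability Zone, and security group ID are placeholders, and this should only ever be run against a test environment, never production, since it stops instances and blocks traffic.

```python
import boto3

# Placeholder values; substitute your own test region, AZ, and security group.
PRIMARY_REGION = "us-east-1"
TARGET_AZ = "us-east-1a"
ELB_SECURITY_GROUP_ID = "sg-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name=PRIMARY_REGION)

# 1. Stop every running instance in one Availability Zone of the primary region
#    to simulate an AZ-level failure.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "availability-zone", "Values": [TARGET_AZ]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    inst["InstanceId"]
    for res in reservations
    for inst in res["Instances"]
]
if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)

# 2. Alternatively, revoke all inbound rules on the load balancer's security
#    group so that Route 53 health checks fail and DNS failover is triggered.
sg = ec2.describe_security_groups(GroupIds=[ELB_SECURITY_GROUP_ID])["SecurityGroups"][0]
if sg["IpPermissions"]:
    ec2.revoke_security_group_ingress(
        GroupId=ELB_SECURITY_GROUP_ID,
        IpPermissions=sg["IpPermissions"],
    )
```

Saving the revoked IpPermissions (or the stopped instance IDs) lets you restore the original state once the failover test is complete.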

For a deeper understanding of testing your deployment against failures, refer to Netflix's Chaos Monkey (a tool that randomly terminates instances in production to test resiliency to instance failures).
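The essence of the idea can be illustrated in a few lines; this is not Chaos Monkey itself, just a minimal sketch that randomly terminates one instance from an explicitly tagged test group. The tag key/value and region are assumed placeholders.

```python
import random
import boto3

# Only instances opted in via a tag (placeholder: chaos-test=enabled) are candidates.
ec2 = boto3.client("ec2", region_name="us-east-1")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-test", "Values": ["enabled"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

candidates = [
    inst["InstanceId"]
    for res in reservations
    for inst in res["Instances"]
]

if candidates:
    # Terminate one randomly chosen instance to verify the fleet self-heals.
    victim = random.choice(candidates)
    ec2.terminate_instances(InstanceIds=[victim])
```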