What is Resiliency?

In a cloud architecture (and, more generally, in any IT system), many different events can cause your solution to fail. These failures vary widely in nature: they can be hardware-related (a server goes down, a disk is corrupted, and so on), network-related (network glitches), or data center-related (for example, a major outage affecting a data center or an entire Azure region).

When designing an architecture for the cloud, you need to keep in mind that your solution could be affected by failures and that you need to react to those failures as soon as possible. This is the concept of Resiliency.

For a system, Resiliency is, by definition, the ability to react to failures and continue to work. The word react is emphasized here because it is extremely important. You cannot avoid failures entirely (some of them are unpredictable and outside your control), but you need to plan a solution architecture that responds to failures quickly, keeps downtime as low as possible, and avoids data loss.

Designing a truly resilient application is not always easy: it requires careful planning for, and mitigation of, the different types of failures that could occur. You need to identify the possible failure points in your architecture and act on them, in order to reduce their number and limit their impact.

When planning for Resiliency of an architecture, there are three important metrics (business requirements) that you need to carefully evaluate:

Recovery time objective (RTO): The maximum acceptable time that an application can be unavailable after a failure
Recovery point objective (RPO): The maximum duration of data loss that is acceptable during a failure
Mean time to recover (MTTR): The average time needed to restore the application after a failure
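
To make these three metrics more concrete, here is a minimal sketch that computes the MTTR from a list of past incidents and checks each incident against an RTO and an RPO. All the values, the incident log, and the function names (mean_time_to_recover, violations) are assumptions invented for this illustration; they do not come from any Azure tool.

```python
from datetime import timedelta
from statistics import mean

# Hypothetical business targets (assumptions for illustration only).
RTO = timedelta(hours=1)      # maximum acceptable outage after a failure
RPO = timedelta(minutes=15)   # maximum acceptable window of lost data

# Hypothetical incident log: (outage duration, window of data lost).
incidents = [
    (timedelta(minutes=20), timedelta(minutes=5)),
    (timedelta(minutes=90), timedelta(minutes=10)),
    (timedelta(minutes=40), timedelta(minutes=30)),
]

def mean_time_to_recover(incidents):
    """Average outage duration across all recorded incidents."""
    return timedelta(seconds=mean(outage.total_seconds() for outage, _ in incidents))

def violations(incidents, rto, rpo):
    """Return (index, breached_rto, breached_rpo) for incidents that missed a target."""
    return [
        (i, outage > rto, data_loss > rpo)
        for i, (outage, data_loss) in enumerate(incidents)
        if outage > rto or data_loss > rpo
    ]

mttr = mean_time_to_recover(incidents)
print("MTTR:", mttr)
print("Violations:", violations(incidents, RTO, RPO))
```

In this sample data, the second incident exceeds the RTO and the third exceeds the RPO, even though the average recovery time stays below the RTO. This is why the three metrics should be evaluated separately.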

The value you set for each of these metrics will drive how you plan your architecture. You need to carry out a careful failure analysis and plan a solution that mitigates the failures you identify. A useful checklist for Azure architectures is available here: https://docs.microsoft.com/en-us/azure/architecture/resiliency/failure-mode-analysis.

Resiliency can also be improved by carefully evaluating the Service Level Agreement (SLA) of each piece of your architecture (workloads).
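
One way to evaluate SLAs across an architecture is to combine them. As a rough sketch, and assuming that components fail independently of each other, the availability of a chain of components that must all be up is the product of their individual SLAs, while independent alternatives (where one working path is enough) combine through their failure probabilities. The function names and the sample SLA values below are assumptions chosen for illustration.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24

def composite_sla(slas):
    """Availability of a chain where every component must be up."""
    return prod(slas)

def redundant_sla(slas):
    """Availability when any one of several independent alternatives is enough."""
    return 1 - prod(1 - s for s in slas)

def allowed_downtime_hours(sla):
    """Downtime per year that an SLA still permits."""
    return (1 - sla) * HOURS_PER_YEAR

# Hypothetical SLA values, not taken from any real service agreement.
web, database = 0.9995, 0.9999

print(f"Chain of web + database: {composite_sla([web, database]):.6f}")
print(f"Two redundant web servers: {redundant_sla([web, web]):.8f}")
print(f"Downtime allowed by a 95% SLA: {allowed_downtime_hours(0.95):.0f} hours per year")
```

Note that a chain is always less available than its weakest component, while adding an independent alternative path raises the overall figure.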

For example, imagine a business scenario where n remote applications send sales orders to a cloud-based web server (Azure VM) that collects them. Your business requirements dictate that the application must always be available, so that no orders coming from the remote sites are missed.

As a first step, you can design the solution architecture as shown in the following diagram:

Here, the remote site sends a sales order directly to the central website (Azure VM), which could have, for example, an SLA of 95%. If the connection to the central website fails for some reason, order transmission is interrupted and the remote sites are blocked. This could be a problem for your business scenario.

You could improve the Resiliency of your architecture by modifying the solution as shown in the following diagram:

Here, the remote site sends an order to the central website (cloud VM). If the order transmission fails for some reason, the remote site redirects the order to an Azure Queue (which has a high SLA). There, the order is queued (so no data is lost) and it is forwarded to the central website when it is available again. In this second scenario, you don't have data loss and you can guarantee improved business continuity.
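
The fallback behavior of this second scenario can be sketched as follows. This is only an illustration under stated assumptions: the in-memory deque stands in for the Azure Queue, FakeCentralSite simulates the central website, and every class and method name here (OrderForwarder, submit, drain, CentralSiteUnavailable) is hypothetical and not part of any Azure SDK.

```python
from collections import deque

class CentralSiteUnavailable(Exception):
    """Raised when the (simulated) central website cannot be reached."""

class FakeCentralSite:
    """Stand-in for the central website; toggle `up` to simulate an outage."""
    def __init__(self):
        self.up = True
        self.received = []

    def receive(self, order):
        if not self.up:
            raise CentralSiteUnavailable("central website unreachable")
        self.received.append(order)

class OrderForwarder:
    """Sends orders directly and parks them in a queue when delivery fails."""
    def __init__(self, send_to_central, fallback_queue=None):
        self._send = send_to_central
        # In-memory stand-in for a durable queue service.
        self._queue = fallback_queue if fallback_queue is not None else deque()

    @property
    def pending(self):
        return len(self._queue)

    def submit(self, order):
        """Try direct delivery first; on failure, queue the order so it is not lost."""
        try:
            self._send(order)
            return "delivered"
        except CentralSiteUnavailable:
            self._queue.append(order)
            return "queued"

    def drain(self):
        """Replay queued orders in arrival order; stop at the first failure."""
        delivered = 0
        while self._queue:
            try:
                self._send(self._queue[0])
            except CentralSiteUnavailable:
                break
            self._queue.popleft()  # remove only after a successful delivery
            delivered += 1
        return delivered

site = FakeCentralSite()
forwarder = OrderForwarder(site.receive)

site.up = False
print(forwarder.submit({"order_id": 1}))   # central site is down, so the order is queued
site.up = True
print(forwarder.drain(), "queued order(s) forwarded")
```

An order is removed from the queue only after the central website has accepted it, so a second outage during the replay does not lose data. A real implementation would also need a durable queue and a way to deal with duplicate deliveries.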
