Chapter 8. Risk Mitigation

The mitigation column in the risk matrix shows which mitigations can be used, or are already being used, to reduce the severity, the likelihood, or both for a given risk. It is all about taking a High/High risk¹ and changing it to a High/Medium risk or a Medium/High risk.² It is not about fixing the risk, only about reducing its severity or likelihood.

As described in “Mitigation Plan”, there is a basic process that you can follow for mitigating risks. A mitigation plan details the steps you are going to take (either immediately or in the near future) in order to reduce the likelihood or severity of the risk.

Risk mitigation is knowing what to do when a problem occurs so that you can reduce the impact of the problem as much as possible. Mitigation is about making sure your application works as well and as completely as possible, even when services and resources fail.

Let’s look at an example of a mitigation plan. Let’s assume that we have a database that is used for an application, such as the one described in Chapter 5. Let’s further assume that we already run the database on high-quality hardware with replicated components, such as a RAID disk array and server-grade redundant hardware. We believe our database is highly stable and highly available. On our risk matrix, we list the risk of a database failure as having a Low likelihood.

However, the database is still a single point of failure. If the database server fails (unlikely though that is), our entire system goes out of service. On our risk matrix, we would list this as a High severity.

This risk is a Low/High risk, and is very similar to the risk described in “The Order Database: Low Likelihood, High Severity Risk”.

What can we do to mitigate this risk? Well, one idea is to add multiple active database read replicas, and have them available on hot standby, as shown in Figure 8-1. If our main database server ever fails, having an active database standby ready to go will dramatically reduce the amount of time our system is down while the problem is being fixed. This reduces the severity of the risk, perhaps even making it a Low/Medium risk.

This is a mitigation plan.

Figure 8-1. Example database hot standby for risk mitigation
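
The effect of this mitigation on the risk matrix can be sketched in a few lines of Python. This is only an illustration, not part of any real tool: the `Level`, `Risk`, and `mitigate` names are hypothetical, and the rating is written as likelihood/severity to match the Low/High notation used above.

```python
from dataclasses import dataclass, replace
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Risk:
    """One row of a risk matrix (hypothetical structure)."""
    name: str
    likelihood: Level
    severity: Level
    mitigation: str = ""

    def rating(self) -> str:
        # Written as likelihood/severity, for example "Low/High".
        return f"{self.likelihood.name.capitalize()}/{self.severity.name.capitalize()}"


def mitigate(risk: Risk, mitigation: str, *,
             likelihood: Optional[Level] = None,
             severity: Optional[Level] = None) -> Risk:
    """Return a copy of the risk with the mitigation recorded and values lowered."""
    new_likelihood = risk.likelihood if likelihood is None else likelihood
    new_severity = risk.severity if severity is None else severity
    # A mitigation may lower a value or leave it alone, but never raise it.
    if new_likelihood > risk.likelihood or new_severity > risk.severity:
        raise ValueError("a mitigation must not raise likelihood or severity")
    return replace(risk, likelihood=new_likelihood,
                   severity=new_severity, mitigation=mitigation)


database_failure = Risk("Database failure", Level.LOW, Level.HIGH)
with_standby = mitigate(database_failure, "Hot standby read replicas",
                        severity=Level.MEDIUM)
print(database_failure.rating(), "->", with_standby.rating())
```

Note that the risk itself is still in the matrix after mitigation. Only its rating has changed, which is the point of a mitigation as opposed to a fix.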

What’s the difference between risk mitigation and risk management? They are related but distinct concepts:

Risk mitigation

Risk mitigation is about reducing the impact of a risk by either reducing the likelihood that the risk will occur, or reducing the severity of the problem if the risk does occur.

Risk management

Risk management is understanding the interplay between removing risk and mitigating risk. It’s knowing whether it is prudent, timely, and cost effective to remove a risk or simply reduce the impact of the risk.

Recovery Plans

If a known risk does occur, you must deal with the consequences. You can use a recovery plan to create a known set of actions to take to deal with those consequences and repair the problem that the risk introduced. Recovery plans typically do not affect the likelihood of a risk, just its severity.

A recovery plan is a particular type of risk mitigation that specifically involves reducing the severity of the risk when it does occur. A recovery plan describes what you do if a known risk happens. A recovery plan can describe the following:

  • Actions to take to stop the problem as quickly as possible.

  • Actions to take to implement a workaround to reduce the impact of the problem.

  • Messages to inform customers of what the problem is and what they can do to reduce the impact on them.

  • Escalation processes to use and people within the company to inform about the problem. This lets all parts of the company understand and deal with the problem and any fallout.

A good recovery plan is constructed in advance, as part of the risk mitigation plan for a given risk, so that when a problem does occur (a risk is triggered) everyone knows what needs to happen to recover from the problem.

The recovery plan should contain:

  • The conditions that must be present to trigger the recovery plan.

  • The list of actors that need to be involved in implementing the recovery plan.

  • Step-by-step instructions for implementing the recovery plan, and which actor should execute those steps.

  • The managers and escalation contacts who need to be notified.

  • Required follow-up that must happen after the problem is resolved.
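
One way to keep these elements together is to record the plan as structured data and check it for gaps before it is ever needed. The following is a minimal sketch under that assumption: the `RecoveryPlan` and `Step` classes, their field names, and the sample plan are all hypothetical, and are not a format prescribed by this chapter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    actor: str        # who executes this step
    instruction: str  # what they do


@dataclass
class RecoveryPlan:
    risk: str
    trigger: str                 # what must be happening to start the plan
    actors: List[str]            # everyone involved in carrying it out
    steps: List[Step]            # ordered, step-by-step instructions
    notify: List[str] = field(default_factory=list)     # management and escalation contacts
    follow_up: List[str] = field(default_factory=list)  # required after resolution

    def problems(self) -> List[str]:
        """Return the gaps that would make the plan hard to execute in a crisis."""
        issues = []
        if not self.trigger.strip():
            issues.append("no trigger condition")
        if not self.steps:
            issues.append("no steps")
        for number, step in enumerate(self.steps, start=1):
            if step.actor not in self.actors:
                issues.append(f"step {number} is assigned to unknown actor {step.actor!r}")
        if not self.notify:
            issues.append("no one to notify")
        if not self.follow_up:
            issues.append("no follow-up defined")
        return issues


# Illustrative plan; every value here is made up for the example.
plan = RecoveryPlan(
    risk="Database failure",
    trigger="Primary database unreachable",
    actors=["on-call engineer", "database administrator"],
    steps=[
        Step("on-call engineer", "Confirm the primary is down"),
        Step("database administrator", "Promote the hot standby"),
        Step("support lead", "Post a customer status message"),
    ],
    notify=["engineering manager"],
)
print(plan.problems())
```

A check like this does not make a plan good, but it does catch a step that nobody owns or a plan with no follow-up before the crisis rather than during it.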

The recovery plan should be stored in a location that is well known to your team, that is, a place where everyone on your team will know to go during a crisis. This could be a support book or an internal support intranet. After a recovery plan is executed, a postmortem of the problem should occur and the recovery plan should be analyzed to determine if any improvements or changes are warranted.

Simply having a valid recovery plan for a specific risk item is itself a valid risk mitigation plan that you can use to reduce the severity of that risk.

Recovery Plan

The replication process described in Figure 8-1 is the beginning of a recovery plan for the risk of catastrophic database failure. However, to be a complete recovery plan, you would also need to include a process for implementing failover, criteria for determining when the failover can occur, an approval process for implementing the failover, and postmortem cleanup after the failover.
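
The failover criteria and approval step could be written down as an explicit decision rule. The sketch below is illustrative only: the `FailoverPolicy` thresholds, the replica-lag criterion, and the action names are assumptions introduced for the example, not criteria given in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPolicy:
    # Both thresholds are hypothetical values chosen for illustration.
    failed_checks_required: int = 3        # consecutive failed health checks
    max_replica_lag_seconds: float = 30.0  # standby must be this close to current


def failover_decision(policy: FailoverPolicy, failed_checks: int,
                      replica_lag_seconds: float, approved: bool) -> str:
    """Return the next action for the failover part of a recovery plan."""
    if failed_checks < policy.failed_checks_required:
        return "wait"              # criteria for failover not yet met
    if replica_lag_seconds > policy.max_replica_lag_seconds:
        return "escalate"          # standby too far behind to promote safely
    if not approved:
        return "request approval"  # criteria met, approval still outstanding
    return "fail over"


policy = FailoverPolicy()
print(failover_decision(policy, failed_checks=3, replica_lag_seconds=2.0, approved=False))
```

Writing the rule down this way forces the plan to state when failover is allowed and who has to approve it, instead of leaving both to be decided during the outage.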

Disaster Recovery Plans

A disaster recovery plan is a recovery plan that describes what the company should do if a specific type of disaster strikes. These types of disasters tend to have a severity of High but will typically have a likelihood of Low.

An example of a disaster that warrants a disaster recovery plan is the loss of one or more data centers for your application (whether caused by technical issues, a natural disaster, or a significant security breach).

You can create and manage disaster recovery plans just like recovery plans. The only real difference between a disaster recovery plan and a typical recovery plan is the seriousness of the risk they are mitigating and potentially the level of detail and involvement in implementing the plan.

Typically, disaster recovery plans have significantly more visibility within the company, including with its management and ownership. There may be preestablished, business-specified recovery times required for these types of disasters. None of this, however, fundamentally distinguishes them from other recovery plans.

Improving Our Risk Situation

Risk mitigation is an important process for improving the availability and scalability of our applications by reducing the impact that risk has on them. It is a recognition that although removing a risk might not be possible or practical, reducing its likelihood or severity might very well be possible, and is often sufficient to give us the level of application health we desire.

When used in conjunction with a risk matrix, risk mitigation plans provide a useful tool to improve the health of your application.

1 See Chapter 6 for a full description of severity, likelihood, the definition of High/High, and so on.

2 Or lower any other combination, such as Medium/High to Medium/Medium, or Medium/Low to Low/Low.
