Chapter 5. What Is Risk Management?

All complex systems have risk. It is an inevitable part of any system, and it is impossible to remove all of it from something as complex as a web application. However, examining your risk and determining how much of it is acceptable is important in keeping your system healthy.

This chapter provides an overview of what risk is and how we can identify it. It then introduces a process called risk management, which helps us to reduce the effect of risk on our applications.

Let’s now take a look at Example 5-1, which revisits the big game example from Chapter 1.

Example 5-1. Risk management of the big game

Here’s a brief synopsis of the big game example we looked at earlier: it’s Sunday, the day of the big game. You’ve invited friends over to watch it on your new TV. The game is about to start. And…the lights go out and the TV goes dark. The game, for you and your friends, is over. You call the power company, and they say, “We’re sorry, but we only guarantee 95% availability of our power grid.”

The power company in this example is taking a risk. They are risking that the power won’t go off during a big game.

They have even quantified the risk: they guarantee that the power will be on 95% of the time.
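To make that figure concrete, the arithmetic below converts an availability percentage into the downtime it permits. This is only an illustrative sketch: the function name downtime_hours and the one-year measurement window are assumptions, not part of the power company's guarantee.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years for simplicity


def downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at the given availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (100 - availability_pct) / 100


hours = downtime_hours(95)
print(f"{hours:.0f} hours (~{hours / 24:.2f} days) per year")
# prints "438 hours (~18.25 days) per year"
```

Seen this way, a 95% guarantee leaves room for more than two weeks without power each year, which is plenty of time to miss a big game.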

The power company knows the types of things that can cause power to go out, such as a power line breaking. To reduce the chance that power lines break, they will typically:

  • Bury them (to protect them from wind)

  • Harden them (to reduce the chance a wind storm can blow them down)

  • Put in redundant power systems (so one keeps working even if another is down)

But these strategies have a cost. Is it worthwhile investing in hardened power lines? Is it worth the cost to bury them? Is the cost of the risk worth the investment in reducing the risk? These types of questions are risk management questions, and these are the types of questions we will consider in this chapter.

Managing Risk

Risk management involves determining where the risk is within your system, determining which risks must be removed and which can remain, and then mitigating the remaining risks to reduce their likelihood and severity.

When a risk triggers (or occurs), you or your system suffers a loss. This loss can be data lost by your company or a customer. It can be your application becoming unavailable to your customers. It can be invalid or erroneous results. Any of these can result in your customers losing trust in your ability to manage their data and their business. This, ultimately, will cost you money.

However, you must weigh this loss against a competing aspect: what is the cost of removing the risk to prevent it from happening?

Risk management, then, is a matter of balancing the cost of removing a risk against the cost of having the risk occur.
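One rough way to frame that balance is to compare the expected yearly loss from a risk with the yearly cost of removing it. The sketch below is a deliberate simplification under stated assumptions: the function names, the yearly time frame, and every dollar amount are hypothetical, and harder-to-price losses such as customer trust are left out.

```python
def expected_annual_loss(occurrences_per_year: float, loss_per_occurrence: float) -> float:
    """Expected yearly cost of leaving a risk in place."""
    return occurrences_per_year * loss_per_occurrence


def worth_removing(occurrences_per_year: float,
                   loss_per_occurrence: float,
                   annual_cost_to_remove: float) -> bool:
    """True when the expected loss exceeds what it costs to remove the risk."""
    loss = expected_annual_loss(occurrences_per_year, loss_per_occurrence)
    return loss > annual_cost_to_remove


# Hypothetical figures: an outage that costs 50,000 each time it happens,
# and a fix that costs 30,000 per year to build and operate.
print(worth_removing(2, 50_000, 30_000))    # outage expected twice a year
print(worth_removing(0.1, 50_000, 30_000))  # outage expected once a decade
```

With these made-up numbers, the frequent outage justifies the fix and the rare one does not. In practice the inputs are estimates, so the comparison is best treated as a starting point for discussion rather than a verdict.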

Identify Risk

Your first step in managing risk is creating a list of all known risks, along with their severity and their likelihood of occurring.

We call this list a risk matrix, an example of which is shown in Figure 5-1.

Figure 5-1. Example risk matrix

Creating the matrix initially involves brainstorming. You can get ideas for what to put in your risk matrix from multiple sources:

  • Collective wisdom of the developers

  • Known high-support areas

  • Known threat vectors or vulnerabilities

  • Known areas where the system is incomplete or missing capabilities

  • Known poor performance areas

  • Known traffic spikes and patterns

  • Specific concerns from business owners, support personnel, or users

  • Known technical debt in your system

You will likely find that there are obvious entries in the list, but there should also be entries that surprise you. This is good. You want to uncover as many of your risks and vulnerabilities as possible, and if none of them come as a surprise to you, you probably haven’t dug deep enough.

Creating the risk matrix involves assigning prioritized values for the likelihood of the risk occurring and the impact (severity) of the problem caused if the risk does occur.
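As a minimal sketch, a risk matrix can be kept as a list of records, each with a likelihood and a severity. The three-level scale, the field names, and the values assigned to each entry are assumptions made for illustration; only the first risk description comes from the example matrix, and the second is invented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-point scale used for both likelihood and severity."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    description: str
    likelihood: Level     # how likely the risk is to occur
    severity: Level       # how bad the impact is if it does occur
    mitigation: str = ""  # filled in once a mitigation is identified


# The likelihood and severity values here are made up for the example.
risk_matrix = [
    Risk("Frontend fails if user identity service is down", Level.MEDIUM, Level.HIGH),
    Risk("Search results are slow during traffic spikes", Level.HIGH, Level.LOW),
]

for risk in risk_matrix:
    print(f"{risk.description}: "
          f"likelihood={risk.likelihood.name}, severity={risk.severity.name}")
```

A spreadsheet works just as well. What matters is that every entry records both values so that entries can be compared.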

We will discuss this list extensively in Chapter 7.

Remove Worst Offenders

After compiling your initial list, review it and identify the risk entries that are your worst unmitigated offenders. How do you know which risks are the worst offenders? Look for risks that occur often or risks that haven’t occurred yet but would cause serious problems to your system if they did. The absolute worst offenders are risks that are highly likely to occur or occur often and cause serious harm to your system. Chapter 6 discusses the difference between severity and likelihood, and how to use this information to help manage your risks. This information will help you find your worst offenders.

In Figure 5-1, an example risk that might be one of our worst offenders is “Frontend fails if user identity service is down.”

Once you’ve identified a few of the top offenders, add items to your roadmap to make sure these are addressed in a timely manner.
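One possible way to surface candidates is to score each risk by multiplying likelihood by severity and sorting on the result. This is a sketch of one scoring scheme, not a prescribed method: the 1 to 3 scales, the function name worst_offenders, and all of the values below are hypothetical.

```python
from typing import NamedTuple


class Risk(NamedTuple):
    description: str
    likelihood: int  # 1 (rare) to 3 (frequent); hypothetical scale
    severity: int    # 1 (minor) to 3 (serious); hypothetical scale


def worst_offenders(risks, count=3):
    """Return up to `count` risks with the highest likelihood x severity score."""
    return sorted(risks, key=lambda r: r.likelihood * r.severity, reverse=True)[:count]


# Illustrative entries; the scores are invented for the example.
risks = [
    Risk("Nightly report job occasionally runs late", 3, 1),
    Risk("Frontend fails if user identity service is down", 2, 3),
    Risk("Primary database loses data on disk failure", 1, 3),
]

for risk in worst_offenders(risks, count=2):
    print(risk.likelihood * risk.severity, risk.description)
```

A simple product treats a frequent minor risk the same as a rare serious one, so the ranked list is a prompt for team judgment rather than a replacement for it.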

Mitigate

For all risks, whether they are the worst offenders or not, brainstorm whether there are things you can do to either reduce the frequency or likelihood of the risk occurring, or reduce the severity of the problem if the risk does occur. These things are called risk mitigators.

Risk mitigators can be highly valuable. You are especially looking for mitigators that will reduce the risk (either severity or likelihood or both) yet are simple or inexpensive to implement.

Let’s take a look at the risk “Frontend fails if user identity service is down” shown in Figure 5-1. For this risk, a potential mitigation to consider is to cache user identity information so that some information may be available for the frontend to use, even if the user identity service is down.
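The caching mitigation described above might look something like the following sketch. Everything named here is hypothetical: the IdentityServiceDown exception, the CachedIdentityLookup class, and the stand-in fetch function are assumptions for illustration and do not correspond to any real identity service client.

```python
class IdentityServiceDown(Exception):
    """Raised by the (hypothetical) identity client when the service is unreachable."""


class CachedIdentityLookup:
    """Wraps an identity lookup and remembers the last good answer per user."""

    def __init__(self, fetch_identity):
        self._fetch_identity = fetch_identity  # callable: user_id -> dict
        self._cache = {}

    def get(self, user_id):
        try:
            identity = self._fetch_identity(user_id)
        except IdentityServiceDown:
            # Service is down: fall back to the last known value, or None.
            return self._cache.get(user_id)
        self._cache[user_id] = identity
        return identity


# Stand-in for a real service call, so the sketch can run on its own.
service = {"up": True}

def fetch_identity(user_id):
    if not service["up"]:
        raise IdentityServiceDown()
    return {"user_id": user_id, "name": "Example User"}

lookup = CachedIdentityLookup(fetch_identity)
lookup.get(42)          # service is up: the result is cached
service["up"] = False
print(lookup.get(42))   # service is down: served from the cache
print(lookup.get(7))    # never seen before: nothing to fall back on
```

The cache does not remove the risk. Users who have never been seen still get nothing, and cached entries can be stale, so a real implementation would likely need expiry rules. It does, however, reduce the severity of an outage for users the frontend already knows about.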

You can focus on your worst offenders, finding ways to reduce the severity of those risks. But also look at risks that you might not be able to fix any time soon. Finding mitigations for these risks that reduce their severity or likelihood can be nearly as valuable as fixing the risk altogether.

Review Regularly

The risk matrix can quickly become stale if you don’t review it regularly. You should review your risk matrix as a team at least quarterly, but perhaps monthly for very active and highly critical systems. Additionally, review it after each incident. Was the incident properly covered by a known risk?

When you review the matrix, first look for new risks that have been recently introduced or newly identified. Add new entries for these risks. Also, remove old entries for items that are no longer risks.

Then look for severity or likelihood changes. Often, mitigations were helpful and managed to reduce the severity or likelihood. Or, new information has come to light that makes a risk either more likely to occur or perhaps more severe. This is frequently the case if a risk actually triggered since your last review; you might feel that a risk marked as low likelihood that actually did occur should be restated as a risk with a higher likelihood. Next, ask whether there are risks that you can remove (fix) by putting them on your roadmap.

Finally, look for new or updated mitigations that you can put into play.

Managing Risk Summary

How do you manage risk in your systems? There are some basic steps to follow to accomplish it:

Identify risk

First, make a list of all the known risks in your system; this list is called a risk matrix. Prioritize the list.

Remove worst offenders

Find the biggest offenders in the list, and put a plan together to tackle them.

Mitigate

For the major risk items that you cannot remove, put together a mitigation plan to reduce the severity of the risk or the likelihood of it occurring.

Review regularly

Review your risk matrix regularly.
