Chapter 7. The Risk Matrix

The first step in managing risk is understanding the risk that is already in your system. Identifying, labeling, and prioritizing your known risks is what the risk matrix is all about.

First introduced in Chapter 5, the risk matrix is a critical aspect of managing the risk in your system. It is a table that contains a living view of the state of all the known risk in your system.

Figure 7-1 contains an example risk matrix.

Figure 7-1. Example risk matrix

Each row in the matrix represents a single, quantifiable risk that is present in your system. The columns in the spreadsheet contain the details of that specific risk item.

For each risk item the following information is kept:

Risk ID

This is a unique identifier assigned to the risk. It can be anything, but a unique integer identifier is usually the easiest and is sufficient.[1]

System

This is the name of the system, subsystem, or module that contains the risk. This information is dependent on the specifics of your application, but it could be things like “FrontEnd,” “PrimaryDb,” “ServiceA,” or similar.

Owner

The name of an individual (or team) who owns this risk and is responsible for mitigation plans and resolution plans.

Risk description

This is a summary description of the risk. It should be short enough to be easily scanned and recognized yet long enough to uniquely and accurately identify the risk.

Date identified

The date the risk was identified and added to the matrix.

Likelihood

This identifies the likelihood (low, medium, or high) of the risk occurring. This value is discussed in greater detail in Chapter 6. You will use this value to sort your risk matrix and determine which risks deserve the most concern and which require the most immediate attention.[2]

Severity

This is the severity or impact (low, medium, or high) of the risk if it occurs. This value is discussed in greater detail in Chapter 6. As with likelihood, you will use this value to sort your risk matrix and determine which risks deserve the most concern and which require the most immediate attention.

Mitigation plan

This column provides a description of any mitigations that can be used, or are being used, to reduce the severity or likelihood of this risk.

Status

This column indicates what the status of the risk is. This is typically something like “active,” “mitigated,” “fix in progress,” or “resolved.”

ETA

This is the estimated date by which the final resolution of this risk is planned (if known).

Monitoring

This column indicates whether you are monitoring for this risk to occur, and if so, the steps you’ve taken to accomplish this. If you are not monitoring the risk, you should indicate why and estimate a date for when you will be able to do so.

Triggered plan

If this risk does occur, what is your plan for dealing with it? The triggered plan is usually a management-level plan rather than an incident-response plan.[3]

Comment

Use this column for any other information about this risk that doesn’t fit or doesn’t belong in the risk description.

Additionally, other values that are important to your organization can be added as you see fit. For example:

Tracking ID

If you have a bug tracking or roadmap tracking system that contains an entry for this risk, you can put the bug or roadmap tracking ID number in this field.

History

Has this risk already triggered in the past? When? How often?
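For readers who keep the matrix in code rather than in a spreadsheet, the columns above can be sketched as a simple record type. This is only an illustration: the names `Level` and `RiskItem`, the field names, the default values, and the example row are all assumptions made for this sketch, not part of the downloadable template.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    """Low/Medium/High scale used for both likelihood and severity."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class RiskItem:
    """One row of the risk matrix; fields mirror the columns described above."""
    risk_id: int                 # stable identifier, not the spreadsheet row number
    system: str                  # e.g. "FrontEnd", "PrimaryDb", "ServiceA"
    owner: str                   # individual or team
    description: str
    date_identified: date
    likelihood: Level
    severity: Level
    mitigation_plan: str = ""
    status: str = "active"       # "active", "mitigated", "fix in progress", "resolved"
    eta: Optional[date] = None   # only if known
    monitoring: str = ""
    triggered_plan: str = ""
    comment: str = ""
    tracking_id: Optional[str] = None   # optional column
    history: str = ""                   # optional column


# Hypothetical example row, invented for illustration.
example = RiskItem(
    risk_id=1,
    system="PrimaryDb",
    owner="Data team",
    description="No tested restore procedure for database backups",
    date_identified=date(2024, 1, 15),
    likelihood=Level.LOW,
    severity=Level.HIGH,
)
```

Keeping likelihood and severity as two separate fields of the same ordered type makes it harder to merge them by accident, and lets either one be used as a sort key.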

Scope of the Risk Matrix

At this point, you’re probably wondering “Should I have one risk matrix for the entire company, or one for each team or service?”

This is a good question. One matrix for the entire company is fine for a small company, but it can quickly become unwieldy. One per service affords good visibility at the service level, but reduced visibility at the company level. Questions such as “which service has the most significant risk to the company?” become hard to answer.

I recommend one risk matrix per team. Because decisions about which features or issues to work on, and in what priority, are often made at the team level, it makes sense for the risk matrix to be managed and prioritized at the team level. You can find more information about team-level management in Chapter 15.

Bottom line, you should scope your risk matrices in whatever way makes sense for your organization. In practice, this means one risk matrix for each team, group, or organization that typically manages its own decisions about work scoping and prioritization. Such a group may receive input and guidance from upper management, but most work is prioritized and executed at this organizational level.

Creating the Risk Matrix

First, begin with one of the risk matrix templates. We have created some for you in the most popular spreadsheet programs, and they are available for download on our website. Although you are free to customize the template as needed, you should stick as closely as possible to it for your first risk matrix. After you have some experience using the matrix and managing risks, you can customize it as you see fit.

The template has an example risk on it to demonstrate how you might use it. Feel free to delete that before continuing.

Figure 7-2. Downloaded risk matrix template

Brainstorming the List

When you have your template ready, your first step is to brainstorm a list of the risks you feel should be included. Try to include any risk you can think of, not just those with which you are concerned. Don’t analyze them during this process—just brain dump all that you can think of.

There are several good sources of insights for this brainstorm:

Dev team

Have a meeting with your development team. The team members will have an amazingly large number of worries on their mind about their services. Listen to their concerns, and add risk items for each one that comes up.

Support

Look at your support volume. Are there areas where you are seeing a higher than normal support load? What do your support people say? Do you have support forums you can review? High support areas are a common source of system risks.

Threat vectors

Think about known threat vectors and security vulnerabilities. Each of these, regardless of its size or importance, is a risk to your service.

Feature backlog

Go through your feature backlog. Are there capabilities of your system that are missing that are critical to the health of your system? Look especially for monitoring and maintenance-related backlog items.

Performance

Think about the performance of your system. Are there areas you are aware of that have poor performance?

Business owners

Talk to your business owners. What concerns do they have?

Extended team

Talk to your extended team, including internal users, dependent teams, QA staff, and so on. What concerns do they have?

Systems and processes

Do you have documented systems and processes in place? Are there places where necessary documentation for how your application functions is missing, or perhaps is held only in the heads of a few individuals?

Technical debt

Do you have known, specific technical debt in your system? Examples of technical debt include areas of code that are hard to understand or are more complicated or have more moving pieces than are necessary. Areas of known technical debt are almost always risk items.

You will likely find that there are obvious entries in the list, but there should also be entries that surprise you. This is good. You want to uncover as many of your risk vulnerabilities as possible, and if none of them come as a surprise to you, you probably haven’t dug deep enough.

Set the Likelihood and Severity Fields

Now, go through the list and set the likelihood and severity fields for each item. Use Low/Medium/High (or a similar variation) as the values for each of the two fields.

Make sure to keep the concepts of likelihood and severity distinct in your mind. Refer to Chapter 6 if necessary. It is often very easy to confuse them as you are working on this step.

It might be helpful to go through and set likelihood first, and then go back and set severity for each item. Remember, it’s quite normal for a risk item to be very severe if it occurs, but almost impossible to occur (or, alternatively, very common to occur but not very important if it does occur). You will end up with items in all combinations of H/H, H/L, L/H, and L/L state. This is normal and expected.

However you decide to do this task, you will not end up with a meaningful list if you confuse these two values.

Another brainstorming session with your development team is a great way to accomplish this task. This should be a distinct brainstorming session from the aforementioned session, which identifies the risks. Don’t label them at the same time that you identify them.
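The footnoted sorting trick for these two fields ("1-Low," "2-Medium," "3-High") can be sketched in a few lines of Python. The function name `level_rank`, the tuple layout, and the sample rows are hypothetical and exist only for this illustration; the point is that the labels are restricted to three allowed values and that likelihood and severity stay in separate columns.

```python
ALLOWED_LEVELS = ("1-Low", "2-Medium", "3-High")


def level_rank(label: str) -> int:
    """Return the numeric rank (1-3) of a label such as "2-Medium"."""
    if label not in ALLOWED_LEVELS:
        raise ValueError(f"unexpected level {label!r}; allowed: {ALLOWED_LEVELS}")
    return int(label.split("-", 1)[0])


# Hypothetical rows: (risk ID, likelihood, severity)
rows = [
    (1, "1-Low", "3-High"),
    (2, "3-High", "1-Low"),
    (3, "3-High", "3-High"),
    (4, "1-Low", "1-Low"),
]

# Sort by severity first, then by likelihood, highest first.
by_severity = sorted(
    rows,
    key=lambda row: (level_rank(row[2]), level_rank(row[1])),
    reverse=True,
)
print([row[0] for row in by_severity])   # [3, 1, 2, 4]
```

Rejecting any label outside the allowed set plays the same role as restricting the permitted cell values in a spreadsheet.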

Risk Item Details

Now, fill in the other basic details of the risk matrix. This includes things like System, Owner, Date Identified, and Status.

Make sure to assign a risk ID to each item (a simple numbering from 1…n is reasonable).

Are you monitoring for this risk? Indicate in the Monitoring field whether you have the ability to be notified if this risk is triggered.
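If the matrix is managed programmatically, one way to keep risk IDs stable is to hand them out from a counter that never reuses a value, even after a risk is removed. The following is a minimal sketch under that assumption; the class name `RiskIdAllocator` and the example risk descriptions are invented for illustration, and a real setup would also need to persist the counter between sessions.

```python
from itertools import count


class RiskIdAllocator:
    """Hands out integer risk IDs that are never reused."""

    def __init__(self, start: int = 1) -> None:
        self._counter = count(start)

    def next_id(self) -> int:
        return next(self._counter)


ids = RiskIdAllocator()

# Hypothetical risks keyed by risk ID.
matrix = {ids.next_id(): "Single point of failure in ServiceA"}   # ID 1
matrix[ids.next_id()] = "FrontEnd has no load shedding"           # ID 2

del matrix[1]   # risk 1 is resolved and removed from the matrix

matrix[ids.next_id()] = "PrimaryDb disk nearly full"   # gets ID 3, not 1
print(sorted(matrix))   # [2, 3]
```

Because the ID is independent of position, sorting or deleting rows never changes how a given risk is referred to.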

Mitigation Plan

Starting with the highest-severity items, begin to put together mitigation plans for each item. Then move on to the highest-likelihood items.

A mitigation plan is a plan for how you are going to, now or in the near future, put in changes that are designed to either reduce the severity of the risk or reduce the likelihood. A mitigation plan is not designed to remove the risk—instead, it simply reduces the severity or likelihood.

After you carry out the steps in the mitigation plan, the severity or likelihood of the risk is expected to drop, and the completed mitigation plan is removed from the matrix. A new mitigation plan can be introduced, if appropriate.

You do not need a mitigation plan for every item in the matrix. There might be items that clearly must be fixed and cannot be mitigated. Additionally, Low Likelihood/Low Severity items do not need to be mitigated.

Triggered Plan

A triggered plan is a plan for what you are going to do if the risk actually occurs. This can be something as simple as “fix the bug,” but it can also be more elaborate. For instance, if a risk occurs, are there actions you can take right then that will reduce the impact? If so, they should be spelled out as part of the triggered plan.

Starting with the highest-severity items, begin to put together triggered plans for each item.

Note

The triggered plan should not be seen as a replacement for incident-response documentation, such as playbooks. The risk matrix should not be a tool that must be consulted during an incident response. Instead, the matrix (including the triggered plan) should be a tool used by management to determine follow-up actions after the risk occurs.

Using the Risk Matrix for Planning

After your risk matrix is created, it should be consulted during all planning sessions. This includes long-range planning sessions with product management, but also Scrum-level planning sessions with your engineers.

During every planning session, the most critical risks[4] should be examined. The following questions should be asked:

  • Is this risk worse now than the last time I examined it?

  • Should we schedule work during this planning period to remove (fix) risks in our system?

  • Should we schedule work during this planning period to mitigate risks in our system, and hence reduce their likelihood or severity?

Every planning session should include a review of the risk matrix, and items on the risk matrix (either fixing risks or mitigating risks) should be included in your work prioritization process.

If your team makes use of a tool such as Jira or Pivotal Tracker during your planning sessions, you might want to add items in your tracking tool for the most critical risks. If you do that, you should refer to the Risk ID of the corresponding risk in your tracking tool item, and also add a Tracking ID column to your risk matrix to store the ID of the item from your tracking tool.
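As a rough illustration of preparing for a planning session, the sketch below picks out the most critical risks (high likelihood, high severity, or both) and shows which of them already have an item in the tracking tool. The `Risk` type, the `most_critical` function, the tracker ID `OPS-101`, and the choice to rank by the sum of the two levels are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

RANK = {"Low": 1, "Medium": 2, "High": 3}


@dataclass
class Risk:
    risk_id: int
    likelihood: str
    severity: str
    tracking_id: Optional[str] = None   # ID of the item in your tracking tool


def most_critical(risks: list[Risk]) -> list[Risk]:
    """Risks with High likelihood or High severity, both-High first."""
    critical = [
        r for r in risks
        if r.likelihood == "High" or r.severity == "High"
    ]
    # Assumption: rank by the sum of the two levels, so High/High sorts first.
    return sorted(
        critical,
        key=lambda r: RANK[r.likelihood] + RANK[r.severity],
        reverse=True,
    )


# Hypothetical risks, invented for illustration.
risks = [
    Risk(1, "Low", "High", tracking_id="OPS-101"),
    Risk(2, "Medium", "Medium"),
    Risk(3, "High", "High"),
    Risk(4, "High", "Low"),
]

for risk in most_critical(risks):
    note = risk.tracking_id or "no tracking item yet"
    print(risk.risk_id, risk.likelihood, risk.severity, note)
```

With the sample data, risk 3 is listed first, followed by risks 1 and 4; risk 2 is left out because neither of its values is High.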

Maintaining the Risk Matrix

The biggest challenge with the risk matrix is that it is very easy for the matrix to become stale. Your natural tendency is to create the matrix, and then put it into a drawer and forget about it.

If you do not take time to maintain your risk matrix, it will rapidly become out of date and useless.

To keep your risk matrix up to date and accurate, you should schedule regular reviews of the matrix with the appropriate stakeholders, including your development team and partners. This can be monthly, but it should not be any less often than quarterly. The exact recurrence cycle depends on your business processes. If you have a planning cycle starting soon, performing a regular review of the matrix before that process begins is ideal.
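A small reminder check can help keep the review cadence from slipping. The sketch below flags a matrix that has gone longer than a quarter without review; the function name `review_overdue` and the approximation of a quarter as 92 days are assumptions for this example, and the right threshold depends on your own planning cycle.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 92 days.
MAX_REVIEW_GAP = timedelta(days=92)


def review_overdue(last_review: date, today: date) -> bool:
    """True if the matrix has gone longer than a quarter without review."""
    return today - last_review > MAX_REVIEW_GAP


print(review_overdue(date(2024, 1, 10), date(2024, 3, 1)))   # False
print(review_overdue(date(2024, 1, 10), date(2024, 6, 1)))   # True
```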

Risk Matrix Review Attendees

Note that it is useful to change your risk matrix reviewers regularly. By requiring different individuals to review and comment on your risk matrix, you’ll get a fresh perspective and the review will be less likely to turn into a “same old rut” type of meeting.

During this review, you need to:

Look for new risks

Have there been new risks added to your system or recently identified? Make sure these are captured on the risk matrix.[5]

Remove old risks

Are there risks on the matrix that no longer apply—either because they can’t occur anymore or because the underlying cause has been fixed? If so, remove these.

Update likelihood and severity

Look for likelihood and severity changes. Often, recently implemented mitigations were helpful in reducing the likelihood or severity, or additional information has been gathered that will warrant a change in the likelihood or severity status. Make these updates.

Review top risks

Review all the risks that are either high likelihood or high severity (or both). Discuss these specific risks individually and make sure all the information is correct for them. Are there new or updated mitigation plans that can be put in place? What about triggered plans? Are you monitoring the risk? If not, why not? What else can you do to improve your situation with these risks?

Review less critical risks

Keep going down the likelihood and severity curve, looking at less critical risks as time permits. You do not have to review every risk every time, but make sure the top risks are all looked at often. In addition, you might want to schedule a session to examine in detail the less critical risks, just so they don’t get ignored and to make sure there aren’t hidden or missed reasons why they should be ranked higher on your list.
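Some of the review questions above can be turned into a simple checklist helper. The following is a sketch only: the `Risk` type, the `review_gaps` function, and the wording of the returned messages are hypothetical, and a check like this supplements the discussion in the review meeting rather than replacing it.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    risk_id: int
    likelihood: str          # "Low", "Medium", or "High"
    severity: str            # "Low", "Medium", or "High"
    status: str = "active"
    mitigation_plan: str = ""
    triggered_plan: str = ""
    monitoring: str = ""


def review_gaps(risk: Risk) -> list[str]:
    """List the points a reviewer should raise for this risk."""
    if risk.status == "resolved":
        return ["resolved: consider removing from the matrix"]
    if "High" not in (risk.likelihood, risk.severity):
        return []   # less critical risks are reviewed as time permits
    gaps = []
    if not risk.mitigation_plan:
        gaps.append("no mitigation plan")
    if not risk.triggered_plan:
        gaps.append("no triggered plan")
    if not risk.monitoring:
        gaps.append("not monitored")
    return gaps


# Hypothetical top risk with only a triggered plan filled in.
risk = Risk(7, "High", "Medium", triggered_plan="Fail over to the replica")
print(review_gaps(risk))   # ['no mitigation plan', 'not monitored']
```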

[1] The ID should not be the row number in the spreadsheet, however. Rows in the matrix will likely be sorted, added, and removed, which changes the spreadsheet row number for a risk. The Risk ID should be an identifier that does not change for as long as the risk is tracked.

[2] To ease column sorting by the Likelihood and Severity values, you might want to make them numeric, such as 1–3 for Low to High. A common sorting trick is to use “1-Low,” “2-Medium,” and “3-High,” and then use your spreadsheet program’s capabilities to restrict the values allowed to just these three.

[3] Incident-response plans should be readily available to your on-call personnel in your incident playbook or other tools.

[4] The most critical risks are those with the highest likelihood, the highest severity, or especially those for which both likelihood and severity are high.

[5] However, we recommend that you add a new risk to the matrix the moment you believe you have identified it. Don’t wait for a review session. You can wait for the review session to fill in all the details of the risk, but you should document it as soon as it is discovered.
