Chapter 16. Determining Risk

Hence in the wise leader’s plans, considerations of advantage and disadvantage will be blended together.

—Sun Tzu

While we’ve often mentioned risk management, we have not offered our view of risk and how to manage it. This chapter broadly covers how to determine and manage risk in any technology or business decision. Managing risk is one of the most fundamentally important aspects of increasing and maintaining availability and scalability.

Importance of Risk Management to Scale

Business is inherently a risky endeavor. Some examples of risks are that your business model will not work or will become outdated, that customers won’t want your products, that your products will be too costly, or that your products will fall out of favor with customers. To be in business, you must be able to identify and balance the risks with the associated rewards. Pushing a new release, for instance, clearly has inherent risks, but it should also have rewards.

Most organizations look at risk and focus on reducing the probability of an incident occurring. Introducing more testing into the development life cycle is the typical approach to reducing risk. The problem with this approach is that testing can only reduce the probability of a bug or error to some finite, non-zero amount; it can never eliminate it. Testing over extended periods of time, such as is done with satellite software, still does not ensure bug-free code. Recall that on September 23, 1999, NASA lost a $125 million Mars orbiter because a Lockheed Martin engineering team used English units of measurement while NASA’s team used the more conventional metric system for a key spacecraft operation. How was this discrepancy not caught during testing? The answer, as quality professionals know, is that it is mathematically impossible to ensure defect-free software for even moderately complex systems.

At AKF Partners, we prefer to view risk as a multifactor product, depicted in Figure 16.1. We divide risk into the probability of an incident occurring and the impact of an incident should it occur. The probability of an incident is driven partly by the amount of change involved in any release and the amount of testing performed on that change. The greater the effort or amount of change, the higher the probability of an incident. The more testing that we do, the lower the probability that we will have an incident. Key efforts to drive down the probability of incidents should include smaller releases and more effective testing (such as through thorough automation).

Image

Figure 16.1 Risk Model

Impact, in contrast, is driven by the breadth of the incident (measured by percentage of customers impacted or percentage of transactions) and the duration of the incident. As Figure 16.1 shows, architectural decisions such as fault isolation and x-, y-, and z-axis splits (all covered in Part III, “Architecting Scalable Solutions”) help drive down the breadth of impact. Effective monitoring, as well as problem and incident management, helps reduce the duration of an impact. We covered these areas in Part I, “Staffing a Scalable Organization.”

If business is inherently risky, are successful companies then simply better at managing risk? We think that the answer is that these companies are either effective at managing risk or have just been lucky thus far. And luck, as we all know, runs out sooner or later.

Being simply “lucky” should have you worried. One can argue that risk demonstrates a Markov property, meaning that the future states are determined by the present state and are independent of past states. We would argue that risk is cumulative to some degree, perhaps with an exponential decay but still additive. A risky event today can result in failures in the future, either because of direct correlation (e.g., today’s change breaks something else in the future) or via indirect methods (e.g., an increased risk tolerance by the organization leads to riskier behaviors in the future). Either way, actions can have near- and long-term consequences.
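The notion of risk that is additive but fades with an exponential decay can be made concrete with a small sketch. The half-life and the per-event scores below are invented purely for illustration; the chapter does not prescribe specific values.

```python
import math

def cumulative_risk(events, now, half_life_days=7.0):
    """Sum the risk scores of past events, each decayed exponentially by age.

    events: list of (day_occurred, risk_score) tuples.
    A half-life of 7 days (an assumed value) means an event still
    contributes half its original risk one week later -- risk fades
    over time, but it never vanishes the instant the change ships.
    """
    decay_rate = math.log(2) / half_life_days
    return sum(score * math.exp(-decay_rate * (now - day))
               for day, score in events
               if day <= now)

# Three risky changes over two weeks; how much residual risk on day 14?
changes = [(0, 100), (7, 50), (13, 75)]
print(round(cumulative_risk(changes, now=14), 1))
```

Under this model, yesterday's risky release still weighs on today's decision, which is exactly the argument against treating each change as an independent, memoryless event.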

Some people can naturally feel and manage risk. These people may have developed this skill from years of working around technology. They also might just have an innate ability to sense risk. While having such a person is great, such an individual still represents a single point of failure; as such, we need to develop the rest of the organization to better measure and manage risk.

Because risk management is important to scalability, we need to understand the components and steps of the risk management process. There are many ways to go about trying to accurately determine risk—some more involved than others, and some often more accurate than others. The important thing is to select the right process for your organization, which means balancing the rigor and required accuracy with what makes sense for your organization. After estimating the amount of risk, you must actively manage both the acute risk and the overall risk. Acute risk is the amount of risk associated with a particular action, such as changing a configuration on a server. Overall risk is the amount that is cumulative within the system because of all the actions that have taken place over the previous days, weeks, or possibly even months.

Measuring Risk

The first step in being able to manage risk is to—as accurately as necessary—determine what amount of risk is involved in a particular action. The reason we use the term necessary and not possible is that you may be able to more accurately determine risk, but it might not be necessary given the current state of your product or your organization. For example, a product in beta testing, where customers expect some glitches, may dictate that a sophisticated risk assessment is not necessary and that a cursory analysis is sufficient at this point. There are many different ways to analyze, assess, or estimate risk. The more of these approaches that are in your tool belt, the more likely you are to use the most appropriate one for the appropriate time and activity. Here, we will cover three methods of determining risk. For each method, we will discuss its advantages, disadvantages, and level of accuracy.

The first approach is the gut feel method. People often use this method when they believe they can feel risk, and are given the authority to make important decisions regarding risk. As we mentioned earlier, some people inherently have this ability, and it is certainly great to have someone like this in the organization. However, we would caution you about two very important concerns. First, does this person really have the ability to understand risk at a subconscious level, or do you just wish he did? In other words, have you tracked this person’s accuracy? If you haven’t, you should do so before you consider this as anything more than a method of guessing. If you have someone who claims to “feel” the risk level, have that person write his or her predictions on the team whiteboard . . . for fun. Second, heed our prior warning about single points of failure. You need multiple people in your organization to understand how to assess risk. Ideally, everyone in the organization will be familiar with the significance of risk and the methodologies that exist for assessing and managing it.

The key advantage of the gut feel method of risk assessment is that it is very fast. A true expert who fundamentally understands the amount of risk inherent in certain tasks can make decisions in a matter of a few seconds. The disadvantages of the gut feel method include that, as we discussed, the person might not have this ability but may be fooled into thinking he does because of a few key saves. Another disadvantage is that this method is rarely replicable. People tend to develop this ability over years of working in the industry and honing their expertise; it is not something that can be taught in an hour-long class. Yet another disadvantage of this method is that it leaves a lot of decision making up to the whim of one person as opposed to a team or group that can brainstorm and troubleshoot data and conclusions. The accuracy of this method is highly variable depending on the person, the action, and a host of other factors. This week a person might be very good at assessing the risk, but next week she might strike out completely. As a result, you should use this method sparingly, when time is of the essence, risk is at worst moderate, and a time-proven expert is available.

The second method of measuring risk is the traffic light method. In this method, you determine the risk of an action by breaking down the action into its smallest components and assigning each a risk level of green, yellow, or red. The smallest component could be a feature in a release or a configuration change in a list of maintenance steps. The granularity depends on several factors, including the time available and the amount of practice the team has in performing these assessments. Next we determine the overall or collective risk of the action: assign a risk value to each color, count the number of components of each color, multiply each count by its risk value, sum the products, and divide by the total number of components. The overall action is assigned the color whose risk value is closest to the result. Figure 16.2 depicts the risk ratings of three features, which combine to provide a cumulative risk for the overall release.

Image

Figure 16.2 Traffic Light Method of Risk Assessment
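The averaging scheme just described — assign a value to each color, average the values, and map back to the nearest color — can be sketched as follows. The numeric values chosen for green, yellow, and red here are illustrative assumptions; use whatever scale your team has calibrated.

```python
# Illustrative risk values per color; any consistent ascending scale works.
RISK_VALUES = {"green": 1, "yellow": 3, "red": 9}

def overall_color(component_colors):
    """Average the per-component risk values, then return the color
    whose risk value is closest to that average."""
    total = sum(RISK_VALUES[c] for c in component_colors)
    average = total / len(component_colors)
    return min(RISK_VALUES, key=lambda c: abs(RISK_VALUES[c] - average))

# A release with two low-risk features and one high-risk feature:
print(overall_color(["green", "green", "red"]))  # average 11/3, closest to yellow
```

Note that a single red feature is enough to pull a mostly green release up to yellow — one reason small releases are easier to reason about than large ones.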

Someone who is intimately familiar with the micro-level components should assess risk values and assign colors for individual items in the action. This assignment should be made based on the difficulty of the task, the amount of effort required for the task (the more effort, generally the higher the risk), the interaction of this component with others (the more connected or centralized the item, the higher the risk), and so on. Table 16.1 shows some of the most common attributes and their associated risk factors that can be used by engineers or other experts to gauge the risk of a particular feature or granular item in the overall list.

Image

Table 16.1 Risk–Attribute Correlation

One significant advantage of the traffic light method is that it begins to become methodical, which makes it repeatable, documentable, and teachable. Repeatability, in turn, implies that we can learn and improve upon the results. Many people can conduct the risk assessment, so you are no longer dependent on a single individual. Because many people can perform the assessment, there can be discussion about the decisions people arrive at, and as a group they can decide whether someone’s argument has merit. The disadvantage of this method is that it takes more time than the gut feel method and adds an extra step to the process. Another disadvantage is that it relies on each expert to choose which attributes he or she will use to assess the risk of individual components. Because of this possible variance among the experts, the accuracy of this risk assessment is mediocre. If the experts are very knowledgeable and have a clear understanding of what constitutes risky attributes for their particular area, the traffic light method can be fairly accurate. If they do not have a clear understanding of which attributes are important to examine when performing the assessment, the risk level may be off by quite a bit. We will see in the discussion of the next risk assessment methodology how this potential variance can be reduced, allowing the assessments to be more accurate.

The third method of assessing the amount of risk in a particular action is failure mode and effects analysis (FMEA). This methodology was originally developed for use by the military in the late 1940s.2 Since then, it has been used in a multitude of industries, including automotive, manufacturing, aerospace, and software development companies. The method of performing the assessment is similar to the traffic light method, in that components are broken up into the smallest parts that can be assessed for risk; for a release, this could be features, tasks, or modules. Each of these components is then identified with one or more possible failure modes. Each failure mode has an effect that describes the impact if this particular failure were to occur.

2. Procedure for performing a failure mode effect and criticality analysis. November 9, 1949. United States Military Procedure, MIL-P-1629.

For example, a signup feature may fail by not storing the new user’s information properly in the database, by assigning the wrong set of privileges to the new user, or through several other failure scenarios. The effect would be the user not being registered or having the ability to see data she was not authorized to see. Each failure scenario is scored on three factors: likelihood of failure, severity of that failure, and the ability to detect if that failure occurs (note the similarity to the risk model identified in Figure 16.1). We choose to use a scoring scale of 1, 3, and 9 for these elements because it allows us to be very conservative and differentiate items with high risk factors well above those items with medium or low risks. The likelihood of failure is essentially the probability of this particular failure scenario coming true. The severity of the failure is the total impact to the customer and the business if the failure occurs. This impact can take the form of monetary losses, loss of reputation (goodwill), or any other business-related measurement. The ability-to-detect score rates how likely you are to notice the failure if it occurs. As you can imagine, a very likely failure that has disastrous consequences and that is practically undetectable is the worst possible outcome.

After the individual failure modes and effects have been scored, the scores are multiplied to provide a total risk score—that is, likelihood score × severity score × ability to detect score. This score shows the overall risk that a particular component has within the overall action.
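This multiplication is simple enough to capture directly. The 1/3/9 scale matches the one described in the text, and the example scores are those from the chapter's credit card failure mode; the function name itself is our own.

```python
def fmea_score(likelihood, severity, detectability):
    """Total FMEA risk = likelihood score x severity score x
    ability-to-detect score, each on the conservative 1/3/9 scale."""
    valid_scores = {1, 3, 9}
    assert {likelihood, severity, detectability} <= valid_scores, \
        "each factor must be scored 1, 3, or 9"
    return likelihood * severity * detectability

# The chapter's credit card billing failure mode: unlikely (1),
# disastrous severity (9), moderately hard to detect (3).
print(fmea_score(1, 9, 3))  # total risk score of 27
# After limiting the rollout to a beta customer set, severity drops to 3:
print(fmea_score(1, 3, 3))  # revised risk score of 9
```

Because the scale is multiplicative and skewed (1/3/9 rather than 1/2/3), a single disastrous factor dominates the score, which is exactly the conservative behavior the scale is chosen for.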

The next step in the FMEA process is to determine the mitigation steps that you can perform or put in place that will lower the risk of a particular factor. For instance, if a component of a feature had a very high ability to detect score, meaning that it would be hard to notice if the event occurred, the team might decide ahead of time to write some queries to check the database every hour after the product or service release for signs of this failure, such as missing data or wrong data. This mitigation step lowers that risk factor for the component; the analysis should then record the revised, lowered risk score.

In Table 16.2, there are two example features for a human resources management (HRM) application: a new signup flow for the company’s customers and changing to a new credit card processor. Each of these features has several failure modes identified. Walking through one as an example, let’s look at the Credit Card Payment feature and focus on the Credit Card Billed Incorrectly failure mode, which has the effect of either a too-large or too-small payment being charged to the card. In our example, an engineer might have scored this as a 1, or very unlikely to occur. Perhaps this feature received extensive code review and quality assurance testing because it dealt with credit cards, which lowered the risk. The engineer also feels that this failure mode has a disastrous severity, so it receives a 9 for this score. This seems reasonable because a wrongly billed credit card would result in angry customers, costly chargebacks, and potential refunding of license fees. The engineer feels that the failure mode would be of moderate difficulty to detect and so gives it a score of 3 on this variable. The total risk score for this failure mode is 27, arrived at by multiplying 1 × 9 × 3. A remediation action has been identified for this feature set—rolling out the payment processor in beta testing for a limited customer set. Doing so will reduce the severity because the breadth of customer impact will be limited. If this remediation action is taken, the severity would be lowered to a 3 and the revised risk score would be only 9, much better than before.

Image

Table 16.2 Failure Mode and Effect Analysis Example

The advantage of FMEA as a risk assessment process is that it is very methodical, which allows it to be documented, trained, evaluated, and modified for improvement over time. Another advantage is the accuracy. Especially over time as your team becomes better at identifying failure scenarios and accurately assessing risks, this approach will become the most accurate way for you to determine risk. The disadvantage of the FMEA method is that it takes time and thought. The more time and effort put into this analysis, however, the better and more accurate the results. This method is very similar and complementary to test-driven development. Performing FMEA in advance of development allows us to improve designs and error handling.

As we will discuss in the next section, risk measurement scores, especially ones from FMEA, can be used to manage the amount of risk in a system across any time interval or in any one release/action. The next step in the risk assessment is to have someone or a team of people review the assessment for accuracy and to question any decision. This is the great part about using a methodical approach such as FMEA: Everyone can be trained on it and, therefore, can police each other to ensure the highest-quality assessment is performed. The last step in the assessment process is to revisit the assessment after the action has taken place to see how accurate you and the experts were in determining the right failure modes and assessing their factors. If a problem arose that was not identified as possible, have that expert review the situation in detail and provide a reason why this potential scenario was not identified ahead of time and a warning to other experts to watch out for this type of failure.

Managing Risk

We fundamentally believe that risk is cumulative. The greater the unmitigated risks you take, the more likely you are to experience significant problems. We teach our clients to manage both acute and overall risk in a system. Acute risk is how much risk exists from a single change or combination of changes in a release. Overall level of risk represents the accumulation of risk over hours, days, or weeks of performing risky actions on the system. Either type of risk—acute or overall—can result in a crisis scenario.

Acute risk is managed by monitoring the risk assessments performed on proposed changes to the system, such as releases. You may want to establish ahead of time some limits to the amount of risk that any one concurrent action can have or that you are willing to allow at a particular time of day or customer volume (review Chapter 10, Controlling Change in Production Environments, for more details). For instance, you may decide that any single action associated with a risk greater than 50 points, as calculated through the FMEA methodology, must be remediated below this amount or split into two separate actions. Alternatively, you may want only actions scoring below 25 points to take place on the system before midnight; everything higher must occur after midnight. Even though this is a discussion about the acute risk of a single action, this, too, is cumulative, in that the more risky items contained in a single action, the higher the likelihood of a problem, and the more difficult the detection or determination of the cause because so many things changed.
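A rule set like the one just described — remediate or split anything above 50 points, restrict anything scoring 25 points or more to after midnight — might be encoded as a simple scheduling gate. The thresholds come from the chapter's example; the function name, return values, and the definition of the after-midnight window are our own assumptions.

```python
def acute_risk_gate(fmea_points, hour_of_day):
    """Decide whether a single action may proceed, per the example rules:
    more than 50 points: must be remediated or split before scheduling;
    25 points or more: may run only after midnight (00:00-05:59 assumed);
    otherwise: allowed at any time.
    """
    if fmea_points > 50:
        return "remediate or split"
    if fmea_points >= 25 and not (0 <= hour_of_day < 6):
        return "defer to after midnight"
    return "allowed"

print(acute_risk_gate(27, hour_of_day=14))  # moderate risk, mid-afternoon
print(acute_risk_gate(27, hour_of_day=1))   # same risk, 1 a.m.
print(acute_risk_gate(60, hour_of_day=1))   # too risky regardless of time
```

The value of codifying the gate is less in the code than in removing the per-incident negotiation: the rule is written down once and applied uniformly.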

As a thought experiment, imagine a release with one feature that has two failure modes identified, compared to a release with 50 features, each with two or more failure modes. First, it is far more likely for a problem to occur in the latter case because of the number of opportunities. As an analog, consider flipping 50 pennies at the same time. While each coin has an independent probability of landing on heads, you are more likely to have at least one heads in the total results. Second, with 50 features, the likelihood of changes affecting one another or touching the same component, class, or method in an unexpected way is higher. Therefore, both from a cumulative opportunity standpoint and from a cumulative probability of negative interactions, there is an increased likelihood of a problem occurring in the 50-feature case compared to the one-feature case. If a problem arises after these releases, it will also be much easier to determine the cause of the problem when the release contains one feature than when it contains 50 features, assuming that all the features are somewhat proportional in complexity and size.
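The penny analogy can be made concrete: if each of n independent features carries probability p of causing an incident, the chance of at least one incident is 1 − (1 − p)^n. The 5% per-feature probability below is invented purely for illustration.

```python
def prob_at_least_one_incident(n_features, p_per_feature):
    """Probability that at least one of n independent features causes
    an incident: the complement of all n features succeeding."""
    return 1 - (1 - p_per_feature) ** n_features

# Even a modest assumed 5% per-feature incident probability compounds fast:
print(round(prob_at_least_one_incident(1, 0.05), 3))   # one-feature release
print(round(prob_at_least_one_incident(50, 0.05), 3))  # fifty-feature release
```

At these assumed numbers, the one-feature release fails about 5% of the time while the fifty-feature release fails over 90% of the time — and this sketch still ignores the second effect the text describes, negative interactions between the features themselves.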

For managing acute risk, we recommend that you construct a chart such as the one in Table 16.3 that outlines all the rules and associated risk levels that are acceptable. This way, the action for each risk level is clear-cut. You should also establish an exceptions policy—for example, anything outside of these rules must be approved by the VP of engineering and the VP of operations or the CTO alone.

Image

Table 16.3 Acute Risk Management Rules

For managing overall risk, there are two factors you should consider. The first is the cumulative amount of changes that have taken place in the system and the corresponding increase in the amount of risk associated with each of these changes. Just as we discussed in the context of acute risk, combinations of actions can have unwanted interactions. The more releases or database splits or configuration changes that are made, the more likely that one will cause a problem or the interaction of them will cause a problem. If the development team has been working in a development environment with a single database and two days before the release the database is split into a master and a read host, it’s fairly likely that the next release will have a problem unless a ton of coordination and remediation work has been done.

The second factor that should be considered in the overall risk analysis is the human factor. As people perform increasingly more risky activities, their level of risk tolerance goes up. This human conditioning can work for us very well when we need to become adapted to a new environment, but when it comes to controlling risk in a system, it can lead us astray. If a saber-toothed tiger has moved into the neighborhood and you still have to leave your cave each day to hunt, the ability to adapt to the new risk in your life is critical to your survival. Otherwise, you might stay in your cave all day and starve. Conversely, adding risk because you have yet to be eaten and feel invincible is a good way to cause serious issues.

We recommend that to manage the overall amount of risk in a system, you adopt a set of rules such as those shown in Table 16.4. This table lays out the amount of risk, as determined by FMEA, for specific time periods. If you are using a different methodology than FMEA, you need to adjust the risk level column with some scale that makes sense. For instance, instead of “less than 150 points” you could use “fewer than 5 green or 3 yellow actions.” As in the acute risk management process, you will need to account for objections and overrides. You should plan ahead and have an escalation process established. One idea might be that a director can grant an extra 50 points to any risk level, a VP can grant 100 points, and the CTO can grant 250 points, but these additions are not cumulative. Any way you decide to set up this set of rules, what matters most is that it makes sense for your organization and that it is documented and strictly adhered to. As another example, suppose a feature release requires a major database upgrade. The combined FMEA score is 200, which exceeds the maximum risk for a feature release (150 points in Table 16.3). The CTO can then either approve the additional risk or else require the database upgrade to be done separately from the code release.

Image

Table 16.4 Overall Risk Management Rules
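Overall risk rules like those in Table 16.4 can be enforced with a running tally per time window. The escalation grants below mirror the chapter's examples (an extra 50 points from a director, 100 from a VP, 250 from the CTO, not cumulative); the base window limit and function shape are our own assumptions.

```python
# Single, non-cumulative escalation grants from the chapter's example.
ESCALATION_GRANTS = {"director": 50, "vp": 100, "cto": 250}

def may_schedule(action_points, scheduled_points, window_limit, approver=None):
    """Allow a new action only if the window's cumulative FMEA points,
    including this action, stay within the limit plus at most one
    escalation grant (grants do not stack)."""
    grant = ESCALATION_GRANTS.get(approver, 0)
    return scheduled_points + action_points <= window_limit + grant

# Assumed 150-point weekly limit, with 130 points already scheduled:
print(may_schedule(27, scheduled_points=130, window_limit=150))                 # over budget
print(may_schedule(27, scheduled_points=130, window_limit=150, approver="vp"))  # VP grants 100
```

Keeping the tally in one place also gives the organization a single number to watch: when the window's total creeps toward its limit, that is the signal to slow down rather than escalate by habit.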

Conclusion

In this chapter, we focused on risk. Our discussions started by exploring the purpose of risk management and how it relates to scalability. We concluded that risk is prevalent in all businesses, but especially startups. To be successful, you have to take risks in the business world. In the SaaS world, scalability is part of this risk–reward structure. You must take risks in terms of your system’s scalability or else you will overbuild your system and not deliver products that will make the business successful. By actively managing your risk, you can increase the availability and scalability of your system.

Although many different approaches are used to assess risk, this chapter examined three specific options. The first method is gut feeling. Unfortunately, while some people are naturally gifted at identifying risk, many others are credited with this ability but actually lack it.

The second method is the traffic light approach, which assesses components as low risk (green), medium risk (yellow), or high risk (red). The combination of all components in an action, release, change, or maintenance provides the overall risk level.

The third approach is the failure mode and effect analysis methodology—our recommended approach. In this method, experts are asked to assess the risk of components by identifying the failure modes that are possible with each component or feature and the resulting effect that each failure would cause. For example, a credit card payment feature might fail by charging a wrong amount to the credit card, with the effect being a charge to the customer that is too large or too small. These failure modes and effects are scored based on their likelihood of occurrence, the severity if they were to occur, and the ability to detect if they did occur. These scores are then multiplied to provide a total risk score. Using this score, the experts would then recommend remediation steps to reduce the risk of one or more of the factors, thereby decreasing the overall risk score.

After assessing risk, we must manage it. This step can be broken up into the management of acute risk and the management of overall risk. Acute risk deals with single actions, releases, maintenances, and so on, whereas overall risk deals with all changes over periods of time such as hours, days, or weeks. For both acute and overall risk, we recommend adopting rules that specify predetermined amounts of risk that will be tolerated for each action or time period. Additionally, in preparation for objections, an escalation path should be established ahead of time so that the first crisis does not create its own path without thought and proper input from all parties.

As with most processes, the most important aspect of both risk assessment and risk management is the fit within your organization at this particular time. As your organization grows and matures, you may need to modify or augment these processes. For risk management to be effective, it must be used; for it to be used, it needs to be a good fit with your team.

Key Points

• Business is inherently risky; the changes that we make to improve scalability of our systems can be risky as well.

• Managing the amount of risk in a system is key to availability and ensuring the system can scale.

• Risk is cumulative, albeit with some degree of degradation occurring over time.

• For best results, you should use a method of risk assessment that is repeatable and measurable.

• Risk assessments, like other processes, can be improved over time.

• There are both advantages and disadvantages to all of the various risk assessment approaches.

• There is a great deal of difference in the accuracy of various risk assessment approaches.

• Risk management can be viewed as addressing both acute risk and overall risk.

• Acute risk management deals with single instances of change, such as a release or a maintenance procedure.

• Overall risk management focuses on watching and administering the total level of risk in the system at any point in time.

• For the risk management process to be effective, it must be used and followed.

• The best way to ensure a process is adhered to is to make sure it is a good fit with the organization.
