Chapter 34

Summarizing Risk Management Processes and Concepts

This chapter covers the following topics related to Objective 5.4 (Summarize risk management processes and concepts) of the CompTIA Security+ SY0-601 certification exam:

  • Risk types

    • External

    • Internal

    • Legacy systems

    • Multiparty

    • IP theft

    • Software compliance/licensing

  • Risk management strategies

    • Acceptance

    • Avoidance

    • Transference

      • Cybersecurity insurance

    • Mitigation

  • Risk analysis

    • Risk register

    • Risk matrix/heat map

    • Risk control assessment

    • Risk control self-assessment

    • Risk awareness

    • Inherent risk

    • Residual risk

    • Control risk

    • Risk appetite

    • Regulations that affect risk posture

    • Risk assessment types

      • Qualitative

      • Quantitative

    • Likelihood of occurrence

    • Impact

    • Asset value

    • Single loss expectancy (SLE)

    • Annualized loss expectancy (ALE)

    • Annualized rate of occurrence (ARO)

  • Disasters

    • Environmental

    • Person-made

    • Internal vs. external

  • Business impact analysis

    • Recovery time objective (RTO)

    • Recovery point objective (RPO)

    • Mean time to repair (MTTR)

    • Mean time between failures (MTBF)

    • Functional recovery plans

    • Single point of failure

    • Disaster recovery plan (DRP)

    • Mission essential functions

    • Identification of critical systems

    • Site risk assessment

As it relates to computer security, a risk is the possibility of a malicious attack or other threat causing damage or downtime to a computer system. Generally, this damage is done by exploiting vulnerabilities in a computer system or network. The more vulnerabilities, the greater the risk. Organizations should be extremely interested in managing vulnerabilities and thereby managing risk. Risk management can be defined as the identification, assessment, and prioritization of risks, and the mitigating and monitoring of those risks. Specifically, when we talk about computer hardware and software, risk management is also known as information assurance (IA). In this chapter we start by discussing the different risk types and then move on to understanding risk management strategies and risk analysis. The chapter finishes with a discussion of disasters and business impact analysis.

“Do I Know This Already?” Quiz

The “Do I Know This Already?” quiz enables you to assess whether you should read this entire chapter thoroughly or jump to the “Chapter Review Activities” section. If you are in doubt about your answers to these questions or your own assessment of your knowledge of the topics, read the entire chapter. Table 34-1 lists the major headings in this chapter and their corresponding “Do I Know This Already?” quiz questions. You can find the answers in Appendix A, “Answers to the ‘Do I Know This Already?’ Quizzes and Review Questions.”

Table 34-1 “Do I Know This Already?” Section-to-Question Mapping

Foundation Topics Section | Questions
Risk Types | 1–2
Risk Management Strategies | 3–5
Risk Analysis | 6–8
Disaster Analysis | 9
Business Impact Analysis | 10–11

Caution

The goal of self-assessment is to gauge your mastery of the topics in this chapter. If you do not know the answer to a question or are only partially sure of the answer, you should mark that question as wrong for purposes of the self-assessment. Giving yourself credit for an answer you correctly guess skews your self-assessment results and might provide you with a false sense of security.

1. What type of risk would a nation state threat actor be considered to a private corporation?

  1. External

  2. Internal

  3. Multilateral

  4. None of these answers are correct.

2. Your organization has just hired a new employee. However, the employee is really a spy looking to steal your intellectual property. Which type of risk would this be considered?

  1. External

  2. Internal

  3. Multivendor

  4. All of these answers are correct.

3. Which of the following is defined as the identification, assessment, and prioritization of risks, and the mitigating and monitoring of those risks?

  1. Risk management

  2. Risk mitigation

  3. Risk tolerance

  4. None of these answers are correct.

4. Which of the following usually entails not carrying out a proposed plan because the risk factor is too great?

  1. Risk transference

  2. Risk assessment

  3. Risk mitigation

  4. Risk avoidance

5. Which of the following is an example of risk transference?

  1. Risk appetite

  2. Risk avoidance

  3. Cybersecurity insurance

  4. All of these answers are correct.

6. Which of the following is the attempt to determine the number of threats or hazards that could possibly occur in a given amount of time to your computers and networks?

  1. Risk avoidance

  2. Risk assessment

  3. Risk mitigation

  4. Risk acceptance

7. _____________ risk assessment is an assessment that assigns numeric values to the probability of a risk and the impact it can have on the system or network.

  1. Quantitative

  2. Qualitative

  3. Measured

  4. None of these answers are correct.

8. _______________ is the total reduction or elimination of a risk.

  1. Risk assessment

  2. Risk transference

  3. Risk mitigation

  4. Risk acceptance

9. Which type of disaster can be defined as being caused by the influence of humans?

  1. Person-made

  2. Environmental

  3. Weather-made

  4. None of these answers are correct.

10. ___________ defines the average number of failures per million hours of operation for a product in question.

  1. Mean time between failures (MTBF)

  2. Mean time to repair (MTTR)

  3. Mean time between recovery (MTBR)

  4. None of these answers are correct

11. ___________ is the measured period of time between failures of a system.

  1. Mean time between failures (MTBF)

  2. Mean time to failure (MTTF)

  3. Mean time between recovery (MTBR)

  4. None of these answers are correct.

Foundation Topics

Risk Types

According to NIST SP 800-137, the definition of risk is as follows:

A measure of the extent to which an entity is threatened by a potential circumstance or event, and typically a function of: (i) the adverse impacts that would arise if the circumstance or event occurs; and (ii) the likelihood of occurrence. [Note: Information system-related security risks are those risks that arise from the loss of confidentiality, integrity, or availability of information or information systems and reflect the potential adverse impacts to organizational operations (including mission, functions, image, or reputation), organizational assets, individuals, other organizations, and the Nation. Adverse impacts to the Nation include, for example, compromises to information systems that support critical infrastructure applications or are paramount to government continuity of operations as defined by the Department of Homeland Security.]

Risks to your organization or environment can come in many shapes and forms. Let’s first talk about the primary concern of most organizations: external risk. It is the biggest concern for most organizations because it comes from an external entity that could have many different motivations and, therefore, many different targets inside your organization. Many imagine an external “hacker” as someone in a black hoodie sitting in a basement, hammering away at a keyboard in front of 10 different monitors. That, of course, is usually not the case. External risk most likely comes from an organized threat actor or an organization of threat actors with various motivations. To carry out their attacks, they use many different methods, many of which we have discussed in depth in other sections of this book. The primary goal of external attackers is to gain access to your organization’s computing environment, gain a foothold, and keep it as long as possible to carry out their objectives, whatever they may be.

Of course, what many organizations tend to overlook are the internal risks. The majority of internal risks stem from employees or others internal to the organization, such as contractors. While the motivation of external cybercriminals may be to gain access and keep access, the internal threat actor already has access to the organization’s environment. This person’s motivations are usually different from those of external threat actors, although the goal may be the same in the end. Most internal attacks result in the exfiltration or destruction of sensitive data that belongs to the organization. The theft of intellectual property is a primary goal of both internal and external threat actors.

Let’s talk about the risks that exist within the organization that could lead to internal or external threat actors obtaining their goal. As we have discussed, external threat actors want to maintain access that they have obtained. To do this, they often use the technique of pivoting and scanning the internal environment that they now have access to. An easy target for them to gain another foothold would be a legacy system on the network that is out of date and contains vulnerabilities that allow an additional attack to be carried out. Many organizations take the approach of a hard outer shell to secure their network and allow for a softer internal attack surface. Often this is caused by a lack of software compliance/licensing processes within the organization. The belief in these organizations is that the internal legacy systems are not at risk because they are behind a firewall and not directly accessible from the Internet. As you know, this is a misconception. As soon as your external network is breached, all of those internal legacy systems become prime targets. Of course, let’s not forget that in the cybersecurity world things are not exactly black and white. Some risks would be considered multiparty. External threat actors could be working with internal threat actors. Also, external threat actors could have access to an asset such as a contractor laptop that then gives them that internal access they are looking to obtain.

Risk Management Strategies

In this section we dive deeper into the details of risk management strategies. Organizations usually employ one of the four following general strategies when managing a particular risk:

  • Transfer the risk to another organization or third party.

  • Avoid the risk.

  • Mitigate the risk.

  • Accept some or all of the consequences of a risk.

It is possible to transfer some risk to a third party. An example of risk transference (also known as risk sharing) would be an organization that purchases cybersecurity insurance for a group of servers in a data center. The organization still takes on the risk of losing data in the case of server failure, theft, and disaster but transfers the risk of losing the money those servers are worth in case they are lost.

Some organizations opt to avoid risk. Risk avoidance usually entails not carrying out a proposed plan because the risk factor is too great. An example of risk avoidance would be a high-profile organization deciding not to implement a new and controversial website based on its belief that too many attackers would attempt to exploit it.

However, the most common goal of risk management is to mitigate all risk to a level acceptable to the organization. It is impossible to eliminate all risk, but it should be mitigated as much as possible within reason. Usually, budgeting and IT resources dictate the level of risk mitigation and what kinds of deterrents can be put in place. For example, installing antivirus/firewall software on every client computer is common; most companies do this. However, installing a high-end, hardware-based firewall at every computer is not common; although this method would probably make for a secure network, the amount of money and administration needed to implement that solution would make it unacceptable.

Most organizations are willing to accept a certain amount of risk. This is risk acceptance, also known as risk retention. Sometimes, vulnerabilities that would otherwise be mitigated by the implementation of expensive solutions are instead dealt with when and if they are exploited. IT budgeting and resource management are big factors when it comes to these risk management decisions.

After the risk transference, risk avoidance, and risk mitigation techniques have been implemented, an organization is left with a certain amount of residual risk—the risk left over after a detailed security plan and disaster recovery plan have been implemented. There is always risk because a company cannot possibly foresee every future event, nor can it secure against every single threat. Senior management as a collective whole is ultimately responsible for deciding how much residual risk there will be in a company’s infrastructure and how much risk there will be to the company’s data. Often, no one person will be in charge of this level, but it will be decided on as a group.

There are many different types of risks to computers and computer networks. Of course, before you can decide what to do about particular risks, you need to assess what those risks are.

Risk Analysis

Risk assessment is the attempt to determine the number of threats or hazards that could possibly occur in a given amount of time to your computers and networks. When you assess risks, they are often recognized threats, but risk assessment can also take into account new types of threats that might occur. When risk has been assessed, it can be mitigated up until the point at which the organization will accept any residual risk. Generally, risk assessments follow a particular order, for example:

Step 1. Identify the organization’s assets.

Step 2. Identify vulnerabilities.

Step 3. Identify threats and threat likelihood.

Step 4. Identify potential monetary impact.

The fourth step is also known as impact assessment. At this point, you determine the potential monetary costs related to a threat.

An excellent tool to create during your risk assessment is a risk register, also known as a risk log, which helps track issues and address problems as they occur. After the initial risk assessment, you, as security administrator, will continue to use and refer to the risk register. It can be a great tool for just about any organization but can be of more value to certain types of organizations, such as manufacturers that rely on a supply chain. Such an organization would want to implement a specialized type of risk management called supply chain risk management (SCRM), in which the organization collaborates with suppliers and distributors to analyze and reduce risk. One approach that an organization can take to identify risks and controls is a risk control assessment. This assessment can be completed by a third party or by an internal team. When an internal team completes it, the process is called a risk control self-assessment (RCSA). This process is often used by financial institutions that need to meet regulatory compliance requirements. A good way to represent the results of a risk assessment visually is a risk matrix/heat map. See Figure 34-1 for an example.

FIGURE 34-1 Risk Matrix/Heat Map
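
To make the idea of a risk matrix concrete, it can be sketched in a few lines of Python. The 1-to-5 scales, the multiplication of likelihood by impact, and the band thresholds below are illustrative assumptions; they are not taken from Figure 34-1, and an organization would substitute its own scales and cutoffs.

```python
# Hypothetical 5x5 risk matrix. The scales and thresholds are illustrative
# assumptions, not values taken from the book's figure.

def risk_rating(likelihood: int, impact: int) -> str:
    """Map 1-5 likelihood and 1-5 impact ratings to a heat-map band."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    score = likelihood * impact  # ranges from 1 to 25
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


def render_heat_map() -> list[str]:
    """Build one text row per likelihood level, highest likelihood first."""
    rows = []
    for likelihood in range(5, 0, -1):
        cells = [risk_rating(likelihood, impact)[0].upper()
                 for impact in range(1, 6)]
        rows.append(" ".join(cells))
    return rows


if __name__ == "__main__":
    # Columns run from impact 1 (left) to impact 5 (right).
    for row in render_heat_map():
        print(row)
```

Each cell shows L, M, or H for low, medium, or high, so the upper-right corner (high likelihood, high impact) is where the most urgent risks collect.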

Table 34-2 summarizes common terms associated with risk analysis.

Table 34-2 Common Risk Terms

Term | Description
Risk appetite | The types and amount of risk, on a broad level, an organization is willing to accept in its pursuit of value.
Inherent risk | The representation of the level of risk an organization would experience if the correct mitigation was not in place.
Residual risk | The risk left over after a detailed security plan and disaster recovery plan have been implemented.
Control risk | The risk that a control that is in place may not detect or may fail to protect the environment.
Risk awareness | The ability of an organization to identify risks before they become a threat. The overall preparedness of an organization to mitigate risk.
Risk mitigation | NIST defines risk mitigation as “Prioritizing, evaluating, and implementing the appropriate risk-reducing controls/countermeasures recommended from the risk management process.”

The two most common risk assessment methods are qualitative and quantitative.

Qualitative Risk Assessment

Qualitative risk assessment is an assessment that assigns numeric values to the probability of a risk and the impact it can have on the system or network. Unlike its counterpart, quantitative risk assessment, it does not assign monetary values to assets or possible losses. It is the easier, quicker, and cheaper way to assess risk but cannot assign asset value or give a total for possible monetary loss.

With this method, ranges can be assigned, for example, 1 to 10 or 1 to 100. The higher the number, the higher the probability of risk, or the greater the impact on the system. As a basic example, a computer without antivirus software that is connected to the Internet will most likely have a high probability of risk; it will also most likely have a great impact on the system. You could assign the number 99 as the probability of risk. You are not sure exactly when it will happen but are 99 percent sure that it will happen at some point. Next, you could assign the number 90 out of 100 as the impact of the risk. This number implies a heavy impact; probably either the system has crashed or has been rendered unusable at some point. There is a 10 percent chance that the system will remain usable, but it is unlikely. Finally, you multiply the two numbers together to find out the qualitative risk: 99 × 90 = 8910. That’s 8910 out of a possible 10,000, which is a high level of risk. The way to mitigate risk in this example would be to install antivirus software and verify that it is configured to auto-update. By assigning these types of qualitative values to various risks, you can make comparisons from one risk to another and get a better idea of what needs to be mitigated and what doesn’t.
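
The arithmetic in this example is simple enough to sketch in Python. The function names and the second risk entry are hypothetical, added only to show how scores can be compared; the first entry uses the numbers from the antivirus example.

```python
def qualitative_risk(probability: int, impact: int, scale: int = 100) -> int:
    """Multiply a probability rating by an impact rating on a 1-to-scale range."""
    for value in (probability, impact):
        if not 1 <= value <= scale:
            raise ValueError(f"ratings must be between 1 and {scale}")
    return probability * impact


# (probability, impact) ratings on a 1-100 scale.
risks = {
    "Internet-connected PC without antivirus": (99, 90),  # numbers from the text
    "Patched PC behind a firewall": (20, 60),              # hypothetical ratings
}


def rank_risks(ratings: dict[str, tuple[int, int]]) -> list[tuple[str, int]]:
    """Return (name, score) pairs sorted from highest to lowest score."""
    scored = [(name, qualitative_risk(p, i)) for name, (p, i) in ratings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_risks(risks):
        print(f"{score:>5} / 10000  {name}")
```

Sorting by score is one way to turn individual ratings into a priority list; the ratings themselves remain judgment calls.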

The main issue with this type of risk assessment is that it is difficult to place an exact value on many types of risks. The type of qualitative system varies from organization to organization, even from person to person; it is a common source of debate as well. This makes qualitative risk assessments more descriptive than truly measurable. However, by relying on group surveys, company history, and personal experience, you can get a basic idea of the risk involved.

Quantitative Risk Assessment

Quantitative risk assessment measures risk by using exact monetary values. It attempts to give an expected yearly loss in dollars for any given risk. It also assigns asset values to servers, routers, and other network equipment.

Three values are used when making quantitative risk calculations:

  • Single loss expectancy (SLE): The loss of value in dollars based on a single incident.

  • Annualized rate of occurrence (ARO): The number of times per year that the specific incident occurs.

  • Annualized loss expectancy (ALE): The total loss in dollars per year due to a specific incident. The incident might happen once or more than once; either way, this number is the total loss in dollars for that particular type of incident. It is computed with the following calculation:

    SLE × ARO = ALE

So, for example, suppose you wanted to find out how much an e-commerce web server’s downtime would cost the company per year. You would need some additional information such as the average web server downtime in minutes and the number of times this occurs per year. You also would need to know the average sale amount in dollars and how many sales are made per minute on this e-commerce web server. This information can be deduced by using accounting reports and by further security analysis of the web server, which we discuss later. For now, let’s just say that over the past year the web server failed seven times. The average downtime for each failure was 45 minutes. That equals a total of 315 minutes of downtime per year, close to 99.9 percent uptime. (The more years you can measure, the better the estimate will be.) Now let’s say that this web server processes an average of 10 orders per minute with average revenue of $35. That means that $350 of revenue comes in per minute. As mentioned, a single downtime averages 45 minutes, corresponding to a $15,750 loss per occurrence. So, the SLE is $15,750. Ouch! Some salespeople are going to be unhappy with your 99.9 percent uptime! But you’re not done. You can calculate the annualized loss expectancy (ALE) by multiplying the SLE ($15,750) by the annualized rate of occurrence (ARO). Because the web server failed seven times last year, the SLE × ARO would be $15,750 × 7, which equals $110,250 (the ALE). This example is shown in Table 34-3.

Table 34-3 Quantitative Risk Assessment Example

SLE | ARO | ALE
$15,750 | 7 | $110,250
Revenue lost due to each web server failure | Total web server failures over the past year | Total loss due to web server failure per year
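
The same calculation can be written out as a short Python sketch. The function and parameter names are assumptions made for illustration; the input values are the ones from the web server example.

```python
def single_loss_expectancy(orders_per_minute: int,
                           revenue_per_order: int,
                           downtime_minutes: int) -> int:
    """Revenue lost in dollars during one outage (SLE)."""
    return orders_per_minute * revenue_per_order * downtime_minutes


def annualized_loss_expectancy(sle: int, aro: int) -> int:
    """ALE = SLE x ARO."""
    return sle * aro


# Values from the e-commerce web server example.
sle = single_loss_expectancy(orders_per_minute=10,
                             revenue_per_order=35,
                             downtime_minutes=45)
aro = 7  # failures observed over the past year
ale = annualized_loss_expectancy(sle, aro)

if __name__ == "__main__":
    print(f"SLE: ${sle:,}")
    print(f"ARO: {aro}")
    print(f"ALE: ${ale:,}")
```

Keeping SLE and ARO as separate inputs makes it easy to see how a mitigation changes the result: shortening each outage lowers the SLE, while reducing the number of failures lowers the ARO.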

Apparently, you need to increase the uptime of the e-commerce web server! Many organizations demand 99.99 percent or even 99.999 percent uptime; 99.999 percent uptime means that the server will have only about 5 minutes of downtime over the entire course of the year. Of course, to accomplish this, you first need to scrutinize the server to see precisely why it fails so often. What exactly are the vulnerabilities of the web server? Which ones were exploited? Which threats exploited those vulnerabilities? By exploring the server’s logs, configurations, and policies, and by using security tools, you can discern exactly why this happens so often. However, this analysis should be done carefully because the server does so much business for the company.
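
As a rough check on these uptime figures, the following Python sketch converts between an uptime percentage and minutes of downtime per year. It assumes a 365-day year, and the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def uptime_percent(downtime_minutes: float) -> float:
    """Uptime percentage implied by a yearly downtime total."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
    # The web server example: seven failures of 45 minutes each.
    print(f"315 minutes of downtime is {uptime_percent(315):.2f}% uptime")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why moving from 99.9 percent to 99.999 percent usually requires redundancy rather than faster repairs alone.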

It isn’t possible to assign a specific ALE to incidents that will happen in the future, so new technologies should be monitored carefully. Any failures should be documented thoroughly. For example, a spreadsheet could be maintained that contains the various technologies your organization uses; their failure history; their SLE, ARO, and ALE; and mitigation techniques that you have employed, and when they were implemented.

Table 34-4 compares the different aspects of quantitative and qualitative risk.

Table 34-4 Risk Assessment Types

Risk Assessment Type | Description | Key Points
Qualitative risk assessment | Assigns numeric values to the probability of a risk, and the impact it can have on the system or network. | Numbers are arbitrary. Examples: 1–10 or 1–100.
Quantitative risk assessment | Measures risk by using exact monetary values. It attempts to give an expected yearly loss in dollars for any given risk. | Values are specific monetary amounts. SLE × ARO = ALE. MTBF can be used for additional data.

Note

Most organizations within the medical, pharmaceutical, and banking industries use quantitative risk assessments; they need to have specific monetary numbers to measure risk. Taking this one step further, many banking institutions adhere to the recommendations within the Basel I, II, and III Accords. These recommended standards describe how much capital a bank should put aside to aid with financial and operational risks if they occur.

Disaster Analysis

No matter how much redundancy you implement, there is always a chance that a disaster could arise. A disaster could be the loss of data on a server, a fire in a server room, or the catastrophic loss of access to an organization’s building. To prepare for these events, you should design a disaster recovery plan, but with the thought in mind that redundancy and fault tolerance can defend against most “disasters.” The best administrator is the one who avoids disaster and, in the rare case that it does happen, has a plan in place to recover quickly from it.

Before you can plan for disasters, you need to define exactly what disasters are possible and list them in order starting with the most probable. This step sounds a bit morbid, but it’s necessary to ensure the long-term welfare of your organization.

Disasters can be divided into two categories: environmental and person-made. They can also be looked at from the perspective of internal versus external. Some of the disasters that could render your server room inoperable include the following:

  • Fire: Fire is probably the number one planned-for disaster. This is partially because most municipalities require some sort of fire suppression system, and partially because most organizations’ policies define the usage of a proper fire suppression system. The three main types of fire extinguishers include A (for ordinary combustibles such as wood and paper, which leave ash), B (for gas and other flammable liquid fires), and C (for electrical fires). Unfortunately, these and the standard sprinkler system in the rest of the building are not adequate for a server room. If there were a fire, the material from the fire extinguisher or the water from the sprinkler system would damage the equipment, making the disaster even worse! Instead, a server room should be equipped with a proper system of its own such as DuPont FM-200. This system uses a large tank that stores a clean agent fire extinguishant that is sprayed from one or more nozzles in the ceiling of the server room. It can put out fires of all types in seconds. A product such as this can be used safely when people are present; however, most systems also employ a very loud alarm that tells all personnel to leave the server room. It is wise to run through several fire suppression alarm tests and fire drills, ensuring that the alarm will sound when necessary and that personnel know what to do when the alarm sounds. For example, escape plans should be posted, and battery-backup exit signs should be installed in various locations throughout the building so that employees know the quickest escape route in the case of a fire. Fire drills (and other safety drills) should be performed periodically so that the organization can analyze the security posture of its safety plan.

  • Flood: The best way to avoid server room damage in the case of a flood is to locate the server room on the first floor or higher, not in a basement. There’s not much you can do about the location of a building, but if it is in a flood zone, it makes the use of a warm or hot site that much more imperative. And a server room could also be flooded by other things such as boilers. The room should not be adjacent to, or on the same floor as, a boiler room. It should also be located away from other water sources such as bathrooms and any sprinkler systems. The server room should be thought of three-dimensionally; the floors, walls, and ceiling should be analyzed and protected. Some server rooms are designed to be a room within a room and might have drainage installed as well.

  • Long-term power loss: Short-term power loss should be countered by the UPS, but long-term power loss requires a backup generator and possibly a redundant site.

  • Theft and malicious attack: Theft and malicious attack can also cause a disaster, if the right data is stolen. Physical security such as door locks/access systems and video cameras should be implemented to avoid this. Servers should be cable-locked to their server racks, and removable hard drives (if any are used) should have key access. Not only do you, as security administrator, have the task of writing policies and procedures that govern the security of server rooms and data centers, but you will often have the task of enforcing those policies. Enforcement can mean muscle in the form of security guards and dual-class technician/guards, or otherwise having the right to terminate employees as needed, contact and work with the authorities, and so on.

  • Loss of building: Temporary loss of the building due to gas leak, malicious attack, inaccessibility due to crime scene investigation, or natural event will require personnel to access a redundant site. Your server room should have as much data archived as possible, and the redundant site should be warm enough to keep business running. A plan should be in place as to how data will be restored at the redundant site and how the network will be made functional.

Business Impact Analysis

Next, we discuss business impact analysis. We start by covering recovery time objectives and recovery point objectives, continuing on with a description of mean time to repair and mean time between failures. From there, we cover functional recovery plans, single points of failure, and the disaster recovery plan. We conclude with a look at mission-essential functions, identification of critical systems, and site risk assessment.

Although it’s impossible to predict the future accurately, it can be quantified on an average basis using concepts such as mean time between failures (MTBF). This term deals with reliability. Strictly speaking, MTBF is the average operating time between failures of a product; it is commonly presented in terms of the average number of failures per million hours of operation for the product in question. This number is based on historical baselines among various customers who use the product. It can be very helpful when making quantitative assessments.

Note

Another way of describing MTBF is called failure in time (FIT), which is the number of failures per billion hours of operation.

You should know two other terms related to MTBF: mean time to repair (MTTR), which is the time needed to repair a failed device, and mean time to failure (MTTF), which is a basic measure of reliability for devices that cannot be repaired. All three of these concepts should also be considered when creating a disaster recovery plan (DRP).
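
The relationships among these reliability terms can be sketched in Python. The formulas below are standard reliability conventions that this chapter does not spell out, so treat them as assumptions: the failure rate is taken as the reciprocal of MTBF in hours, and steady-state availability is taken as MTBF / (MTBF + MTTR). The device figures are hypothetical.

```python
def failures_per_million_hours(mtbf_hours: float) -> float:
    """Failure rate expressed per million hours of operation."""
    return 1_000_000 / mtbf_hours


def failures_in_time(mtbf_hours: float) -> float:
    """FIT: failures per billion hours of operation."""
    return 1_000_000_000 / mtbf_hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable device is expected to be operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical device: MTBF of 100,000 hours, MTTR of 4 hours.
    mtbf, mttr = 100_000, 4
    print(f"Failures per million hours: {failures_per_million_hours(mtbf):.1f}")
    print(f"FIT: {failures_in_time(mtbf):.0f}")
    print(f"Availability: {availability(mtbf, mttr):.5%}")
```

Seen this way, a long MTBF and a short MTTR both raise availability, which is why a disaster recovery plan should consider repair time as well as failure frequency.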

When an environment is planned properly, it can withstand most failures barring total disaster using the following redundancy precautions:

  • Redundant power in the form of power supplies, UPSs, and backup generators

  • Redundant data, servers, ISPs, and sites

The whole concept revolves around single points of failure. A single point of failure is an element, object, or part of a system that, if it fails, causes the whole system to fail. By implementing redundancy, you can bypass just about any single point of failure.

There are two methods for combating single points of failure. The first is to use redundancy. If employed properly, redundancy keeps a system running with no downtime. However, this solution can be pricey, and we all know there is only so much IT budget to go around. So, the alternative is to make sure you have plenty of spare parts lying around. This is a good method if your network and systems are not time-critical. Installing spare parts often requires you to shut down the server or a portion of a network. If this risk is not acceptable to an organization, you’ll have to find the cheapest redundant solutions available. Research is key, but don’t be fooled by the hype: sometimes the simplest sounding solutions are the best.

Here’s the scenario. Your server room has the following powered equipment:

  • Nine servers

    • Two Microsoft domain controllers (DCs)

    • One DNS server

    • Two file servers

    • One database server

    • Two web servers (which double as FTP servers)

    • One mail server

  • Five 48-port switches

  • One master switch

  • Three routers

  • Two CSU/DSUs

  • One PBX

  • Two client workstations (for remote server access without having to work directly at the server) within the server room

It appears that there is already some redundancy in place in this server room. For example, there are two domain controllers. One of them has a copy of Active Directory and acts as a secondary DC in case the first one fails. There are also two web servers, one ready to take over for the other if the primary one fails. This type of redundancy is known as failover redundancy. The secondary system is inactive until the first one fails. Also, two client workstations are used to remotely control the servers; if one fails, another one is available.

Otherwise, the rest of the servers and other pieces of equipment are one-offs—single instances in need of something to prevent failure. There are a lot of them, so you truly need to redundacize. Hey, it’s a word if IT people use it! It’s the detailed approach to preparing for problems that can arise in a system that will make for a good IT contingency plan. Try to envision the various upcoming redundancy methods used with each of the items listed previously in the fictitious server room.
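One way to keep track of which items still need attention is to record the inventory along with a note on whether each item already has a failover partner. The following Python sketch does this for the fictitious server room. The has_failover flags simply follow the narrative above and are assumptions about this scenario, not the output of a real discovery tool.

```python
# Equipment from the scenario: name -> (count, has_failover).
# The has_failover flags are assumptions that follow the narrative:
# only the DCs, web servers, and client workstations are described
# as having a standby ready to take over.
EQUIPMENT = {
    "domain controller": (2, True),
    "DNS server": (1, False),
    "file server": (2, False),
    "database server": (1, False),
    "web/FTP server": (2, True),
    "mail server": (1, False),
    "48-port switch": (5, False),
    "master switch": (1, False),
    "router": (3, False),
    "CSU/DSU": (2, False),
    "PBX": (1, False),
    "client workstation": (2, True),
}


def needs_redundancy(equipment):
    """Return the names of items that have no failover partner."""
    return [
        name
        for name, (_count, has_failover) in equipment.items()
        if not has_failover
    ]


for name in needs_redundancy(EQUIPMENT):
    print(f"plan redundancy for: {name}")
```

The count alone is deliberately not used to decide redundancy. Two file servers that hold different data, for example, do not back each other up.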

But before we get into some hard-core redundancy, let’s discuss the terms fail-open and fail-closed. Fail-open means that if a portion of a system fails, the rest of the system will still be available or “open.” Fail-closed means that if a portion of a system fails, the entire system will become inaccessible or simply shut down. Depending on the level of security your organization requires, you might have a mixture of fail-open and fail-closed systems.

In the previous server room example, there is a DNS server and a database server. Let’s say that the DNS server forwards information to several different zones and that one of those zones fails for one reason or another. You might decide that it is more beneficial to the network to have the rest of the DNS server continue to operate and service the rest of the zones instead of shutting down completely, so you would want the DNS server to fail-open. However, the database server might have confidential information that you cannot afford to lose, so if one service or component of the database server fails, you might opt to have the database server stop servicing requests altogether, or in other words, to fail-closed. Another example would be a firewall/router. If the firewall portion of the device failed, you would probably want the device to fail-closed. Even though the network connectivity could still function, you probably wouldn’t want it to because there is no firewall protection.

Your solution depends on the level of security you require and the risk that can be associated with devices that fail-open. It also depends on whether the server or device has a redundancy associated with it. If the DNS server mentioned previously has a secondary redundant DNS server that is always up and running and ready to take requests at a moment’s notice, you might opt to instead configure the first DNS server to fail-closed and let the secondary DNS server take over entirely.
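The difference between the two failure modes can be illustrated with a short Python sketch. The FailureMode enumeration, the is_available function, and the POLICY table are hypothetical names invented for this example; the policy simply mirrors the DNS, database, and firewall/router examples from the text.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_OPEN = "fail-open"
    FAIL_CLOSED = "fail-closed"


def is_available(mode: FailureMode, failed_components: int) -> bool:
    """Report whether the rest of the system keeps serving requests."""
    if failed_components == 0:
        return True  # nothing has failed, so the mode does not matter
    # After a partial failure, only a fail-open system stays available.
    return mode is FailureMode.FAIL_OPEN


# Hypothetical policy mirroring the examples in the text.
POLICY = {
    "DNS server": FailureMode.FAIL_OPEN,
    "database server": FailureMode.FAIL_CLOSED,
    "firewall/router": FailureMode.FAIL_CLOSED,
}

for device, mode in POLICY.items():
    serving = is_available(mode, failed_components=1)
    state = "still serving" if serving else "shut down"
    print(f"{device}: {state} after a component failure ({mode.value})")
```

Keeping the choice in a per-device table reflects the point that an organization will usually run a mixture of both modes, chosen by weighing availability against the risk of operating without a protective component.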

Disaster Recovery Planning

Disaster recovery plans should include information regarding redundancy, such as sites and backup, but should not include information that deals with the day-to-day operations of an organization, such as updating computers, patch management, monitoring and audits, and so on. It is important to include only what is necessary in a disaster recovery plan. Too much information can make it difficult to use when a disaster does strike.

Although not an exhaustive set, the following written disaster recovery policies, procedures, and information should be part of your disaster recovery plan:

  • Contact information: You should identify the people or resources to contact if a disaster occurs and how employees will contact the organization.

  • Impact determination: This procedure determines a disaster’s full impact on the organization. It includes an evaluation of assets lost and the cost to replace those assets.

  • Functional recovery plan: This plan is based on the determination of disaster impact. It will have many permutations depending on the type of disaster. The recovery plan includes an estimated time to complete recovery and a set of steps defining the order of what will be recovered and when. It might also include an after action report (AAR), which is a formal document designed to determine the effectiveness of a recovery plan in the case that it was implemented.

  • Business continuity plan: This plan defines how the business will continue to operate if a disaster occurs; the BCP is often carried out by a team of individuals. A BCP is also referred to as a continuity of operations plan (COOP). Over the years, BCPs have become much more important, and depending on the organization, a BCP might actually encompass the entire DRP. It also comprises business impact analysis, the examination of critical versus noncritical functions. These functions are assigned two different values or metrics: recovery time objective (RTO), the acceptable amount of time to restore a function (for example, the time required for a service to be restored after a disaster), and recovery point objective (RPO), the acceptable amount of data loss expressed as time, or the maximum age of the most recent recoverable data (for example, an RPO of four hours means that backups or replicas must be no more than four hours old). It’s impossible to foresee exactly how long it will take to restore service after a disaster, but with the use of proper archival, hot/warm/cold sites, and redundant systems, a general timeframe can be laid out, and an organization will be able to decide on a maximum timeframe to get data back online. This, in effect, is IT contingency planning (ITCP).

Some organizations have a continuity of operations planning group or crisis management group that meets every so often to discuss the BCP. Instead of running full-scale drills, they might run through tabletop exercises, in which the group talks through simulated disasters in real time (a sort of role playing, if you will). This approach can save time and be less disruptive to employees, but it is more than just a read-through of the BCP. It can help identify critical systems and mission-essential functions of the organization’s network as well as failover functionality and alternate processing sites. It can also aid in assessing the impact of a potential disaster on privacy, property, finance, the reputation of the company, and most importantly, life itself.

  • Copies of agreements: Copies of any agreements with vendors of redundant sites, ISPs, building management, and so on should be stored with the DR plan.

  • Disaster recovery drills and exercises: Employees should be drilled on what to do if a disaster occurs. These exercises should be written out step by step and should conform to safety standards.

  • Hierarchical list of critical systems and critical data: This list includes all the mission-essential functions and identification of critical systems necessary for business operations: domain controllers, firewalls, switches, DNS servers, file servers, web servers, and so on. They should be listed by priority. Systems such as client computers, test computers, and training systems would be last on the list or not listed at all. You should also include (somewhere in the DRP) some geographic considerations. For example, are there offsite backups or virtualization in place? What is the physical distance to those backups and virtual machines? And, are there legal implications? For instance, are there data sovereignty implications—meaning, will it be difficult to gain access to data and VMs stored in a different country based on the laws of that country? For each disaster recovery site, a site risk assessment should be completed to determine the risk based on the actual site-specific conditions. For instance, the geographical location of the DR site might raise additional risks that other sites do not face.
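A prioritized list of critical systems can be combined with RTO and RPO targets to spot gaps in a plan before a disaster strikes. The Python sketch below is a minimal illustration under stated assumptions: RPO is treated as the maximum tolerable age of the most recent recoverable data, so it is compared with the backup interval, and RTO is compared with an estimated restore time. The SystemPlan class, the system names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SystemPlan:
    name: str
    priority: int                 # 1 = restore first
    rto_hours: float              # acceptable time to restore the function
    rpo_hours: float              # acceptable age of newest recoverable data
    est_restore_hours: float      # estimated time needed to restore
    backup_interval_hours: float  # time between backups or replicas


def gaps(plans):
    """Yield (name, problem) pairs, highest-priority systems first."""
    for plan in sorted(plans, key=lambda p: p.priority):
        if plan.est_restore_hours > plan.rto_hours:
            yield plan.name, "estimated restore time exceeds RTO"
        if plan.backup_interval_hours > plan.rpo_hours:
            yield plan.name, "backup interval exceeds RPO"


# Hypothetical figures for illustration only.
PLANS = [
    SystemPlan("file server", 2, rto_hours=8, rpo_hours=4,
               est_restore_hours=12, backup_interval_hours=24),
    SystemPlan("domain controller", 1, rto_hours=4, rpo_hours=1,
               est_restore_hours=2, backup_interval_hours=1),
]

for name, problem in gaps(PLANS):
    print(f"{name}: {problem}")
```

With these made-up numbers, the domain controller meets both targets, whereas the file server misses both, which suggests either more frequent backups and a faster restore method or a relaxed target agreed to by the business.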

Generally, the chief security officer (CSO) or other high-level executive will be in charge of DR planning, often with the help of the information systems security officer (ISSO); however, who is in charge depends on the size of the organization and the types of management involved. That said, any size organization can benefit from proper DR planning. This information should be accessible at the company site, and a copy should be stored offsite as well. If your organization conforms to special compliance rules, you should consult them when designing a DR plan. Depending on the type of organization, yet other items might go into your DR plan.

Chapter Review Activities

Use the features in this section to study and review the topics in this chapter.

Review Key Topics

Review the most important topics in the chapter, noted with the Key Topic icon in the outer margin of the page. Table 34-5 lists a reference of these key topics and the page number on which each is found.

Table 34-5 Key Topics for Chapter 34

Key Topic Element    Description                             Page Number

Section              Risk Types                              917

Section              Risk Management Strategies              918

Section              Risk Analysis                           919

Table 34-3           Quantitative Risk Assessment Example    923

Table 34-4           Risk Assessment Types                   923

Section              Disaster Analysis                       924

Section              Business Impact Analysis                926

Define Key Terms

Define the following key terms from this chapter, and check your answers in the glossary:

external risk

internal risk

theft of intellectual property

software compliance/licensing

legacy systems

multiparty

risk management

information assurance (IA)

risk transference

cybersecurity insurance

risk avoidance

risk mitigation

risk acceptance

residual risk

risk assessment

threat likelihood

impact assessment

risk register

risk control assessment

risk control self-assessment

risk matrix/heat map

risk appetite

inherent risk

risk awareness

qualitative risk assessment

asset value

impact

quantitative risk assessment

single loss expectancy (SLE)

annualized rate of occurrence (ARO)

annualized loss expectancy (ALE)

environmental disaster

person-made disaster

mean time between failures (MTBF)

mean time to repair (MTTR)

mean time to failure (MTTF)

disaster recovery plan (DRP)

single point of failure

recovery time objective (RTO)

recovery point objective (RPO)

mission-essential functions

identification of critical systems

site risk assessment

Review Questions

Answer the following review questions. Check your answers with the answer key in Appendix A.

1. Which type of plan is based on the determination of disaster impact?

2. ____________ is the time required for a service to be restored after a disaster.

3. What procedure is used to determine a disaster’s full impact on the organization?

4. What is considered the risk left over after a detailed security plan and disaster recovery plan have been implemented?

5. What is considered an element, object, or part of a system that, if it fails, causes the whole system to fail?

6. ___________ defines the average number of failures per million hours of operation for a product in question.

7. Which type of assessment measures risk by using exact monetary values?

8. What term is used when risk is reduced or eliminated altogether?

9. Which type of assessment assigns numeric values to the probability of a risk and the impact it can have on the system or network?

10. What is the attempt to determine the number of threats or hazards that could possibly occur in a given amount of time to your computers and networks?
