DOMAIN 3
Cloud Platform and Infrastructure Security

THE CLOUD SECURITY PROFESSIONAL must understand the cloud infrastructure for each of the cloud delivery models. The infrastructure includes the physical components of the cloud, the services they provide, and the communication infrastructure that allows us to connect with the cloud. Each part of the infrastructure has specific security needs and shared security responsibilities.

In the shared security model of cloud computing, it is easy, but incorrect, to assume that security is the sole responsibility of the Cloud Service Provider (CSP). The cloud security professional needs to understand the unique security needs of the cloud environment and the technologies, such as virtualization, that make the cloud possible. The security professional must understand the nuances affecting cloud security, including which part of security is the responsibility of the CSP and which part is the responsibility of the customer. The security professional must be clear on how the security provided by the CSP and the customer work together to protect the customer's processes and data.

Finally, the cloud security professional must understand how the cloud can support the business. The cloud supports more than day-to-day operations. The cloud also provides important tools to support business continuity and disaster recovery. In this role, the cloud becomes more of a business continuity and resiliency tool.

COMPREHEND CLOUD INFRASTRUCTURE COMPONENTS

There are common components to all cloud infrastructure. These components are all physically located with the CSP, but many are accessible via the network. In the shared responsibility security model, both the customer and the CSP share security responsibilities for the cloud environment. Those responsibilities are discussed with each component.

Physical Environment

The physical environment is composed of the server rooms, data centers, and other physical locations of the CSP. The identity of the CSP may vary by cloud deployment model, as follows:

  • A private cloud is often built by a company on-premise. If this occurs, the company/customer is the CSP. If the private cloud is built using a commercial vendor, such as AWS or Azure, the CSP is the commercial vendor.
  • In a community cloud, a member of the community often hosts the space and equipment needed. The community member hosting the cloud is the CSP. If the community cloud is built using a commercial cloud service, such as AWS or Azure, the CSP is the commercial vendor.
  • In public clouds, the CSP is the company providing the service, such as AWS, Microsoft, Google, IBM, and so on. This commercial vendor is the CSP.

The physical environment is under the sole control of the CSP. CSPs provide all physical security. They are also responsible for monitoring, auditing, and maintaining a secure environment. The security risks are those common to all data centers, regardless of whether they are on-premise or provided by a third party. These include physical security of the equipment, identity and access management, data integrity and confidentiality, and so on.

The CSP uses the typical methods to address these risks. For physical security, these measures include locks, security personnel, lights, fences, and so on. Identity and access management (IAM) will use tools such as SAML or OAuth2 or may use a cloud vendor such as Okta to provide these services. IAM allows single sign-on (SSO) and is supported by multifactor authentication, tokens, and other authentication tools. Data confidentiality and integrity are supported by IAM and the use of cryptographic technologies such as encryption, message digests, digital certificates, and public-key infrastructure (PKI), and transmission methods such as Transport Layer Security (TLS, the successor to Secure Sockets Layer) and virtual private networks (VPNs).
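To make the integrity piece concrete, the following minimal Python sketch (standard library only) computes a message digest for a record; recomputing the digest later detects any tampering:

    import hashlib

    def sha256_digest(data: bytes) -> str:
        """Return the hex SHA-256 digest used to verify integrity."""
        return hashlib.sha256(data).hexdigest()

    record = b"customer-id=1001;balance=2500.00"
    stored_digest = sha256_digest(record)

    # Later, recompute the digest; any change to the record changes the digest.
    assert sha256_digest(record) == stored_digest
    print("Integrity verified:", stored_digest)

Digital signatures and PKI build on the same idea by signing the digest with a private key.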

Network and Communications

The networking components housed by the CSP are the responsibility of the CSP, which maintains their security. Components housed at the customer’s facility (on-premise) are the responsibility of the customer. The largest area of concern is the public Internet that exists between the two. Who is responsible for the security of the public Internet? The answer is everyone who uses it: security is a shared responsibility.

For this reason, the CSP must support and the customer must use secure protocols such as HTTPS, encrypt the data prior to transmission, or use a VPN for secure data communication. At each end of the transmission pipeline, both the customer and the CSP are responsible for the firewalls and other systems needed to maintain secure communications.

In the shared security model, the CSP provides tools for secure computing, logging, encryption, and so on. However, it may be the responsibility of the customer to properly set up and configure these services. In all cases, it is the responsibility of the customer to connect to and transmit data to the service securely.
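As a small illustration of connecting securely, the following Python sketch opens a TLS connection with full certificate and hostname validation before any data is transmitted (the host name is a placeholder for your CSP's endpoint):

    import socket
    import ssl

    HOST = "example.com"  # placeholder; substitute your CSP's API endpoint

    # create_default_context() enables certificate validation and hostname checking.
    context = ssl.create_default_context()

    with socket.create_connection((HOST, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            print("Negotiated protocol:", tls.version())  # e.g., TLSv1.3
            print("Server certificate subject:", tls.getpeercert()["subject"])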

Consider, for example, that you are a company shipping a product. You have many ways in which you can accomplish this.

  • You could create your own fleet, leasing aircraft from a supplier, but providing your own pilots and cargo personnel. In this case, it is similar to an infrastructure as a service (IaaS) environment. The supplier is responsible for the aircraft provided, but not how you use the aircraft or the safety and security of your product.
  • In a different approach, the supplier may provide the aircraft and pilots, and you handle the storage and security of the cargo. You remain responsible for the safety and security of your product, and the supplier provides a safe aircraft that is flown safely.
  • In the simplest method, you drop off the package with the supplier. Once in their possession, the supplier is responsible for the security of the package. They are not responsible for the manner in which the product is packaged or the condition of the product when it is dropped off with the service.

IaaS, platform as a service (PaaS), and software as a service (SaaS) work similarly to the previous list of examples.

  • In an IaaS service model, the customer is responsible for configuring the environment, for enforcing company policies in the use of the systems (just as if the systems were on-premise), and for the connection to the CSP. The CSP is responsible for the technology provided but not how it is used.
  • In a PaaS model, the CSP is responsible for the physical components, the internal network, and the tools provided. The customer is responsible for the proper use of those tools and for the connection to the CSP.
  • In a SaaS model, the customer remains responsible for access to the cloud service in a secure manner—using appropriate technologies to connect with and transmit data to and from the service securely.

Compute

The compute resources are the infrastructure components that deliver virtual machines (VMs), disk, processor, memory, and network resources. These resources are owned by and under the direct control of the CSP. The security issues are the same for these resources in a cloud environment as they are on-premise with the additional challenge of multitenancy.

The CSP in every delivery and service model remains responsible for the maintenance and security of the physical components. These are owned and supported by the CSP. The customer in every delivery and service model remains responsible for their data and their users. Between the physical components, there is a vast array of software and other components. Who is responsible for each of these remaining parts varies by service and delivery model and sometimes by the CSP. The contract between the customer and the CSP should spell out the responsibilities for each part of the cloud environment. Typical responsibilities will be described for each service model (IaaS, PaaS, and SaaS).

In an IaaS environment, the CSP provides the hardware components and may provide networking, virtualization, and operating systems. The security of the physical components provided by the CSP is the responsibility of the CSP. When the CSP provides software components, such as virtualization and operating system (OS) software, the CSP is responsible for the versioning and security of those components.

When the CSP provides the virtualization and OS software, some of the software configuration may be left to the customer. When this is the case, the customer is responsible for the security implications of the configuration they choose. In the IaaS service model, the customer has the most responsibility for security. Other than the hardware components and system software provided, the customer remains responsible for all other security for the tools they install, software they develop, and, of course, all identity and access management, customer records, and other data. These responsibilities include the patching and versioning of the software installed or developed by the user and the security of data at rest and in motion.

In a PaaS environment, the CSP is responsible for the security and maintenance of all the components and systems they were responsible for in the IaaS service model. In addition, the CSP is responsible for all additional services they provide. These services can include storage, transformation and analysis, and numerous others. While the customer may have some ability to configure these services, all services provided by the CSP will usually be maintained by the CSP, including patching and versioning. The customer is responsible for the configuration and use of all CSP systems and services provided, as well as the patching and versioning of all software the customer installs or develops. The customer is always responsible for the security of their data and users. The contract with the CSP should address these issues.

In a SaaS environment, the customer is generally responsible for the customization of the SaaS service, as well as the security of their users and the data used or produced by the service. The CSP is responsible for the security of all other compute resources.

Virtualization

There are two types of hypervisors that provide virtualization: Type-1 hypervisors (also known as bare-metal hypervisors) and Type-2 hypervisors (also known as hosted hypervisors). Among the major CSPs, a Type-1 hypervisor is more common. Examples include the version of the Xen hypervisor used by AWS and the Hyper-V hypervisor provided in Azure.

In a Type-1 hypervisor, the hypervisor sits directly on the physical server and its associated hardware components. The Type-1 hypervisor does not sit on an OS; rather, the hypervisor provides the OS functionality. VMs sit atop the hypervisor. You manage the virtual environment using a management console separate from the hypervisor. In the Type-1 environment, VMs can move between physical machines; when done properly, the move is invisible to the end user. In addition to Hyper-V and XenServer (now Citrix Hypervisor), other common Type-1 hypervisors include VMware vSphere with ESX/ESXi and Oracle VM.

In a Type-2 hypervisor, there are typically three layers in the virtualization portion. At the bottom is the host OS, such as Windows, macOS, or Linux. The next layer is the hypervisor itself. Examples include Oracle VM VirtualBox, VMware Workstation Pro/VMware Fusion, and Windows Virtual PC. These are not usually used for enterprise solutions but are used for individual needs and testing environments. The top layer consists of the individual VMs. Management of the VMs is built into the Type-2 hypervisor.

The security of the hypervisor is always the responsibility of the CSP. The virtual network and virtual machine may be the responsibility of either the CSP or the customer, depending on the cloud service model as described in the earlier “Compute” section.

Hypervisor security is critical. If unauthorized access to the hypervisor is achieved, the attacker can access every VM on the system and potentially obtain the data stored on each VM. In both Type-1 and Type-2 hypervisors, security of the hypervisor is critical to avoid hypervisor takeover as the hypervisor controls all VMs. In a Type-2 hypervisor, host OS security is also important, as a breach of the host OS can potentially allow takeover of the hypervisor as well. Proper IAM and other controls limiting access to those with both proper credentials (authentication) and a business need (authorization) can protect your systems and data. In a cloud computing multitenant model, the problem has an additional challenge. The attacker may have permission to be on the server hosting the VMs, as the attacker may be another customer of the CSP. Security of the hypervisor is the responsibility of the CSP.

CSP hypervisor security includes preventing physical access to the servers and limiting local and remote access to the hypervisor. The access that is permitted must be logged and monitored. The CSP must also keep the hypervisor software current and updated.

The virtual network between the hypervisor and the VM is also a potential attack surface. The responsibility for security in this layer is often shared between the CSP and the customer. In a virtual network, you have virtual switches, virtual firewalls, virtual IP addresses, etc. The key to security is to isolate each virtual network so that there is no possibility to move from one virtual network to another virtual network. This isolation will reduce the possibility of attacks being launched from the physical network or from the virtual network of other tenants, preventing attacks such as VLAN hopping.

A control layer between the real and virtual devices, such as switches and firewalls, and the VMs can be created through the use of software-defined networking. In AWS, when you create a virtual private cloud (VPC), the software-defined networking creates the public and private networking options (subnets, routes, etc.). To provide security to the software-defined network, you will need to manage both certificates and communication between the VM management plane and the data plane. This includes authentication, authorization, and encryption.
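For illustration, a hedged sketch using the AWS SDK for Python (boto3) shows how a customer defines such a software-defined network; the region and CIDR blocks are hypothetical, and valid AWS credentials are assumed:

    import boto3

    # Hypothetical region and CIDR blocks; adjust for your environment.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    # The VPC is a software-defined network isolated from other tenants.
    vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
    vpc_id = vpc["Vpc"]["VpcId"]

    # Subnets carve the VPC into private and public segments.
    private = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
    public = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.2.0/24")

    print("Created VPC", vpc_id, "with subnets",
          private["Subnet"]["SubnetId"], public["Subnet"]["SubnetId"])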

The security methods for a virtual network are not that much different from physical networks. But the tools used on the physical networks may be unable to see and monitor virtual traffic. Using security tools designed specifically for virtual environments is recommended.

The final attack surface in the virtualization space is the VM itself. The responsibility for the security of the VM may be shared but is usually the responsibility of the customer in an IaaS model. In the PaaS model, the security of the VM may be the responsibility of either the CSP or the customer, depending on who creates and uses the VM. If the CSP uses a VM to provide a PaaS service, the responsibility for security is the CSP’s. However, if the customer is creating VMs, the customer is responsible for the security of their VMs. In a SaaS model, the VM is created and used by the CSP to provide a service, and the responsibility of VM security rests on the CSP.

Storage

Cloud storage has a number of potential security issues. These responsibilities are shared by the customer and the CSP. At a basic level, the CSP is responsible for physical protection of data centers, and the customer is responsible for the security and privacy of data and customer information. The CSP is responsible for the security patches and maintenance of data storage technologies and other data services they provide, such as an AWS S3 bucket. The customer is responsible for using these storage tools securely.

These tools provided by the CSP will provide a set of controls for secure storage. The customer is responsible for assessing the adequacy of these controls and properly configuring and using the controls available. These controls can include how the data is accessed (through the public Internet or via a VPN, for example) and how the data is protected at rest and in motion. For example, the CSP may provide the ability to encrypt data at rest and methods to transfer data securely. It is the customer's responsibility to use these controls to protect their data. Failure to properly configure secure storage using available controls is the fault of the customer.
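As an example of using such controls, the following boto3 sketch applies two common baseline settings to a hypothetical S3 bucket: blocking all public access and requiring server-side encryption by default:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-customer-bucket"  # hypothetical bucket name

    # Block all forms of public access -- a common baseline control.
    s3.put_public_access_block(
        Bucket=BUCKET,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )

    # Require server-side encryption by default for all new objects.
    s3.put_bucket_encryption(
        Bucket=BUCKET,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
        },
    )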

In a cloud environment, you lose control of the physical medium where your data is stored while retaining responsibility for the security and privacy of that data. The challenges include the inability to securely wipe physical storage and the possibility of another tenant being allocated storage space that was once allocated to you. This creates the possibility of fragments of your data files existing on another tenant's allocated storage space. You retain responsibility for this data and cannot rely on the CSP to securely wipe the physical storage areas.

Compensating controls for the lack of physical control of the storage medium include only storing data in an encrypted fashion and employing crypto shredding when the data is no longer needed. These controls will provide protection against data fragments being retrieved.
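A minimal sketch of the idea, using the Python cryptography package: data is only ever written to cloud storage in encrypted form, and crypto shredding amounts to destroying every copy of the key:

    from cryptography.fernet import Fernet  # pip install cryptography

    # Encrypt before the data ever reaches cloud storage.
    key = Fernet.generate_key()
    cipher = Fernet(key)
    ciphertext = cipher.encrypt(b"sensitive customer record")

    # ...ciphertext is what gets written to the storage service...

    # Crypto shredding: destroy every copy of the key. Any ciphertext
    # fragments left on reallocated storage are now unrecoverable.
    del cipher
    key = None

In practice the key would live in a key-management system, and shredding means deleting it there as well.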

Management Plane

The management plane provides the tools (web interface and APIs) necessary to configure, monitor, and control your cloud environment. It is separate from and works with the control plane and the data plane. If you have control of the management plane, you have control of the cloud environment.

Control of the management plane is essential and starts with limiting and controlling access to the management plane. If you lose control of the management plane, you are no longer in control of your cloud environment.

The most important account to protect is root, or any named account that has administrative/superuser functionality. The start of this protection is enforcement of a strong password policy. The definition of a strong password is an evolving question that has recently been addressed by NIST in SP 800-63, Digital Identity Guidelines. In the final analysis, longer is better, and passphrases are easy to remember and difficult to guess.

A strong password policy needs to be coupled with other measures for the critical root and administrative accounts, as they hold the keys to the kingdom, so to speak. Multifactor authentication (MFA) should also be implemented. The best method is a hardware token that is stored securely, but other methods also improve on basic password protection, including hardware tokens kept by individuals or even SMS texts. In general, software solutions add some protection, but not as much as hardware solutions. Over time, some methods will be or have been shown to be less secure than others. Where possible, less secure methods should be supplemented by or replaced with more secure methods.
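A minimal sketch of an SP 800-63-style check: favor length and screen against known-compromised passwords rather than enforcing arbitrary complexity rules (the breached list and minimum length here are illustrative):

    # Reject short passwords and known-breached values; long passphrases pass.
    BREACHED = {"password123", "letmein2024", "correcthorsebatterystaple"}

    def password_acceptable(candidate: str, min_length: int = 15) -> bool:
        if len(candidate) < min_length:
            return False
        if candidate.lower() in BREACHED:
            return False
        return True

    print(password_acceptable("purple-tractor-quietly-hums"))  # True: long passphrase
    print(password_acceptable("P@ssw0rd!"))                    # False: too short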

Role-based access control (RBAC) or access groups are other methods to limit access to these sensitive accounts. Using RBAC or access groups makes management of these groups and permissions important. If rights are not deleted when an employee changes positions or employment, access can become too broad very quickly. Another step is to limit access to users on-premise or through a VPN, if remote work is required.

Another method for limiting access is to use attribute-based access control (ABAC), also called policy-based access control. Using this method, a variety of attributes can be combined in complex Boolean expressions to determine access. Typical attributes such as username can be used, as well as atypical attributes such as geographic and time restrictions. For example, a policy might require that you be on a corporate endpoint attached to the corporate network locally when accessing the accounts after hours.
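A hedged sketch of how such a policy might be evaluated, mirroring the after-hours example above (the attribute names and business hours are hypothetical):

    from datetime import datetime, time

    # After-hours access to sensitive accounts requires a corporate
    # endpoint on the local corporate network.
    BUSINESS_HOURS = (time(8, 0), time(18, 0))

    def access_permitted(user_role: str, corporate_endpoint: bool,
                         on_local_network: bool, now: datetime) -> bool:
        if user_role != "admin":
            return False
        in_hours = BUSINESS_HOURS[0] <= now.time() <= BUSINESS_HOURS[1]
        if in_hours:
            return True
        # After hours: both attributes must be true.
        return corporate_endpoint and on_local_network

    # 22:30 access from a corporate endpoint off the local network: denied.
    print(access_permitted("admin", True, False, datetime(2023, 6, 1, 22, 30)))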

Each of these methods can make accessing critical root or administrative accounts more difficult for both legitimate users and malicious users alike. How tightly you lock down these accounts is in direct proportion to how valuable the information and processes in your cloud are to your business. There is a balance in this, to create as much security as possible while maintaining reasonable business access.

Root and administrative accounts are typically the only accounts with access to the management plane. The end user will generally have some limited access to the service offering tools for provisioning, configuring, and managing resources. The degree of control will be determined by each business. The end user will normally be restricted from accessing the management plane. The separation of management and other workforce uses makes the creation of separate accounts for development, testing, and production an important method of control.

In instances where the management functions are shared between the customer and the CSP, careful separation of those functions is necessary to provide proper authorization and control. In a Cisco cloud environment, the management plane protection (MPP) tool is available. AWS provides the AWS Management Console.

These are some of the methods that can be used to protect the cloud management plane. A layered defense is important, and the amount of work used to protect the management plane is, in the end, a business decision. The cloud security professional must be aware of the methods available for protection in order to be a trusted advisor to the business in this matter.

DESIGN A SECURE DATA CENTER

Designing a secure data center is challenging, with physical siting, environmental controls, logical and physical controls, and communication needs all to be considered. In a cloud environment, many of these traditional concerns are the responsibility of the CSP or cloud vendor, as they have physical control and ownership of the data center and the physical infrastructure. The customer may be able to review the physical, environmental, and logical controls of the vendor's underlying infrastructure in limited cases.

If the vendor uses a CSP such as Google, Amazon, or IBM to provide their infrastructure needs, such a review becomes logistically impossible, as each has more than 60 data centers located across four regions: North America; Asia-Pacific (APAC); Europe, the Middle East, and Africa (EMEA); and Latin America.

Cloud customers have the ability to create a logical data center. A logical data center is a construct, much like a container, where the customer designs the services, data storage, and connectivity within their instance of the cloud service. However, the physical mapping of this design to the underlying architecture is not controlled by the customer as it would be in an on-premise data center.

Logical Design

The logical design of a data center is an abstraction. In designing a logical data center, the customer utilizes software and services provided by the CSP. If used securely, the logical design can provide a secure data center. The needs of a data center include access management, monitoring for compliance and regulatory requirements, patch management, log capture and analysis, and configuration of all services.

In a logical data center design, a perimeter needs to be established with IAM and monitoring of all attempts to access the data. Access control can be accomplished through various IAM methods, including authentication and authorization, security groups, VPCs, management and other consoles, and so on. The CSP equivalent to software firewalls, traffic monitoring, and similar services can be implemented to monitor data center activities and alert on potentially malicious behavior.

All services used should have a standard configuration. This configuration is determined by the business and specifies how each approved cloud service is to be configured and can be used. Using a standard pattern/configuration makes administering and maintaining cloud services simpler and often more secure. Variations from patterns approved by each business should be approved through an exception process that includes the risk of any variance. Secure baseline configurations can provide a more secure environment for the data center. The configuration of these cloud services can be monitored and changes can be either prevented or alerted and logged.
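The following minimal Python sketch shows the drift-detection idea: compare a service's live settings to the approved baseline and alert on any variance (keys and values are hypothetical; a real check would pull live settings from the CSP's API):

    # Approved baseline configuration for a hypothetical cloud service.
    BASELINE = {"encryption_at_rest": True, "public_access": False, "logging": True}

    def find_drift(live_config: dict) -> dict:
        """Return settings that differ from the approved baseline."""
        return {k: live_config.get(k) for k, v in BASELINE.items()
                if live_config.get(k) != v}

    deployed = {"encryption_at_rest": True, "public_access": True, "logging": True}
    drift = find_drift(deployed)
    if drift:
        print("ALERT - configuration drift detected:", drift)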

Connections to and from the logical data center must be secured using VPNs, Transport Layer Security (TLS), or other secure transmission methods. With an increasingly remote and mobile workforce, the need to access the data center becomes more important. A careful logical design can help to ensure that a secure data center in the cloud is possible.

Tenant Partitioning

Multitenant models make cloud computing more affordable but create some security and privacy concerns. If the walls between tenants are breached, your data is at risk. Multitenancy is not a new concept. Many business centers physically house multiple tenants. These business centers may provide some access control to the building and other general security services. But, if another tenant, a custodian, or other service vendor accesses your offices, then your data could be at risk. If the data is on whiteboards or scattered on various desks and in unsecured computing systems, it is at risk.

In a similar fashion, multiple customers share computing resources in a cloud environment. The vendor provides some basic security services, as well as maintenance and other services. However, if you leave your data “lying around,” your data could still be at risk and exfiltrated. If you monitor access, provide robust IAM services, and encrypt data in transit and at rest, the risk is greatly minimized.

In our physical business center example, if you lock up your servers and all of your data but leave the keys in a desk drawer or with the business center owner, security is lessened. In the same way, the security provided by encryption is improved if the customer securely maintains their own encryption keys external to the cloud vendor.

Access Control

When creating a logical data center, a primary concern is access. A single point of access makes access control simpler and monitoring better. If you have a physical data center with multiple doors and windows, securing the data center is more difficult. This is no different in a logical data center.

One method of access control is to federate a customer's existing IAM system with access to customer cloud resources. Depending on the sophistication of the customer's IAM and its ability to properly access cloud resources, this can be an appropriate method. This choice allows the customer to control access more directly. It becomes simpler for the business to oversee who has credentials and what resources those credentials can access. It may also be possible to extend a customer's IAM system to the cloud without federation if cross-connection is unwanted.

Another method to prevent cross-connection between cloud and on-premise resources is to use identity as a service (IDaaS) to provide access to a company's cloud services. Gartner refers to this as SaaS-provided IAM or simply SaaS IAM, in keeping with the three basic cloud service models. An IDaaS solution has the benefit of providing a service that is tailored to cloud resources and services. A SaaS IAM may include a CSP service or an independent SaaS IAM.

Regardless of whether a customer's current IAM system can be used, your first line of defense is to educate your workforce to create secure credentials that are different from the credentials they use on other corporate and personal accounts; this decreases the impact if any one IAM system is compromised.

Physical Design

Physical design is the responsibility of the owner of the cloud data center. This is generally the cloud vendor or CSP. Physical design of a data center for the CSP or cloud vendor is the same as for on-premise data centers with the additional complexity of support for multitenancy.

Location

Physical siting of a data center can limit some disaster concerns. Areas commonly impacted by natural disasters, civil unrest, or similar problems should be avoided. Multiple locations for data centers should also be considered. There is no location immune to all disasters, so locations that have different disaster risk profiles will increase availability of cloud resources. The location should also have stable power/utilities and access to multiple communication paths when possible.

Buy and Hold

A buy-and-hold decision is made when considering movement to a cloud-based data center. The considerations include the sensitivity of the data, regulations on the locations of the data, and availability of a skilled workforce capable of maintaining the on-premise location.

Sometimes a customer has data that must be kept on-premise when there are legal reasons that prevent data from crossing borders, unless the CSP contract can satisfy this requirement. Other customers may have data that is sufficiently sensitive that a breach could place individuals in harm's way. If a business does not have data that must stay on-premise, all of the data could be migrated to the cloud if a business case can be made for the migration.

For many business processes, cloud computing provides significant advantages with respect to cost, availability, and scalability. But cloud computing is not always the best approach for all business processes and data.

Environmental Design

The environmental design, like the physical location of a data center, is also the responsibility of the CSP or cloud vendor, except for a private cloud. Environmental design can impact the availability of cloud resources. For the cloud customer, review of a vendor or CSP's basic environmental design can be part of the risk analysis of the vendor.

Heating, Ventilation, and Air Conditioning

Electrical service is clearly a concern with any computing installation and has been discussed in other sections. Equally important is the ability to provide appropriate heating, ventilation, and air conditioning (HVAC) support for the facility. This can be partly mitigated by the physical siting of the facility and the construction of the facility. A facility in Canada will need to consider heating and ventilation more carefully than air conditioning (AC). Similarly, a data center near the equator will have greater concerns for AC than heat.

Regardless, HVAC concerns should be considered carefully when choosing the site for an on-premise data center or when reviewing potential cloud vendors. An HVAC failure will affect computing resources just as surely as an electrical or communication disruption. Because of the importance of HVAC in a CSP data center, part of the review of a CSP must include capacity and redundancy of HVAC systems as well as the process for movement between data centers when necessary. This includes the geographic location of the CSP data center. A data center in Hawaii may be more concerned with ventilation and humidity. A data center in Austin, Texas, would need to focus on air conditioning, and a data center in Edmonton, Canada, would focus on heating.

A number of documents can help assess HVAC concerns. A CSP SOC-2 report should have a section on availability and the controls that provide it. This should include environmental systems. SOC-2 reports can be difficult to obtain and may require a nondisclosure agreement (NDA). Other documents that may assist in determining a CSP's environmental design sufficiency would be business continuity and disaster recovery plans.

Multivendor Pathway Connectivity

Connectivity is critical in a cloud computing environment. The ability for a customer to remain connected to all cloud resources requires planning. The issue is rarely with the Internet, but instead is with the connectivity to the Internet of the customer or of the CSP. The solution often requires the use of multiple ISPs or multiple vendors providing connectivity.

These two issues are connectivity from the customer to the Internet and the connection of the CSP to the Internet. The concerns are similar but independent. A communication failure at the customer's end would impact that single tenant while leaving the vendor and all other tenants unaffected.

The cloud customer should consider multiple paths for communicating with their cloud vendor. In an era of increasingly dispersed workforce, often working from locations separate from the business network, strategies to keep the workforce connected to the Internet and the cloud vendors must be developed and tested. The solution often requires multiple vendors.

The vendor must also develop multiple connectivity strategies. If the vendor cannot access the Internet, it does not matter whether all the tenants can because they will be unable to access their cloud resources. The customer also needs a strategy for times that the vendor is unavailable. How will the customer continue to execute critical business functions? One way is to have multiple vendors for critical functions.

For example, if enabling the dispersed workforce to communicate and collaborate is a critical business function, multiple methods can be developed and tested. A simple example is with online meetings.

ANALYZE RISKS ASSOCIATED WITH CLOUD INFRASTRUCTURE

Any data center or cloud infrastructure has risks, whether it's on-premise or hosted by a third party. Many organizations are moving from an on-premise to a cloud-based infrastructure and must properly consider the risks associated with this move. A cloud-based infrastructure is not less risky; it is differently risky. A move to the cloud must consider these risks and do a cost-benefit analysis to ensure the cloud is the right move. In the end, after analyzing the risks, many organizations will end up with a hybrid environment with some local infrastructure for some processes/solutions and one or more cloud environments for other processes/solutions.

Risk Assessment and Analysis

The risk analysis of any CSP or cloud solution involves many departments. These include business units, vendor management, privacy, and information security. The new risks with a cloud solution are mostly associated with privacy and information security. There are some key issues when conducting a risk assessment for a CSP or cloud solution. Some major issues will be discussed.

Authentication is a key question. Will the cloud solution provide authentication services, or will authentication be handled by the customer? If using the cloud solution's authentication, it is unknown whether the cloud provider's authentication is secure. As users tend to reuse passwords, a breach of the cloud service's authentication server may also provide the information necessary to breach your on-premise systems.

If the customer provides their own IAM system, it may be accomplished through a SaaS IAM solution, or through federation with the customer's on-premise IAM manager. Each solution has pros and cons. For example, if using a SaaS IAM system, users are likely to use the same username and password for both the cloud and on-premise IAM systems. However, the SaaS IAM has been designed for the cloud and cloud security needs. If federating the cloud service to the on-premise system, cross-connection between the two environments may create a security concern as well. In either case, user education is an important component to the cloud IAM strategy.

Data security is a concern. How a vendor encrypts data at rest is important. In addition, the method used to transfer data to and from the vendor and between the vendor and any third-party services used by the vendor must be investigated. The data remains the responsibility of the customer even when stored on the vendor's system.

It is important to assess any risk posed by the vendor's policies and processes. These include the vendor's privacy policy, incident response process, cookie policies, information security policy, etc. You are no longer assessing only your organizational policies but also the policies of the organizations with whom you are doing business.

For some key systems, it is important to assess the support provided for incident response. This includes an assessment of logging support, vulnerability scans of the vendor, application vulnerability scans, and external assessments of the vendor being considered or currently used.

Many vendors providing services will have a SOC-2 or SOC-3 report. The preferred report is a SOC-2 Type-2 report. Accessing this report, if possible, will usually require an NDA. However, the assurance this type of report can provide is worth an NDA. Other useful attestations are ISO 27017 (cloud security) and ISO 27018 (privacy) in particular and ISO 27000-series certifications in general, FISMA, and FedRAMP. Carefully read the cover letter provided by the third-party assessor for an understanding of any special circumstances governing the findings in the report.

Cloud Vulnerabilities, Threats, and Attacks

The primary vulnerability in the cloud is that it is an Internet-based model. Anyone with access to the Internet has the potential to attack your CSP, your cloud provider, or you. For example, if you are an international business or are involved in a business that is not well regarded throughout the world (even if your business is legal in the country where you conduct it), you may be the subject of an attack that creates a potential data breach and/or a denial of service (DoS).

The attack on your CSP or cloud vendor may be unrelated to you, so the typical threats you are prepared for may not cover the full spectrum of potential threats. The threat may be targeting the vendor, may be targeting another tenant of the cloud provider, or be related to threats in the geographic location of the cloud data center. These attacks may come from anywhere in the world. You may simply be collateral damage. Regardless, the end result may be a DoS—even if you are not the intended target.

Other risks come from the other tenants. If protections keeping customer data separate fail, your data may be exposed to another tenant. This tenant could be a competitor or other bad actor. Encryption may be the best protection, with the customer managing their own encryption keys. The customer may also consider not storing their most sensitive data in the cloud.

There can be an additional risk from the cloud vendor. Employees of cloud vendors have been known to exfiltrate customer data for their own purposes. Contractual language may provide the only remedy once detected. Prevention becomes the best defense. As with other tenants, encryption with the customer managing their own keys (separate from the cloud) prevents data exposure.

Virtualization Risks

Virtualization is a powerful tool and, as such, has specific risks. The hypervisor is under the control of the CSP in the shared security model. If the hypervisor is compromised, all VMs on the hypervisor may be compromised. So, the task of the CSP protecting the hypervisor is critical.

VM sprawl is also a risk. Members of the workforce may create VMs for projects and forget to close them down when done. VM sprawl increases the attack surface as these unused VMs may not be actively monitored, so malicious use may go unnoticed. The increase in the overall number of VMs can also balloon costs to the organization unexpectedly.

Another risk with VMs is with the data stored in each VM. Sensitive data and data at different trust levels can exist in each VM unless care is taken to manage and monitor what data may be used and where it is stored. The risk associated with VM sprawl and sensitive data storage is a management issue. If not carefully monitored and managed, organizations can easily lose control of the VMs they own, putting data and budgets at risk.

Countermeasure Strategies

There are a number of ways to mitigate risks in the cloud environment. The start of security is with the selection of the CSP. This is the exercise of due diligence in selection. A careful risk assessment and analysis of CSPs will eliminate some of the riskier players in the cloud space.

Using a trusted CSP is a good start. The next step is the design of systems. Security should be designed in at every step. When using cloud services, the configuration of each service can be an important design step to ensure the most secure configuration.

The next countermeasure is encryption. Encryption should be enabled for all data at rest and data in motion. CSPs provide encryption services. Sometimes the only thing missing is the customer enabling encryption. Data in transit must be protected using TLS, IP security (IPSec), VPNs, or another encrypted transmission method. In addition, limiting the ingress/egress points in each cloud service can enhance monitoring.

Each major CSP provides the ability to manage your secure configuration, monitor changes to cloud services, and track usage. For example, AWS provides Inspector, CloudWatch, CloudTrail, and other tools to assist in managing your cloud environment. However, the best monitoring is worthless without regular and consistent review of logs.
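As one hedged example of such review, this boto3 sketch queries CloudTrail for recent management events attributed to the root user, activity that should be rare and always explainable:

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # Look up recent management events attributed to the root user.
    events = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "Username", "AttributeValue": "root"}],
        MaxResults=10,
    )
    for event in events["Events"]:
        print(event["EventTime"], event["EventName"])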

DESIGN AND PLAN SECURITY CONTROLS

The risks associated with cloud computing can be mitigated with the proper selection of controls. This is the same approach used in traditional risk management tailored to the risks associated with cloud computing.

Physical and Environmental Protection

The location housing the physical servers in a cloud environment must consider physical and environmental controls. This may be a CSP for a public or other third-party cloud. For a private cloud, it may be the company employing the cloud.

Site location has an impact on both physical and environmental protections. A data center along the waterfront in an area subject to regular hurricanes or in a location with frequent tornados, flooding, or possibly earthquake activity may be ill advised or may lead to a strategy with multiple data centers in different locations to provide redundancy. If each data center has a different risk profile, they provide greater assurance of availability.

Other physical and environmental requirements are typical in all data centers, including physical cloud data centers. These include the ability to restrict physical access, reliable power and other utilities, and the availability of an adequate workforce.

A cloud data center also has significant network capability requirements. All data centers require network capabilities. But, as a CSP may provide services to a large number of customers over a geographical area that can be worldwide, these requirements may be more substantial for the cloud data center than a single tenant data center on-premise. More than one ISP may improve connectivity redundancy in many scenarios.

The customer has no control over the physical siting of a cloud data center except for a private cloud located on-premise. That does not relieve the customer of responsibility in this regard; it simply changes the customer responsibility. To the extent possible, the customer should review the location of cloud data centers and be aware of the cloud vendor's business continuity and disaster recovery plans. The ability of the cloud vendor or CSP to respond to disasters directly affects the ability of the cloud customer to serve their customers.

System and Communication Protection

There are a number of controls available for system and communications protection. One source for controls is NIST Special Publication 800-53: Security and Privacy Controls for Information Systems and Organizations; specifically, Section 3.18, System and Communications Protection (SC). Similar controls can be found in ISO and other control sets. The SC control set includes 51 specific controls. A few of the major controls include the following:

  • Policy and Procedures: This is a primary control. Policies and procedures (P&P) are a primary foundation to all areas of security. P&P provide the foundation for all security actions by setting the purpose, scope, roles, and responsibilities.
  • Separation of System and User Functionality: This is a core control. Separation of duties is a fundamental security principle and prevents users from altering and misconfiguring systems and communication processes.
  • Security Function Isolation: Just as we want to separate the user and system functions, separating security and nonsecurity functions allows for cleaner interfaces and the ability to maintain the security function.
  • Denial-of-Service Protection: A DoS attack is a primary concern of all communication systems. Preventing a DoS attack involves dealing with bandwidth and capacity issues, detecting attacks, and decreasing pivot risk to prevent one system from attacking another.
  • Boundary Protection: This includes not only preventing malicious traffic from entering the network but also preventing malicious traffic from leaving your network, protecting against data loss (exfiltration), and meeting other boundary needs of the organization.

These are just a few of the controls in the SC control set. As mentioned, there are 51 potential controls in the NIST publication for system and communications protection. It is not expected that an organization will implement all 51 controls; rather, this is a collection of potential controls to review and consider.

Cloud computing has a shared security model, and which controls are the responsibility of the CSP and which are the responsibility of the customer should be carefully reviewed and understood. For example, the CSP should have a strategy to keep your business functioning in the event of data center disasters and communication disruptions, including DoS attacks.

In addition to the controls mentioned, the customer should ensure the ability to communicate in the event of an ISP failure by providing alternate methods of communicating with the CSP. This can be through dual ISPs or similar strategies that make communication with the vendor possible when a communication interruption occurs.

Protection of business data remains a critical issue. Controls to protect data in transit must be available and utilized. The primary control for data in motion (transit) is encryption using protocols/tools such as TLS, IPSec, or VPNs. This control is supported by a robust identification and authentication system and the creation of authorization mechanisms that enforce access controls to the system, process, and data.

Virtualization Systems Protection

The controls for VMs and virtualization systems start with secure IAM. On its own, IAM is not sufficient for full protection of your VMs. However, controlling access to VMs is your single most important control. If you get IAM wrong, other controls will be less effective. Privileged access must be strictly limited and should enforce least privilege and separation of duty controls.

Creating standard configurations for VM use in your organization will also help protect them. Variability adds complexity and makes monitoring for unauthorized use more difficult. If the organization uses standard patterns, it is much simpler to identify malicious behavior and to enforce configuration policies.

Configuration guides can provide guidance on the use of cloud services. This will assist workforce members new to the cloud in creating secure solutions. For example, an S3 bucket has many configuration options. Which options are recommended or required, and how they should be set, can be explained. Cloud tools can then enforce the configuration, preventing changes outside of business guidance or alerting on them.

Other controls exist in the cloud environment to assist with monitoring performance, identifying malicious behavior, and enforcing or identifying variance with policy. Many tools are available to provide these controls—including tools integrated with major CSPs and tools that can be added to a customer's cloud environment.

For example, Amazon provides CloudWatch. This tool can monitor EC2 instances and other data storage systems. CloudWatch can also set alarms, store logs, and display graphs, as well as respond to changes in AWS resources. The early alerting provided by this tool can prevent small changes from becoming large problems and may alert you to attacks, misuse of resources, and variations from approved customer configurations.
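A hedged boto3 sketch of the alarm capability: alert on sustained high CPU for a hypothetical EC2 instance, which could indicate misuse of resources such as cryptomining on a compromised VM (the instance ID and SNS topic ARN are placeholders):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when average CPU stays above 90% for 15 minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="high-cpu-i-0123456789abcdef0",  # hypothetical name
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,               # five-minute samples
        EvaluationPeriods=3,      # sustained for three periods
        Threshold=90.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:111122223333:security-alerts"],
    )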

Azure provides a similar tool in Microsoft Cloud Monitoring (MCM). This tool will monitor Azure applications, analyze log files, and alert customers to potential threats. Like CloudWatch, MCM will monitor utilization, performance, and workloads. In both cases, the software is built into the services each CSP provides. Other CSPs provide similar tools.

Finally, the ability to run multiple VMs from multiple locations must balance security and availability. If VM usage is not carefully managed, it can get out of control, and the attack surface for the organization can grow over time.

Identification, Authentication, and Authorization in Cloud Infrastructure

Many organizations will want to extend their current IAM system into the cloud. They may also choose a SaaS IAM and use it for both their cloud environment and their on-premise environment. They may also maintain their legacy IAM on-premise and use a separate SaaS IAM for cloud services. This is really a business decision. An SSO environment for the workforce is generally desirable but may not be possible to achieve.

Another option is to use the vendor's IAM system for each cloud resource. Using a vendor's IAM system introduces a number of risks. First, the security of the vendor's IAM system is unknown. If you are relying on the vendor to provide access to your business data, you should have confidence in their IAM system. For some cloud services, this is the only option. Careful vetting of the vendor IAM system would be necessary if not providing your own IAM system.

For each possible solution, user education is important. End users often reuse passwords. An employee may use the same username and password on all work-based systems. If a vendor is compromised, then other corporate systems may be similarly compromised. The cloud vendor may not notify you of a compromise of the IAM for an extended period of time, if they are even aware of a compromise. This puts your business systems at further risk.

Audit Mechanisms

It can be more difficult to audit either a CSP or a cloud vendor. A customer will not have broad access to the physical security of cloud data centers or the vendor's networks. This is both due to logistics and due to the presence of other tenants. A CSP may have a number of data centers that are geographically dispersed. Having a customer audit all physical controls is essentially impossible.

Broad access to the vendor network is also a security risk. If you have access to the network, you may intercept privileged information belonging to another tenant, and multitenancy boundaries could be violated. If the vendor provided you with broad access, they would also provide other customers with the same, which would be a security concern.

Vendors may also not share logs. If they are not properly disaggregated, log entries for other tenants could be exposed. However, using CSP tools allows you to monitor your own resources and provide alerts tailored to your needs. For example, AWS CloudWatch monitors operation data and can be used to create logs and generate alerts.

Log Collection

One common problem with logs is the vastness of the data collected. If using cloud services, it is important to tune the events logged to those that matter. You may set the logging broadly at first and, as you become familiar with normal operations, tune it to identify the activities most essential to log. Even without tuning, changes to privileged user accounts should always be logged and alerted on. Privileged accounts do change, but such changes should be authorized, usually in writing. Any changes made using a privileged account should also be logged.
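A minimal sketch of the privileged-account rule: scan aggregated log records (one JSON object per line) and flag any change made by a privileged account (field names are hypothetical; adapt them to your log schema):

    import json

    # Privileged accounts whose non-read actions warrant immediate review.
    PRIVILEGED = {"root", "admin", "cloudops"}

    def flag_privileged_changes(log_lines):
        for line in log_lines:
            record = json.loads(line)
            if record.get("user") in PRIVILEGED and record.get("action") != "read":
                yield record

    sample = ['{"user": "root", "action": "policy_update", "target": "iam"}']
    for alert in flag_privileged_changes(sample):
        print("IMMEDIATE REVIEW:", alert)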

A log aggregator can ingest the logs from all on-premise and cloud resources for review in the security operations center (SOC). These logs must be monitored daily, with privileged account alerts getting immediate attention. If not reviewed, logs become much less valuable. Log collection will remain valuable for incident response and audit purposes, but without regular and consistent review, the value of log collection is not fully realized.

Packet Capture

To resolve many security issues and truly know what is going on within your network, packet capture is necessary. However, packet capture on a cloud vendor or CSP using traditional tools like Wireshark is not generally possible. Such packet capture would require access to a vendor's internal network beyond what would be permissible. A vendor must protect the data of all tenants, and allowing one or more tenants to perform packet capture can put other tenants at risk. A vendor may perform packet capture and could make it available for incident response or audit purposes in some circumstances. But this should not be expected and would need to be explicitly agreed to in vendor contracts.

To address this general lack of packet capture, some CSPs provide tools that offer packet capture functionality to some degree. Two examples, from AWS and Azure, are discussed next. If the customer is using a hybrid cloud, where some portions are on the customer's network or in the customer's data center, packet capture of those segments is possible with traditional tools.

Amazon provides VPC Traffic Mirroring. This tool allows a customer to mirror the traffic of any AWS network interface in a VPC they have created and to capture that traffic for analysis by the customer's security team. This can be done directly in AWS using CloudShark to perform network analysis on the packet capture. The tool creates what is essentially a virtual tap. In this way, the customer can monitor all network traffic through their VPCs.
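For illustration, a hedged boto3 sketch creates a traffic mirror session; all resource IDs are placeholders for a source network interface, mirror target, and mirror filter created beforehand, and the parameter names follow the boto3 EC2 API:

    import boto3

    ec2 = boto3.client("ec2")

    # All IDs below are hypothetical placeholders for existing resources.
    session = ec2.create_traffic_mirror_session(
        NetworkInterfaceId="eni-0123456789abcdef0",
        TrafficMirrorTargetId="tmt-0123456789abcdef0",
        TrafficMirrorFilterId="tmf-0123456789abcdef0",
        SessionNumber=1,
        Description="Mirror VPC traffic for security analysis",
    )
    print(session["TrafficMirrorSession"]["TrafficMirrorSessionId"])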

Microsoft provides a similar capability with Azure Network Watcher. This tool allows packet capture of traffic to and from a customer's VMs. Packet capture can be started in a number of ways. One nice feature for security purposes is the ability to trigger packet capture when certain conditions exist.

The specific tools available and the use of these tools will change over time. For security purposes, the ability to capture packets in the cloud services used by the customer is important. This is an area that should be fully investigated when moving to the cloud.

PLAN DISASTER RECOVERY AND BUSINESS CONTINUITY

The cloud has transformed both disaster recovery (DR) and business continuity (BC) by providing the ability to operate in geographically distant locations and by providing greater hardware and data redundancy. All of this leads to lower recovery time objectives (RTOs) and recovery point objectives (RPOs) at price points businesses could not achieve on their own. DR and BC can be planned into the system rather than bolted on after the fact in largely unsatisfactory ways.

Risks Related to the Cloud Environment

There are risks to cloud computing. The first one that generally comes to mind is that the business no longer owns or has full control over system hardware assets. Computing becomes an operational expense (OPEX) rather than a capital expense (CAPEX). This change in ownership and control makes many people uncomfortable. However, with careful selection of CSPs and the development of SLAs and other contractual agreements, these concerns can be addressed. In addition, the service model used affects the amount of control that is retained. For example, in an IaaS environment, the business retains a large amount of control. It may be determined that the control a business wants is not necessary to ensure business continuity and to address potential disasters. The decrease in cost for DR and BC may also mean greater ability to respond to disasters.

The geographic dispersion of the CSP data centers may also mean that the disaster risk profile may be unknown to the customer and may be very different from their own. For example, the CSP data center may be in an area subject to hurricanes, while the customer is located in an area far from the coast. This risk requires the cloud customer to review the ability of a CSP to address disasters relevant to the CSP's locations. One reason it is important to use a large CSP with multiple regions is that the risk profile will be different in each region, providing greater continuity benefits. One CSP data center may be in an area subject to ice storms while another is in a more temperate location. This allows the CSP, if the contract permits, to move a customer from one area to another when natural disasters are more likely in one area than in another.

Downtime is a concern for companies, and Internet-based CSPs are not immune. It is also an issue that can affect a company not using the cloud: if the Internet is down, many companies will be unable to do business. Fault tolerance can be built in with a CSP using availability zones and automatic failover. If ISP connectivity is the concern, the major CSPs provide direct access options, such as AWS Direct Connect or Azure ExpressRoute. If fault tolerance is designed into the services a company delivers, localized connectivity failures will affect only users in the impacted region and not customers in other regions.

Compliance can be another risk in the cloud environment. Some regulations have strict rules on where PII, PHI, and other sensitive information can be stored. If a company is not careful, they may violate these restrictions. This is an area that contracts must address. Fortunately, the CSP has many customers, and legal requirements will be common among them. The CSP will need to address compliance concerns for many or all of their customers. Cross-border transfers of data are of particular concern in many areas, such as European Union (EU) countries covered by the General Data Protection Regulation (GDPR).

Another compliance concern may be availability of data. In a disaster scenario, a company may still need access to critical business processes and data even if the systems are down. Procedures for accessing this data in alternate ways (other than through the CSP) must be planned for and tested regularly. Regardless of the disaster, responsibility for the data remains with the customer, so careful selection of CSPs and alternative access methods will need to be addressed.

There is also some concern that multitenancy boundaries can be breached, APIs can be exploited, and the hypervisor may become compromised in a disaster scenario if safeguards fail. Disasters can affect more than availability. They can impact integrity and confidentiality as well. These issues can affect the ability of a business to function. Each of these must be addressed. A BC and DR plan cannot simply point to the cloud for every potential disruption. Instead, comprehensive plans are still necessary and must be developed and tested.

One risk-reducing feature that makes the cloud particularly beneficial for BC and DR is that the Internet and geographic dispersion make customer processes widely available, leading to a highly available architecture: highly available because the network is global, and highly available because not all of a major CSP's availability zones will be directly affected by the same disaster. Disasters such as storms, terrorist attacks, and other infrastructure failures are usually geographically localized.

Business Requirements

Business-critical systems require more recoverability than is often possible with local resources and corporate data centers without expending enormous resources. The cloud environment provides options to support high availability, scalable computing, and reliable data retention and storage. Three measures of these business capabilities are the RTO, or how long you can be down; the RPO, or how much data you can afford to lose; and the recovery service level (RSL), which measures how much computing power (0 to 100 percent) is needed for production systems during a disaster. The RSL does not usually include compute needs for development, test, or other environments, as these are usually nonessential during a disaster.

Recovery Time Objective

RTO is the amount of time in which a business process must be restored to a specific service level. Missing the RTO can lead to contractual violations and business consequences. The RTO is usually measured in minutes or hours; however, for some business processes, the RTO can be days. One of the authors of this book once worked for a company that had a 72-hour RTO for the network to be back up and the data backups to be restored. Clearly, this company did not have the transactional load of an Amazon or eBay type of business and could catch up in a reasonable period of time. The company ran a manual process during that 72 hours and then manually entered the transactions after the backups were restored.

At the end of the day, RTO is a business decision and not an IT decision. The role of IT is to support the business with options and costs. The business needs full information on options and costs to make the best business decision. Once the decision is made, the role of IT is to implement the decision and to make every effort to meet the business RTO.

Recovery Point Objective

The RPO is a measure of the amount of data that the business is willing to lose if a disaster or other system stoppage occurs. Once the business makes this decision, you have the appropriate RPO, and a backup frequency can be selected. If the business is willing to risk losing 24 hours of data, the systems need to be backed up daily. If the business determines that a potential loss of one hour of transactions is acceptable, an hourly backup is needed. If the business maintains paper copies of transactions between backups, these transactions may be recoverable with time and effort. For an Amazon or eBay type of business, a manual process is not practical: the number of transactions per minute is simply too large. Such businesses need other methods to maintain data continuity in the event of a disaster.

The RPO is usually measured as the amount of time during which the business is willing to risk losing transactions, although it can also be expressed as the number of transactions that may be lost during a disaster. Either way, the RPO is tightly coupled with system backup and/or redundancy strategies. With seamless failover in a cloud environment, the RPO can be essentially zero (zero seconds of lost transactions, or zero transactions lost) for all but the most catastrophic events. For organizations with a high transaction volume, the ability to fail over seamlessly can be essential. In a cloud environment, maintaining a copy of data in another region (a mirror, for example) can support an RPO of near zero. If multiple regions fail at the same time (a truly catastrophic event), the ability to maintain business processes may not be the primary consideration.
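
The arithmetic behind these statements is simple. The following minimal Python sketch makes the relationship explicit; the 500-transactions-per-hour rate is a hypothetical figure:

    # Worst case, a disaster strikes the instant before the next backup,
    # so the backup interval itself bounds the data at risk.
    def max_backup_interval_hours(rpo_hours: float) -> float:
        return rpo_hours

    def transactions_at_risk(rpo_hours: float, tx_per_hour: float) -> float:
        return rpo_hours * tx_per_hour

    print(max_backup_interval_hours(24))       # 24 -> back up at least daily
    print(transactions_at_risk(24, 500))       # 12000.0 transactions at risk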

The customer is responsible for determining how to recover in case of a disaster. The customer can use backups, availability zones, load balancers, and other techniques to provide disaster recovery. A CSP can help support recovery objectives by ensuring that no two availability zones share a single data center; otherwise, if the two zones you are using are in the same data center, a single disaster can affect both. Cloud services can also provide the monitoring needed for this high availability, and the major CSPs all provide similar capabilities. It is important to note that while the major CSPs provide these capabilities, a customer must choose to use and properly configure them. They do not come automatically and often carry additional cost.
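
As one illustrative (not prescriptive) way to configure such redundancy on AWS, the following boto3 sketch creates an Auto Scaling group whose instances are spread across subnets in two different availability zones; the group name, launch template ID, and subnet IDs are hypothetical placeholders:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Spreading instances across subnets in two availability zones means a
    # failure confined to one zone leaves capacity running in the other.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="critical-app-asg",              # hypothetical
        LaunchTemplate={
            "LaunchTemplateId": "lt-0123456789abcdef0",       # hypothetical
            "Version": "$Latest",
        },
        MinSize=2,
        MaxSize=6,
        # Hypothetical subnet IDs, each in a different availability zone.
        VPCZoneIdentifier="subnet-0aaa1111bbb22222c,subnet-0ddd3333eee44444f",
        HealthCheckType="EC2",      # replace instances that fail health checks
        HealthCheckGracePeriod=300,
    )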

Recovery Service Level

RSL measures the compute resources needed to keep production environments running during a disaster. This measure (0 to 100 percent) indicates how much of normal computing capacity the production environment requires, excluding development, test, and other environments that can be shut down during a disaster. During a disaster, the focus is on keeping key production systems and business processes running until the disaster is largely resolved and other activities can return to pre-disaster levels.
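
One way to express this calculation, assuming compute is measured in vCPUs (any consistent unit works), is the following short Python sketch:

    def recovery_service_level(disaster_compute: float,
                               production_compute: float) -> float:
        """Percentage of normal production compute required during a disaster."""
        return 100.0 * disaster_compute / production_compute

    # If production normally uses 400 vCPUs and 300 vCPUs keep the key
    # systems running during a disaster, the RSL is 75 percent.
    print(recovery_service_level(300, 400))    # 75.0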

Business Continuity/Disaster Recovery Strategy

The difference between a business continuity plan (BCP) and a disaster recovery plan (DRP) will drive the strategy. A BCP may be executed following an event that falls short of being a disaster and will always be initiated following a disaster. The BCP's purpose is to keep the business running after an event, especially a disaster, and it may rely on different locations, systems, or processes.

For example, a BCP may be initiated following an unscheduled disruption of utility services. If the power utility must shut down a facility for an unscheduled upgrade, the BCP can guide how the business will proceed. One company the author worked for had a failing electrical system at the point where utility power was delivered to the business. The company scheduled a repair with the utility provider at a convenient time. However, the system suffered a complete failure prior to the scheduled maintenance, leading to the invocation of the BCP. The BCP's purpose is to keep the business running in another location, or using different systems or processes, until power returns. In this chapter, the BCP is treated as part of the response to a disaster.

While a BCP keeps business operations going after a disaster, the DRP works on returning business operations to normal. For example, a tornado may make a facility unusable. The BCP will move operations to alternate sites. In a cloud environment, this may mean bringing additional availability zones online to ensure high availability. It may also involve relocating key personnel and the use of cold/warm/hot sites for the workforce.

The purpose of the DRP is to restore operations in the original (repaired) facility or in new permanent facilities. Once systems are restored to their original or new facilities, the disaster recovery effort is over. Restoration of systems includes normal operations, new development, regular maintenance, and other support functions.

Another way of looking at this is the BCP is concerned with critical business processes and keeping them running. The DRP is focused on returning operations to normal, which includes infrastructure and facilities. In a cloud environment, normal operations will be defined as returning to pre-disaster levels of function and not simply continuation of critical business functions.

The cloud supports the BCP/DRP strategy. The high availability of cloud services provides strong business continuity and can serve as a core part of the BCP strategy. In effect, the cloud allows the business to continue regardless of disasters affecting the customer's business facilities. The high availability of cloud services also impacts DRP strategy. As the cloud can provide resilient services at attractive price points, a company may focus more on resilient services that survive a disaster rather than processes to recover from disasters.

Creation, Implementation, and Testing of Plan

There are three parts to any successful BCP/DRP. Each is vital to the success of the organization. These include the creation, implementation, and ongoing maintenance and testing of the plans.

Because of their inherent differences, the BCP and DRP have different responsibilities within the business. Sometimes the BCP is seen as part of the DRP. However, it is really an independent plan that supports the DRP. If the business does not continue to function (the responsibility of the BCP), there is no reason to have a DRP, as there will be nothing to return to normal. In this sense, the BCP supports the DRP.

Each plan should be developed with knowledge of the other, as they must work together, and the plans will include many of the same members of the workforce. But they are separate plans.

Plan Creation

The first step in any comprehensive BCP is to do a business impact analysis (BIA). The BIA identifies the impact of process/system disruption and helps determine time-sensitive activities and process dependencies. The BIA identifies critical business processes and their supporting systems, data, and infrastructure. The BIA is essential to determining requirements and resources for achieving the RTO and RPO necessary for business continuity.

The systems can be grouped in a number of ways. One way is to identify critical processes, important processes and support processes, and the systems and data that support these processes. Critical business processes are those that impact the continued existence of the company. Email, for example, is rarely a critical business process. For a company like Amazon, inventory, purchasing, and payment may be critical processes. The selection of a critical process is a business decision.

Along with the selection of critical processes is a prioritization of systems. If you can bring up only one system at a time, which is first? Some processes will not work until other processes are back online. In that case, they are prioritized after the processes they depend on. There is no point in bringing a process online if it cannot function until other processes are available. Because of this, identification and authentication services are often among the first processes to restore. All critical processes must be prioritized over other processes.
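
This dependency-driven ordering is essentially a topological sort. As a minimal sketch using Python's standard graphlib module, with hypothetical process names and dependencies, a valid restoration order can be computed directly:

    from graphlib import TopologicalSorter   # standard library, Python 3.9+

    # Hypothetical dependency map: each process lists the processes that
    # must be restored before it can come back online.
    dependencies = {
        "payments": {"identity", "inventory"},
        "inventory": {"identity", "database"},
        "identity": {"database"},
        "database": set(),
    }

    # static_order() yields a valid restoration order: every process
    # appears only after all of its dependencies.
    print(list(TopologicalSorter(dependencies).static_order()))
    # ['database', 'identity', 'inventory', 'payments']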

After the critical processes come the other important processes. These are processes that add value and can have an impact on business profitability. Once again, these must be prioritized and ordered.

After the important processes are online, other useful and beneficial processes are brought back online. A business decision may be made not to restore some processes until normal operations resume, if they are not critical to the business.

Processes based on legacy systems can be difficult to return to pre-disaster configurations. Unless a legacy system is essential to the business, the technical debt of continuing to run it may make it expendable. If there are systems in this category, it may be worthwhile to consider replacing and retiring them prior to an emergency. Legacy systems often have legacy hardware and software requirements that are not easily restored after a disaster, and they may carry information security risks that are best avoided. If a legacy system supports a critical business process, the need to replace it may be urgent.

One advantage to a cloud-based BCP/DRP is the expectation of a modern and up-to-date infrastructure. While a legacy system could potentially be hosted in the cloud, the move to the cloud may provide the opportunity to modernize.

The creation of the DRP often follows the BCP. Knowing the critical business processes and the supporting infrastructure provides a roadmap to returning the system to pre-disaster operations.

Both the BCP and DRP should be created considering a range of potential disasters. History is a good starting point for environmental disasters. Government agencies such as FEMA and private/public partnerships like InfraGard can also provide guidance on the likely disasters that need to be considered in any geographic location.

The CSP is also a source for BCP/DRP planning. Recommendations and solutions for surviving specific business disruptions and returning to normal operations will be available from the larger CSPs, such as Amazon, Google, IBM, etc.

The creation and implementation of a DRP or BCP is a lengthy process. It is simplified by having the system criticality and prioritization determined prior to beginning plan development. The identification, criticality, and prioritization of business processes, systems, and data are a necessary first step to creating a complete and functional plan. If this prerequisite step is omitted, the creation of the plan can be disruptive. Often, every business unit and all process owners view their processes, systems, and data as important, if not essential. It is important for senior leadership to make these decisions in advance so that each group knows their place within the plan for bringing the business back online.

BCP/DRP Implementation

As part of the creation of a BCP/DRP, the identification of key personnel is important. The corporate sponsor provides key resources for creating the plan. Implementation of the BCP/DRP can include identifying alternate facilities, contracts for services, and training of key personnel.

The BCP/DRP identifies critical processes. To implement the plan, a customer may implement critical services in the cloud to take advantage of multiple availability zones, automatic failover, and even direct connection to the CSP. These choices come with a cost. The cost of high availability in the cloud is generally less than a company trying to achieve high availability on their own. In addition, the cost of building resiliency may be far less than the cost of business interruption. By identifying the critical business processes, a business can also avoid the cost of implementing high availability for noncritical systems. In many conversations on availability with business process owners, it has become clear that everyone wants high availability. But, once the costs associated with that decision are understood, few continue to believe that they have a requirement for high availability.

Critical business processes can be supported through component/data center redundancy and may run as an active-active configuration. This allows near instantaneous continuation of critical processes. Important but noncritical business processes may use the less expensive active-passive configuration. This allows a more rapid restoration of services when a specific region or zone becomes unavailable. Less important business processes may be run in a single region or zone, may take more time to return to service, and may operate at a lower cost.

Methods to provide high availability and resiliency continue to evolve. The ability to automate monitoring and deployment through orchestration and other methods supports high availability in the cloud. New tools and methods will continue to be developed, leading to an ever more resilient cloud environment at attractive prices.

Implementing a BCP/DRP will also set the schedule for training and testing. Generally, plans should be reviewed and tested at least annually.

BCP/DRP Testing

A BCP and DRP should be tested at least annually. This test should involve the key personnel needed for all disasters. In addition, many scenarios will involve members of the workforce not directly involved in the creation and implementation of the plans.

There are some basic scenarios that apply to most, if not all, businesses. It is not necessary to test all of them each year, but a robust plan should test all likely scenarios over time. A well-tested plan will function even during an unexpected disaster. Common disaster scenarios include the following:

  • Data breach
  • Data loss
  • Power outage or loss of other utilities
  • Network failure
  • Environmental (e.g. fire, flooding, tornado, hurricane, or earthquake)
  • Civil unrest or terrorism
  • Pandemics

The plan should test likely scenarios and can be tested in a number of ways. The maturity of the plan and the people implementing the plan will determine the type of testing that takes place. There are a variety of standard test methods.

Tests should be both scheduled and, eventually, a surprise. Particularly for new plans and organizations immature in the BCP/DRP space, a scheduled test ensures that key personnel are available and begins the maturation process. Tests that have the potential of being very disruptive should also be scheduled to minimize disruption.

Eventually, some tests should be unexpected. Surprise tests do not mean unscheduled. Instead, only high-level approval and a few key individuals are aware of an upcoming test so that the organization can be tested in more realistic ways. The high-level approval is essential in case some amount of unexpected business disruption occurs. Executive approval includes a review of potential disruption and benefits so that a business decision can be made by those responsible for the business. Key personnel who are part of the test planning can be positioned in advance of the test in order to monitor performance metrics.

The simplest test method is the tabletop. A tabletop is usually performed in a conference room or other location around the “table.” Key personnel are presented with a scenario and then work through the plan verbally. This is usually the first step for a new plan. It identifies missing pieces in the plan or steps that need greater detail. The next step may be a walk-through where key personnel move to the appropriate locations and verbally verify the steps, sometimes even performing some parts of the plan. While developing a plan, regular tabletops and walk-throughs can help flesh out a more robust plan. A tabletop or walk-through can also be useful for new members of the team to assist them in identifying their responsibilities in the plan.

A more substantial test is a simulation. Like a fire drill or a shelter-in-place activity, a disaster is simulated, and the plan is exercised while normal operations continue. More robust and detailed than a walk-through, a simulation has the people responsible for certain steps act out their response to a disaster.

The next level of testing is a parallel test. Care must be taken not to disrupt normal business operations if possible. In a parallel test, key personnel and workforce members perform the steps needed in case of a disaster. More than simulating the steps, they actually perform the steps to ensure that they can accomplish the critical business processes if the existing systems were disrupted by a disaster. It is parallel because the critical systems continue to run, and some or all of the data is also run at the alternate site. It is then possible to compare the results of the alternate methods to the critical systems to determine any gaps in capabilities.

The most robust level of testing is a full cutover test. In this test, the disaster is simulated in full. The primary system is disconnected, and the business attempts to run critical functions. This is a high-risk test, as the critical functions may fail. Only the most mature organizations and plans should attempt a full cutover test.

SUMMARY

Moving to the cloud is a business decision. Moving to the cloud changes capital expense to operational expense and can add high availability and resiliency at a price that is attractive to businesses of all types. It also provides capabilities that many businesses could not otherwise obtain. To achieve these benefits, a customer must have an understanding of the key infrastructure pieces of a cloud environment. The customer must also understand the shared security responsibility between the CSP and the customer. Finally, a customer must understand how to properly configure and use the cloud resources they purchase to ensure appropriate controls are in place to secure their processes and data.
