Chapter 7

Domain 5: Cloud Security Operations

IN THIS CHAPTER

  • Identifying security configuration requirements for your hardware
  • Learning how to manage virtualized environments
  • Understanding how to securely operate and manage physical and logical resources
  • Aligning with operational controls and standards
  • Conducting digital forensics investigations in the cloud
  • Communicating important details to stakeholders
  • Learning how to run security operations for your environment

In this chapter, you focus on securing the physical and logical cloud infrastructure from beginning to end, which includes everything from system hardening to a discussion of how tried and true security technologies like firewalls fit into cloud environments. Domain 5: Cloud Security Operations covers a lot of the tactical components of cloud security and represents 17 percent of the CCSP certification exam.

Cloud security operations is all about the day-to-day tasks necessary for you to protect the confidentiality, integrity, and availability of cloud systems and the data that reside on them. In this domain, I cover planning, designing, implementing, and operating your cloud’s physical and logical infrastructure. I also explore various operational controls and standards and discuss how to communicate with relevant external parties.

Implementing and Building a Physical and Logical Infrastructure for a Cloud Environment

Securely operating a cloud environment begins with implementing and building your physical and logical infrastructure with security in mind. This step starts with careful planning and implementation of physical components, such as servers, networking equipment, and storage devices, and also requires thorough consideration of the logical infrastructure aspects that separate cloud environments from traditional data centers.

Hardware-specific security configuration requirements

As with traditional data centers, a cloud data center’s infrastructure begins with its physical hardware — all the servers, networking devices, and storage components that you’d find in the on-premises world. While these environments have many similarities, there are some unique points to consider when implementing and building a cloud infrastructure. There are many types of hardware components, each with countless combinations of configuration requirements and settings. This variety becomes even more pronounced in a cloud environment due to the sheer size of cloud infrastructures and the often diverse set of hardware devices that need to be configured and secured.

Basic Input Output System (BIOS)

Basic Input Output System, or BIOS, is firmware that initializes and tests computer hardware before booting the operating system. BIOS is the first piece of code to run on a machine when it is first powered on. All physical computing devices, including those used in cloud infrastructures, have BIOS settings that configure the hardware and manage the data flow between the device’s OS and other components.

Because it is the first code a computer runs, the BIOS is incredibly powerful and important to protect. A successful compromise of the BIOS can be difficult to detect and can have a major impact on your system’s security. In cloud environments, the BIOS can impact the functioning and security of the hypervisor, Trusted Platform Module (discussed in the next section), and other critical virtualization hardware. It’s important that the CSP tightly restrict access to BIOS settings and prevent unauthorized parties from viewing or modifying them. The principle of least privilege should be followed, and only a select few privileged CSP personnel should be able to access the BIOS of any system. Even so, this limited access should be closely monitored for any malicious misuse or accidental damage. Each hardware vendor has its own methods for BIOS management, so CSP security personnel should check with the vendor for recommended BIOS security settings and general physical device management practices.

I should note that most modern computing devices replace the traditional BIOS with its successor, UEFI. Unified Extensible Firmware Interface (UEFI) is a backwards-compatible specification that improves upon legacy BIOS functionality and security. UEFI can theoretically address hard drives with capacities up to 9.4 zettabytes (1 zettabyte is equal to 1 billion terabytes — so, yeah, that’s a lot), supports disks with more than four partitions, enables faster booting, and offers other modern computing features. Despite these improvements, most systems that run UEFI still refer to its configuration as BIOS settings.

Virtualization and trusted platform module (TPM)

A trusted platform module (TPM) is a microcontroller (computer chip) that is designed to provide hardware-based security functions to a system. A TPM is a secure crypto processor, which is a dedicated chip that carries out cryptographic operations. In other words, a TPM securely stores credentials in the form of passwords, certificates, or encryption keys and can be used to ensure that a platform or system is what it claims to be (in other words, it provides authentication). The chip is physically secured to make it tamper-resistant and logically secured to protect against software interfering with its security functions. By being hardware-based and physically secured, a TPM can help protect against compromised operating systems, privileged account misuse, and other software-based vulnerabilities.

Tip Trusted platform module was standardized by the International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) in 2009 as ISO/IEC 11889. Refer to that standard for additional architectural and security information related to TPMs.

A trusted platform module can be used for a variety of security-related functions. Some common uses include

  • Platform integrity: A TPM performs checks to ensure that a platform (whatever computing device you’ve secured) boots into a known-good state and is behaving as intended. When a platform is powered on, the TPM validates that a trusted combination of hardware and software is in place before fully booting the OS. UEFI firmware uses this feature to establish what’s known as a root of trust (see the sketch after this list).
  • Drive encryption: Full disk encryption (FDE) tools like BitLocker leverage this functionality to protect disk encryption keys. When encrypting a drive or disk, you can seal the encryption keys within a TPM. Doing so ensures that the drive cannot be decrypted unless the operating system has securely booted into a known-good state. In other words, if an attacker compromises your OS, they cannot regain control of your system after a reboot because the compromised OS will not pass the TPM’s integrity checks.
  • Password protection: Passwords that are stored and authenticated using only software mechanisms are prone to dictionary attacks and other software-based vulnerabilities. Storing passwords in a TPM protects them from such attacks and enables more secure authentication.
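To make the platform-integrity bullet more concrete, the following minimal Python sketch illustrates the PCR extend operation that a TPM performs during measured boot: each new measurement is hashed together with the current register value, so the final value depends on every component measured along the way. The component names here are hypothetical placeholders; a real TPM performs this hashing in hardware, with the firmware and bootloader supplying the digests.

import hashlib

def pcr_extend(pcr_value: bytes, measurement: bytes) -> bytes:
    # Extend a PCR: new value = SHA-256(old value || measurement)
    return hashlib.sha256(pcr_value + measurement).digest()

# A PCR in the SHA-256 bank starts out as all zeros at power-on.
pcr = bytes(32)

# Hypothetical boot-chain measurements; in reality, digests of the firmware,
# bootloader, and OS loader are extended in sequence.
for component in (b"firmware-image", b"bootloader", b"os-loader"):
    pcr = pcr_extend(pcr, hashlib.sha256(component).digest())

# Secrets sealed to this PCR (like a disk encryption key) are released only
# if the final value matches the known-good value recorded at sealing time.
print(pcr.hex())

If any single measurement changes (say, a tampered bootloader), the final PCR value changes, and the TPM refuses to unseal anything bound to the expected value.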

You may be thinking that the concept of a trusted platform module sounds amazing, but maybe you don’t quite see how it can help secure cloud environments or cloud customers. Well, for one, many CSPs leverage TPMs to protect their critical infrastructure from unauthorized changes and to ensure that their complex infrastructures maintain a high level of integrity as things rapidly change. Even more exciting, many CSPs now offer customers additional security through virtual trusted platform modules, which perform the same functions as hardware-based TPMs using software.

I introduce the topic of virtualization and discuss hypervisors in Chapter 3. As you learn throughout this book, cloud computing’s reliance on virtualized hardware often requires security features that are similar to traditional IT, but slightly nuanced and tailored to cloud environments. A virtual TPM (vTPM) is provided by the hypervisor and brings the security goodness of physical TPMs to virtual machines and guest operating systems. vTPMs rely on the hypervisor to provide an isolated environment that is segregated from the actual virtual machines in order to secure the TPM code from the VM’s OS and applications. When properly implemented, a vTPM is able to provide the same level of security as a hardware or firmware-based TPM.

While a physical TPM is able to store its secrets in what’s called nonvolatile secure storage (basically a hardware-based storage unit), a virtual TPM does not have that luxury. Instead, when a virtualized component writes secrets to a vTPM, those secrets are actually stored in a file that is separately encrypted at rest, outside of the virtual machine, to provide its own form of tamper resistance. Additionally, CSPs must encrypt this sensitive information in transit between machines to ensure comprehensive protection of sensitive vTPM data.

Storage controllers

A storage controller (also sometimes called a disk array controller) is a device that manages and controls storage arrays. It virtually integrates multiple physical storage units into a single logical pool of storage and dynamically determines where to write and read data.

Many storage controllers are equipped with security features, such as controller-based encryption. With this technology, the storage controller encrypts data at the controller level before it is sent to disk, which helps protect the confidentiality and integrity of data that moves across the storage controller. Available security features and options vary depending on the vendor of your particular storage controller. Cloud security professionals should evaluate all vendor-specific security features and determine which are suitable for their cloud environment. It’s also important to consult vendor documentation for recommended best practices in implementing and configuring storage controllers securely.

For virtualized environments, storage controllers and their associated traffic should be segregated on their own network on the data plane, which allows designated security controls to be applied to the highly sensitive operations of reading, writing, and controlling data flow. Whenever storage controllers offer the built-in encryption functionality discussed earlier in this section, you should use it. When built-in encryption isn’t supported, it becomes important to use other compensating security controls (like network segregation) to protect data in transit through the storage controller.

Storage controllers may operate with one of several interfaces and protocols. Cloud environments often rely on the iSCSI protocol for virtualized, network-based storage. iSCSI stands for Internet Small Computer Systems Interface and is an IP-based storage standard that enables the use of SCSI over TCP/IP networks. iSCSI essentially enables block-level data storage over LANs, WANs, or the public Internet — giving virtualized storage the look and feel of physical storage via Storage Area Networks. iSCSI supports security features such as the Challenge-Handshake Authentication Protocol (CHAP) and Kerberos (for authentication) and IPsec (for encrypted communications). It’s important that cloud security professionals use these and other security mechanisms to secure storage controllers and their associated protocols.
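For a sense of how CHAP protects iSCSI authentication, here’s a small conceptual Python sketch of the CHAP response calculation defined in RFC 1994: the responder hashes the identifier, the shared secret, and a random challenge, so the secret itself never crosses the network. The values below are made-up placeholders; in practice, the iSCSI initiator and target handle this exchange for you.

import hashlib
import os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    # CHAP (RFC 1994): response = MD5(identifier || shared secret || challenge)
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

# Hypothetical values; a real iSCSI target generates the challenge, and the
# shared secret is configured on both the initiator and the target.
identifier = 1
secret = b"example-chap-secret"
challenge = os.urandom(16)

# The initiator returns this digest; the target computes the same value and
# compares the two, so the secret is never transmitted.
print(chap_response(identifier, secret, challenge).hex())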

Network controllers

A network controller is a centralized point of control used to configure, manage, and monitor a physical and virtual network infrastructure. A network controller assists in automated management of vast networks and eliminates the need for manual configuration of network components. In cloud environments, network controllers are an integral part of Software-Defined Networking (SDN), which I discuss in Chapter 5.

Installing and configuring virtualization management tools

From hypervisors to Software-Defined Networks, and everything in between, virtualization is the backbone of cloud infrastructures. Virtualization management tools interface with virtualized components as well as the underlying physical hardware to oversee and manage the operations of the virtualized environment. With such a heavy reliance on virtualization, cloud security professionals must ensure that careful attention is given to securely installing and configuring these management tools. The tools that manage the hypervisor and management plane are incredibly sensitive and must be protected at all costs! Successful attacks against virtualization management tools can compromise the confidentiality and integrity of cloud user data or even jeopardize the availability of the entire infrastructure.

There are many types of virtualization technologies, and even more virtualization vendors that supply them. Each vendor generally has its own tools and guidelines for managing and securing their products. As with any third-party product, you should first consult your particular virtualization vendor’s documentation and support team for installation and configuration guidance. In my experience, the best cloud providers form strong relationships with their vendors, and I’d recommend that any organization with a highly virtualized environment establish similarly strong relationships with their vendors. If possible, you should form these relationships early in the development of your cloud solution and leverage your vendors’ guidance throughout the entire development lifecycle.

Virtualization management tools, when properly configured, provide you with greater awareness and automation of your virtualized environment:

  • Awareness: Configure your management tools to capture a unified view of your entire virtualized environment and all associated components. Using virtualization management tools provides the unique ability to see and understand your whole environment at a glance.
  • Automation: Configure your virtualization management tools to automate standard infrastructure management tasks, including workload management, performance monitoring, and resource optimization. These tools can be used to automate the less strategic tasks so that your IT resources can focus on more impactful and innovative projects.

Aside from properly installing and configuring virtualization management tools, it is crucial that you take a comprehensive approach to the security of these tools. Vendor guidelines go a long way toward ensuring a secure installation and configuration — but that’s only the beginning. Securing these tools generally follows the same best practices you’d apply to any highly sensitive system.

Starting with our friend, the principle of least privilege, you should ensure that the fewest people possible have access to your virtualization management tools. Such high-impact access should be limited to personnel with an absolute business need, and those personnel should be sufficiently background checked to help ensure their suitability for such roles. You may want to rethink giving that intern access to your virtualization management tools (sorry, interns!). For users with a proper business need, role-based access control (RBAC) and strong authentication mechanisms are a must. Use multifactor authentication. Period.

Network isolation is a major control in safeguarding your virtualization management tools, and all virtualization management should occur on its own segregated management network. Segregating your critical management tools from customer environments ensures that vulnerabilities in a tenant environment cannot affect other tenants (by pivoting through the management tools) or even impact your entire infrastructure. I discuss network management further in Chapter 5.

In addition to these preventative mechanisms, make sure that you thoroughly log and monitor all access to these management tools, including logging access denials, which may alert you to brute force attempts or other foul play.

Virtual hardware-specific security configuration requirements

Configuring virtual hardware requires awareness and consideration of the underlying physical hardware. For example, you’d have a problem if you had a physical host with 100TB of storage, but you configured your virtual hosts to use a total of 150TB of storage. This trivial example demonstrates just how in sync the virtual and physical worlds must be. Each vendor generally documents a set of required and recommended configuration settings for their virtualization products. As the cloud provider goes through the process of generating and configuring virtual images for customer use, it’s important that they consider those requirements in order to ensure functional and secure management of virtual resources for all cloud customers.

Network

In a cloud environment, physical networking devices and a virtualized networking infrastructure support multitenancy. While physical network security controls (discussed in the next section) are important, they are not sufficient to protect your virtual hardware and the data that moves across your virtual networks. Appropriate virtual network security requires that you have designated virtualized tools — virtual firewalls, network IDS/IPS, logging systems, and so on — in place to monitor and manage your virtualized network hardware. You should use tools that are purpose-built for virtualized networks and configure them to have proper awareness of relevant underlying physical components (physical switches, for example). A strategy that includes both physical and virtual network security provides comprehensive data protection at the network layer.

Storage

Secure configuration of virtual storage hardware again starts with adhering to vendor-specific recommendations and configuration guidelines. In addition, you should follow many of the security practices throughout this chapter and the rest of the book: Ensure strong credentials (and no default credentials) are in use, apply MFA for privileged access to virtualized storage clusters, disable unnecessary protocols and APIs, and make sure that you log and monitor all access to your virtualized storage hardware.

Installing guest operating system virtualization toolsets

Most of this section is focused on the virtualized hardware that serves as the underlying host for tenant VMs. The other side of this coin focuses on the virtualization toolsets that enable the guest OS to be installed and run from a VM; this toolset is essentially a set of technologies and procedures provided by an OS vendor that allow it to run in a virtualized environment rather than directly on a physical machine.

Virtualized environments can run any type of OS that provides a virtualization toolset (most popular operating systems do). Therefore, the variety of operating systems available to cloud customers is limited mainly by the cloud provider and what they’ve configured. When a CSP chooses to support a given guest OS, they should carefully configure and manage the associated virtualization toolsets in accordance with the OS vendor’s guidelines and recommendations.

Operating Physical and Logical Infrastructure for a Cloud Environment

After you securely implement and build your physical and logical cloud environment, your attention must turn to securely operating that environment throughout its lifetime. Operating a physical and logical cloud environment involves implementing access control mechanisms, securing physical and virtual network configurations, building and deploying hardened OS baselines, and ensuring the availability of all physical and virtual hosts and resources. Many of these guidelines will be vendor-specific, but the specific characteristics of your overall cloud environment must be considered as well.

Configuring access control for local and remote access

When thinking about access control for cloud environments, a lot of attention is usually given to controlling access to customer environments and data, and rightfully so; that’s important stuff. While much of this book is geared toward that user-centric access control, I use this section to focus on access control for the underlying cloud infrastructure. Many of the access control principles from traditional data center security apply here — things like limiting physical and logical access to sensitive data, physically securing buildings and rooms that store critical components, conducting personnel screening for data center employees, and so on.

There are three primary mechanisms to consider when securing local and remote access:

  • Keyboard video mouse (KVM)
  • Console-based access
  • Remote desktop protocol (RDP)

All of these access methods should require MFA and thorough logging and monitoring, at a minimum. You explore additional configuration recommendations in the following sections.

Tip I discuss the principle of least privilege throughout this book. A pair of fairly new concepts, Just-in-Time (JIT) and Just-Enough-Access (JEA), should be used to ensure that only the right amount of access is issued, and only at the right times. Tools like CyberArk and other Privileged Access Management systems can be used to enforce these concepts and ensure true least privilege.

Secure keyboard video mouse (KVM)

A KVM switch (KVM, for short) is an input/output device that allows a user to access and control multiple computers from a single keyboard, video display, and mouse. These devices provide physical access to devices that can be located in multiple, separate physical locations.

In cloud environments, where a single administrator may access multiple devices of different security levels, a secure KVM can be used; secure KVM switches are designed to ensure data isolation between the systems connected to it.

Some basic requirements and features of a secure KVM include

  • Isolated data channels: A secure KVM must make it impossible for data to be transferred between its connected devices — one KVM-controlled server should not be able to access another server’s data through the KVM.
  • Restricted I/O function: A secure KVM restricts USB and other input/output functionality to keyboards and mice. Other I/O devices like USB storage drives should be actively detected and prohibited in order to prevent insecure or unauthorized data transfer.
  • Pushbutton control: A secure KVM should require that a physical button be pressed in order to switch between connected devices. Pushbutton control requires physical access to the KVM and helps prevent remote compromises.
  • Locked firmware: The firmware of a KVM should be impossible to reprogram. Failure to securely lock KVM firmware can lead to an attacker altering the logic or operation of the KVM.
  • Tamper-proof components: The components within the KVM should be securely bonded/soldered to the circuit board in order to prevent component alteration or removal.
  • Physical security and intrusion detection: The KVM housing should be physically secured in a manner that makes it difficult to open. Tamper-evident labels should be installed along the device’s enclosure to provide clear visual evidence if it has been opened or tampered with. For added security, a secure KVM should become inoperable if it has been successfully opened.

Console-based access mechanisms

Regardless of vendor, every hypervisor implementation offers console-based access to configure and manage VMs. Compromise of console-based access can allow a malicious actor to achieve dangerous levels of control over the hypervisor and all hosts that reside on it. As a result, this type of access must be thoroughly protected just as you must secure any other form of hypervisor access. Hypervisors typically come with vendor-specific mechanisms to limit and control console-based access. It’s very important that these mechanisms be fully implemented, monitored, and periodically audited for proper functionality.

Remote desktop protocol (RDP)

Remote desktop protocol (RDP) was developed by Microsoft to allow a user to connect to a remote computer (for example, over a network) and interact with that computer via a graphical interface. RDP requires that the user employ RDP client software, while the remote computer must run RDP server software. Though developed by Microsoft, RDP client and server software are available for Windows, macOS, and various flavors of Linux and Unix.

As with any remote access method, RDP must be properly secured and monitored, especially when used to access critical systems. Many versions and configurations of RDP are prime targets for man-in-the-middle, remote code execution, and various other attacks. Ensure that you are actively scanning and patching all systems and make sure to keep an eye out for critical RDP vulnerabilities and patches. RDP should be configured with role-based access controls to limit privileged access, and you should carefully log and monitor all RDP access in your environment.

Warning Remote desktop protocol is considered an insecure protocol and should only be used within closed or private networks. Do not expose RDP over the Internet without, at a minimum, VPN protection.
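As a quick sanity check, a script like the following minimal Python sketch can confirm whether the RDP port is reachable from an untrusted vantage point. The hostname is a placeholder, and a real assessment would use proper scanning and vulnerability management tools rather than a one-off socket test.

import socket

def rdp_reachable(host: str, port: int = 3389, timeout: float = 3.0) -> bool:
    # Return True if a TCP connection to the RDP port succeeds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder hostname; run this from outside your network to confirm that
# RDP is NOT exposed to the Internet.
if rdp_reachable("server.example.com"):
    print("WARNING: RDP port 3389 is reachable from this network")
else:
    print("RDP port 3389 is not reachable from this network")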

Secure network configuration

Secure network configuration in the cloud involves the use and configuration of several technologies, protocols, and services. VLANs, TLS, DHCP, DNS, and VPN are some of the key concepts you must understand for the CCSP exam and for real-world applications.

Virtual Local Area Networks (VLANs)

A Virtual Local Area Network, or VLAN, is a set of servers and other devices within a LAN that are logically segmented to communicate with each other as if they were physically isolated on a separate LAN. Network isolation and segregation are two of the most pivotal concepts in securing cloud environments. When physical isolation isn’t feasible (as it typically isn’t in the cloud), cloud providers and customers rely on VLANs to enable virtual network segregation. A VLAN acts like a physical LAN, but allows devices to be grouped together regardless of what switch they’re on; as such, you can use a VLAN to virtually group devices that are physically separate, as well as to virtually segregate devices that are physically located on the same switch. When properly configured, VLANs help cloud providers and users logically separate systems with different security and management needs. Common uses of VLANs include, but are not limited to

  • Separating production versus development environments
  • Separating tenant environments from one another
  • Separating different zones, planes, or tiers within your environment (management, application, and data tiers, for example)
  • Segmenting storage controllers from the rest of your environment

Technical stuff In terms of the OSI model, VLANs are a layer 2 (or data link layer) technology.

Transport Layer Security (TLS)

I introduce TLS in Chapter 2, and it’s a pivotal cryptographic protocol that you read about throughout this book. I don’t go into exhaustive detail here, but this protocol is worth examining a bit further.

The TLS protocol replaced the long-used SSL (Secure Sockets Layer) protocol, which has since been deprecated. As such, TLS is now the de facto standard for encrypting network traffic. TLS is made up of two layers:

  • TLS record protocol: Provides the actual secure communications method for data in transit. Ensures data confidentiality by encrypting the connection with symmetric key encryption and validates data integrity through the use of hashes and checksums. Used to encapsulate various upper layer protocols, most notably the TLS handshake.
  • TLS handshake protocol: Allows two parties (like a client and server) to authenticate each other and negotiate the parameters of the TLS connection, including the session ID and encryption algorithms to be used. The entire handshake process occurs before any secure data is transmitted.

As of this writing, TLS 1.3, released in August 2018, is the most current TLS specification.
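To see both TLS layers in action, here’s a minimal Python sketch using the standard library’s ssl module: the handshake runs when the socket is wrapped, and the record protocol then protects everything sent over the connection. The hostname is a placeholder, and the minimum-version setting assumes your interpreter and the server both support modern TLS.

import socket
import ssl

# Build a client-side context with certificate verification enabled (the
# default for create_default_context) and legacy protocol versions refused.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2

hostname = "www.example.com"  # placeholder host
with socket.create_connection((hostname, 443)) as sock:
    # The TLS handshake protocol runs here: authentication and negotiation.
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        # The TLS record protocol now encrypts all data on this socket.
        print(tls.version())  # for example, 'TLSv1.3'
        print(tls.cipher())   # the negotiated cipher suite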

Dynamic Host Configuration Protocol (DHCP)

The Dynamic Host Configuration Protocol is a network technology that doesn’t always get enough love. Simply put, DHCP is a protocol that assigns IP addresses, subnet masks, and other network parameters to each device on a network and manages those assignments over time. Without this technology, administrators must manually configure network settings on every device under their control. It’s easy to imagine how such manual network configuration would be a challenge in large, distributed cloud environments.

Whereas DHCP was commonly frowned upon in traditional data center environments, it is really a cloud network’s best friend and can help securely automate orchestration and configuration of networks. Because DHCP allows central configuration of critical network parameters, it’s essential that all DHCP servers are tightly secured and controlled. Failure to adequately protect these systems can allow a hacker to manipulate IP addresses and network settings to redirect traffic to malicious or compromised destinations.

Domain Name System (DNS) and DNSSEC

DNS, or Domain Name System, is arguably one of the parent technologies of the World Wide Web, and it’s what allows you to enter www.[pickyourfavoritewebsite].com instead of a hard-to-remember numeric IP address every single time. DNS is a decentralized naming system that translates domain names (like website addresses) to their IP addresses, and back.

Again, with great power, comes great responsibility. Protecting DNS is paramount in any environment, but particularly in cloud environments that tend to touch so much of the Internet. Hardening your DNS servers is a must, and DNSSEC is a technology that can help prevent DNS hijacking, DNS spoofing, and man-in-the-middle attacks.

DNSSEC is a set of security extensions to standard DNS that support the validation of the integrity and authenticity of DNS data. In other words, DNSSEC provides DNS clients with cryptographic signatures that validate the origin and integrity of DNS responses.

Tip The primary attacks against DNS target availability and integrity. DNSSEC is great for ensuring DNS integrity, but does nothing to protect the availability of DNS resources. Standard Denial of Service and other availability controls must be used to mitigate availability attacks on DNS services.

As with many Internet protocols, DNS wasn’t built with security in mind. As such, it has several known design flaws and attack vectors that you must protect against. In addition to DNSSEC, DNS infrastructures should be over-provisioned to support far greater requests than expected — this helps protect against volume-based attacks that can compromise availability.
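If you want to check whether a resolver is validating DNSSEC for a given name, the following Python sketch (using the third-party dnspython package) sets the DNSSEC OK bit and inspects the AD (Authenticated Data) flag in the response. The resolver address is just an example of a validating resolver, and keep in mind that the AD flag is only meaningful if you trust the network path to that resolver; full validation would verify the RRSIG chain yourself.

# Requires the dnspython package (pip install dnspython).
import dns.flags
import dns.message
import dns.query
import dns.rdatatype

# Ask for DNSSEC records when querying a validating resolver.
query = dns.message.make_query("example.com", dns.rdatatype.A, want_dnssec=True)
response = dns.query.udp(query, "8.8.8.8", timeout=5)  # example resolver

# The AD flag indicates the resolver validated the response with DNSSEC.
if response.flags & dns.flags.AD:
    print("Response was DNSSEC-validated by the resolver")
else:
    print("Response was NOT validated")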

Virtual private network (VPN)

A virtual private network, or VPN, allows a private network to be securely extended over a public network (like the Internet) by creating a point-to-point connection between the private network and a device that sits outside that network. VPN is commonly used in organizations that allow employees to telework (work from home or outside the office) and is also recommended when connecting to the Internet from untrusted hotspots, like the Wi-Fi in your favorite coffee shop. With a VPN, users are able to directly connect into a private network through a secure tunnel and experience that network just as if they were physically sitting with the rest of the devices on it. VPNs subject the remote device to all the same policies, monitoring, and restrictions as all other devices. It’s a must-use technology whenever connecting to a sensitive network over the Internet or other untrusted network.

Hardening the operating system through the application of baselines

When securing any physical computing device, one of the first things you should focus on is its underlying operating system. Creating and applying baselines is a standard procedure for securing operating systems in traditional IT and cloud environments alike; it involves establishing and enforcing known good states of system configuration. OS baselines should be developed with least privilege in mind and configured to allow only services, features, and access methods that are absolutely required for system functionality — turn off and disable any unnecessary services, close unused ports, and disable access routes that aren’t required.

Depending on your business requirements and technology implementations, your organization may develop a single baseline for each OS in your environment, or you may find it necessary to develop multiple baselines to cover specific use-cases. For example, it is very common to see a Windows baseline for webservers, another baseline for Windows-based database servers, and another one for employee desktops — each separate baseline must have its own policy that identifies what services and functions are required and which are prohibited.

No matter which OS you’re securing, the process for creating baselines generally starts with taking a fresh OS install and disabling all unnecessary ports, protocols, and services. If nobody will be browsing the Internet, then go ahead and remove that vulnerability-prone browser. If only HTTPS traffic is allowed, then consider disabling port 80. Anything and everything that is not needed for system functionality should be shut down, deleted, and disabled. Doing so reduces the attack surface of not just one device, but every single device that gets this baseline configuration. Aside from this generic process, specific OS vendors have their own tools and best practices for baseline configuration and deployment. I explore Windows, Linux, and VMware baseline management in the following sections.
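As a small illustration of enforcing that “close unused ports” rule, the following Python sketch (using the third-party psutil package) compares a host’s listening TCP ports against an approved allowlist. The allowed ports are hypothetical; a production baseline scanner would check services, packages, and configuration settings as well.

# Requires the psutil package (pip install psutil).
import psutil

# Hypothetical baseline: the only ports this server class may listen on.
ALLOWED_PORTS = {22, 443}

def audit_listening_ports(allowed):
    # Return any listening TCP ports that aren't in the approved baseline.
    listening = {
        conn.laddr.port
        for conn in psutil.net_connections(kind="tcp")
        if conn.status == psutil.CONN_LISTEN
    }
    return sorted(listening - allowed)

unexpected = audit_listening_ports(ALLOWED_PORTS)
if unexpected:
    print(f"Baseline violation: unexpected listening ports {unexpected}")
else:
    print("All listening ports match the baseline")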

Tip In addition to applying security best practices and ensuring least privilege, OS baselines can and should be used to enforce any OS-related legal and regulatory compliance obligations.

Windows

Microsoft offers several tools and resources to support baseline configuration and management. Endpoint Configuration Manager, Intune, and Microsoft Deployment Toolkit (MDT) all offer tools and processes for automating and deploying system configurations. Specifically, MDT enables you to deploy and manage system images and can be used to help you manage initial and ongoing security configurations of your baseline images.

Linux

Now, this is a fun one. Linux comes in many different flavors and distributions, and each includes its own set of default utilities and settings. As such, your process for creating secure baselines largely depends on which flavor you’re using. Some Linux variants are offered as bare-bones distros that come with very few processes and utilities; these distros may require minimal hardening, whereas some of the more robust Linux distros may need to be heavily modified and restricted, depending on your particular use-case. All in all, when establishing and applying baselines for Linux, you should first start with general Linux guidelines and best practices before moving onto vendor-specific guidelines and recommendations.

Tip Specific Linux best practices are outside the scope of this section. You should do additional research on common Linux security configurations if your specific implementation is Linux-based.

VMware

VMware is a historical leader in virtualization and cloud computing and has developed its operating systems with remote and automated baseline management features built in. VMware’s vSphere Update Manager (VUM) utility allows you to create and manage host baselines, virtual machine baselines, and virtual appliance baselines all in one. VUM provides insights into your entire vSphere OS deployment and allows you to quickly identify and fix hosts that are not compliant with their respective baseline. The tool can also help ensure that all systems under a given baseline have been patched and received any other security configuration updates.

Availability of standalone hosts

Within a computing environment, a standalone host is a physical machine that operates independently from other machines in the environment. Standalone hosts are commonly used in traditional data center models when it’s desirable to isolate one system from others, whether for security, compliance, performance, or other reasons. Although less common, standalone hosts are still used in cloud environments.

CSPs generally make standalone hosts available to customers who want to streamline their cloud migration. By using standalone hosts, customers are able to maintain their existing system architectures and configurations as they move everything to the cloud. Another great use-case for standalone hosts is providing physical separation between systems or datasets, often for compliance purposes.

Tip The concept of standalone hosts is growing in popularity among CSPs, again with regulatory and compliance requirements being a huge driving factor. AWS offers services like dedicated hosts and dedicated instances, while Google Cloud offers sole-tenant nodes. Be sure to familiarize yourself with each product’s documentation to ensure they meet your technical and compliance needs.

Availability of clustered hosts

A host cluster is a group of hosts that are physically or logically connected in such a way that they work together and function as a single host. Each host’s resources are pooled together and shared among the cluster. Every host within the cluster is set to perform the same task, which is controlled and managed through software. Clustered hosts help protect against software and hardware failures of a single machine and can also be used for load balancing and parallel processing.

Remember Resource-sharing concepts like reservations, limits, and shares help orchestrate and manage resource allocation among tenants using the same host cluster. These topics are discussed in Chapter 5.

Tip Every CSP is different, but more times than not, you’re getting some form of clustered hosts when moving to the cloud. For the CCSP exam, you should know the differences between standalone and clustered hosts and when to use each.

Distributed resource scheduling (DRS)

Distributed resource scheduling (DRS) is a feature that enables clustered environments to automagically distribute workloads across physical hosts in the cluster — and yes, automagically is a word! In cloud environments, DRS helps scale workloads to maintain a healthy balance across physical hosts. DRS allows virtual hosts to move from one physical host to another, with or without the customer knowing (or caring).

Tip The topic of DRS is really interesting to me, but instead of me going on about it here, check out VMware’s DRS documentation at https://www.vmware.com/products/vsphere/drs-dpm.html for a look at one popular implementation of this technology.

Dynamic optimization (DO)

Dynamic optimization (DO) is the automated process of constantly reallocating cloud resources to ensure that no physical host or its resources become overutilized while other resources are available or underutilized. DO technologies use sophisticated algorithms and resource analysis to intelligently optimize resource utilization — reducing waste and cost, while improving service availability and performance. Dynamic optimization is one of the fundamental technologies underlying the cloud principles of autoscaling and rapid elasticity.

Storage clusters

The goal behind storage clustering is closely related to the principles of host clustering. Storage clusters, similar to their server peers, are a logical or physical connection of multiple storage systems in a way that allows them to operate as a single storage unit. In cloud environments, storage clusters support multitenancy, increase performance, and enable high availability (discussed in the following sections).

Maintenance mode

In many cases, a CSP needs to perform maintenance actions like patching or upgrading a server, or replacing a failed drive, for example; these types of actions often render a physical host temporarily unavailable. Maintenance mode allows a provider to gracefully move a tenant’s workloads to another physical host while maintenance is being performed. In ideal situations, maintenance mode occurs automatically and gracefully, with little or no impact to the customer’s ability to access their resources. I’ve seen maintenance mode in action, and the amount of automation that it requires is mind-blowing!

Tip Not all CSPs are created equally, and some have more sophisticated maintenance procedures than others. Ask a potential CSP for details about how system maintenance activity may or may not impact uptime and other SLAs.

High availability

One of the core concepts of cloud computing — and one of the greatest reasons for cloud migration — is high availability; customers expect their cloud services and workloads to be available whenever they need them, and it’s a CSP’s responsibility to make sure that happens. Clustering hosts and other resources (like storage systems, networking components, and so on) is a key approach that cloud providers use to provide their customers with high availability. Just about every component must be built with high availability and elasticity in mind. Aside from clustering, CSPs ensure high availability through redundancy and by replicating customer data across zones and regions.

Availability of guest operating system

In the same way that clustered hosts and storage clusters increase the availability of physical resources, tools and processes to support redundancy increase the availability of guest operating systems. A cloud customer expects their guest OS to be available whenever they need it, and a cloud provider should really live up to that expectation if they want to keep customers in the crowded CSP market. A guest OS is little more than a file on the provider’s physical hardware, and so ensuring its availability starts with protecting that file via standard data protection mechanisms discussed throughout this chapter and the rest of the book.

Managing Physical and Logical Infrastructure for a Cloud Environment

Managing a cloud environment’s physical and logical infrastructure involves many different processes and technologies. To securely manage a cloud environment’s infrastructure, you must consider remote access, OS baselines, patch management, performance and capacity monitoring, network security, and a host of other management actions. In this section, you learn about the critical areas related to managing a physical and logical infrastructure for a cloud environment.

Access controls for remote access

Given the nature of cloud computing, customers and users require remote access to use and manage their logical environment. Remote access to the cloud’s physical environment, however, should be restricted to very few privileged CSP employees. As a CCSP, your job is to ensure that all routes, methods, and mechanisms used for remote access are fully secured and monitored.

Common remote access methods used include

  • Virtual private network (VPN)
  • Remote desktop protocol (RDP)
  • Secure terminal access
  • Secure Shell (SSH)

No matter which remote access methods are deployed and used, you must ensure that the underlying systems supporting remote access are hardened (via OS baselines), vulnerability scanned and patched, and monitored for suspicious activity (using event logging and other controls discussed throughout this book).

Remember Remote access comes with inherent risk, especially privileged access and when traversing the Internet. All remote access should be encrypted with TLS or a similar mechanism, and all privileged remote access should go through a secured system (like a bastion host or jump server). Multifactor authentication should be considered a requirement for all privileged access, and I highly recommend it for remote access, wherever possible. I introduce MFA in Chapters 2 and 5 and provide a deeper look in Chapter 6.
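Here’s a minimal Python sketch (using the third-party paramiko package) of hardened, key-based SSH access to a bastion host with strict host-key checking. The hostname, username, and key path are placeholders, and MFA and session recording would typically be enforced on the bastion itself rather than in this script.

# Requires the paramiko package (pip install paramiko).
import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()  # trust only hosts already in known_hosts
client.set_missing_host_key_policy(paramiko.RejectPolicy())  # no blind trust

# Key-based authentication to a hardened bastion/jump server (placeholders).
client.connect(
    hostname="bastion.example.com",
    username="ops-admin",
    key_filename="/home/ops-admin/.ssh/id_ed25519",
)
stdin, stdout, stderr = client.exec_command("hostname")
print(stdout.read().decode().strip())
client.close()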

Operating system baseline compliance monitoring and remediation

I introduce the concept of OS baselines earlier in this chapter. Applying baselines is an important step toward ensuring consistent and secure operating systems, but it’s far from a set it and forget it process. Once OS baselines are applied, it’s important to ensure that systems remain in compliance with those baselines through continuous monitoring. Compliance monitoring can be accomplished by using configuration management tools with automated baseline scanning functionalities. You load these tools with your approved baselines for each OS and scan your infrastructure to detect any noncompliant systems (a simple drift-check sketch follows this list). In doing so, you are able to spot systems with unexpected configurations, which may occur for several reasons:

  • Your baseline was not applied to all systems, either intentionally or accidentally.
  • A legitimate administrator made an unauthorized change to a system.
  • A malicious outsider gained access to a system and made modifications — for example, discovering an unauthorized open port may be an indication that a hacker has set up backdoor communication.
  • A system has an approved policy exception that the baseline scanner did not account for (for example, port 80 may be disallowed in your Linux baseline, but allowed on a handful of webservers that use HTTP instead of HTTPS).
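The following minimal Python sketch shows the basic idea behind baseline compliance scanning: compare each system’s reported configuration against the approved baseline and flag any drift for investigation. The baseline settings and the reported values are hypothetical; real tools gather this data automatically across your fleet and account for approved exceptions.

# Hypothetical approved baseline for a class of Linux webservers.
approved_baseline = {
    "ssh_root_login": "disabled",
    "password_min_length": 14,
    "port_80_open": False,
}

def find_drift(system_config, baseline):
    # Return settings where the system deviates from the baseline.
    return {
        setting: {"expected": expected, "actual": system_config.get(setting)}
        for setting, expected in baseline.items()
        if system_config.get(setting) != expected
    }

# In practice, a configuration management or scanning tool reports this data.
reported = {"ssh_root_login": "enabled", "password_min_length": 14, "port_80_open": False}

drift = find_drift(reported, approved_baseline)
if drift:
    # Investigate: missed deployment, unauthorized change, approved exception, or compromise?
    print(f"Noncompliant settings: {drift}")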

Patch management

Patch management is a subset of configuration management that includes all processes for finding, testing, and applying software patches (or code changes) to your systems. Patch management includes more than just vulnerability management, although that is a huge component of it. Patches are released by software and firmware vendors and made available for their customers to install; the patches typically provide new functionality, improve stability, and, of course, fix known security vulnerabilities. Every organization should develop a patch management policy that identifies which assets require what type of patches on what schedule. In addition, your organization should create and maintain a set of patch management procedures that define what steps to take in order to acquire relevant patches, validate their appropriateness for your environment, apply them to all assets, and test their effectiveness.

Organizations approach patch management in several ways, but a typical patch management process may look something like the following:

  1. Identify.

    The first step is to identify that a required patch actually exists. This step entails gathering information from all relevant vendors about all outstanding patches for all your software and firmware. In most cases, vendors offer notification mechanisms or other means for customers to quickly identify when new patches are released. This step can be quite tedious for large public clouds with many systems and numerous types of software. All outstanding patch information should be aggregated into a database or patch management tool for consistent, automated, and streamlined deployment.

  2. Acquire.

    After identifying that patches exist for your environment, this next step is to actually obtain those patches by downloading the appropriate executables, scripts, or packages. Some software is able to phone home and download relevant patches directly within the software itself, while other software requires navigating to the vendor’s website or other means. Your patch management procedures should include documented steps to procure patches for each vendor within your environment.

  3. Test.

    Many vendors provide hashes for you to validate the integrity of their patches after downloading — installing a compromised patch can wreak havoc on your systems, and hashes help prevent that (see the hash-check sketch after this list). Regardless of whether you have a hash, you should first test patches in a development environment to ensure that they don’t break your systems or introduce new risks. Once you confirm that a patch causes no harm, you can approve it for deployment/installation.

  4. Deploy.

    Now that you’ve found, acquired, and tested your patches, you can go ahead and deploy them across your production environment. You should consult your patch management policy to determine the appropriate patching schedule. In many cases, organizations implement deployment windows during which patches and system updates can be made. Automation is critical here, especially in large cloud environments. In most cases, cloud infrastructures rely on patching baseline images and redeploying those baselines to impacted systems; patching images removes the need to individually patch thousands of systems and reduces the risk of one-off errors.

  5. Validate.

    This final step requires that you validate that your patches are installed properly and confirm that everything is in working order. In some cases, validation can be as simple as looking for a confirmation message, while in others it can be as involved as running tests to ensure the underlying host is operating as expected.
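To illustrate the integrity check from the Test step, here’s a minimal Python sketch that compares a downloaded patch file’s SHA-256 digest to the hash a vendor publishes. The file name and expected hash are placeholders; substitute the values your vendor actually provides, and remember that a matching hash only proves integrity, not that the patch is safe for your environment.

import hashlib

def verify_patch(path, expected_sha256):
    # Compare the patch file's SHA-256 digest to the vendor-published hash.
    digest = hashlib.sha256()
    with open(path, "rb") as patch_file:
        for chunk in iter(lambda: patch_file.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()

# Placeholder file name and hash value.
if verify_patch("patch-1.2.3.tar.gz", "0123abcd..."):
    print("Integrity verified; promote the patch to the test environment")
else:
    print("Hash mismatch; do NOT deploy this patch")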

Tip I cannot stress enough how important automation is for cloud patch management. Infrastructures with thousands (or even hundreds of thousands) of systems cannot be effectively managed individually or manually. I have seen some very sophisticated automation technologies used to streamline the entire patching lifecycle, and even then, patch management can still be a major headache. Doing your homework early on and developing a robust patch management plan and process can help tremendously.

On the cloud customer side, remember that your patch management responsibilities are dependent upon your particular cloud service model (IaaS, PaaS, or SaaS) and what cloud services you’re using. I’ve seen this be a major point of confusion for cloud customers who incorrectly believe that patching is 100 percent handled by the CSP, even for IaaS services. The cloud provider is completely responsible for patching the underlying infrastructure, but IaaS (and even PaaS customers, to an extent) must be aware of their patching duties. For example, if you’re using AWS EC2, Amazon is responsible for patching the underlying physical hosts that run your EC2 instances, but you are wholly responsible for patching your OS images. Consult your CSP’s documentation for information specific to your implementation.

Performance and capacity monitoring

Performance monitoring is a critical task that every cloud provider must perform on a continuous basis to ensure that systems are running dependably and customer SLAs are being met. Performance monitoring involves routine collection and analysis of performance metrics for key components of the cloud environment. Key components that should be monitored include network, compute, disk, and memory. Most vendors offer performance metrics and recommend techniques for monitoring them, but some things to examine include the following (a simple monitoring sketch follows this list):

  • Network: Latency, dropped packets, and other standard network performance metrics
  • Compute (for example, CPU): Excessive utilization and response time
  • Disk: Read/write times and capacity limits
  • Memory: Excessive usage and capacity limits
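For a taste of what this looks like on a single host, the following Python sketch (using the third-party psutil package) samples the four key metric areas. The alert thresholds are arbitrary examples; in a real cloud environment these readings feed into a centralized monitoring platform rather than a print statement.

# Requires the psutil package (pip install psutil).
import psutil

cpu_percent = psutil.cpu_percent(interval=1)  # compute utilization
memory = psutil.virtual_memory()              # memory usage and capacity
disk = psutil.disk_usage("/")                 # disk capacity on the root volume
net = psutil.net_io_counters()                # cumulative network counters

print(f"CPU: {cpu_percent}%")
print(f"Memory: {memory.percent}% used of {memory.total // (1024 ** 3)} GiB")
print(f"Disk: {disk.percent}% used of {disk.total // (1024 ** 3)} GiB")
print(f"Network: {net.packets_sent} packets sent, {net.dropin} inbound drops")

# Arbitrary example thresholds for raising an alert.
if cpu_percent > 90 or memory.percent > 90 or disk.percent > 85:
    print("ALERT: utilization threshold exceeded; check capacity and SLAs")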

Remember When conducting performance and capacity monitoring, it’s important that CSPs consider their ability to support all current customers as well as their capacity to support autoscaling and future customer growth. Due to rapid elasticity, a cloud provider must have enough capacity to withstand an unexpected spike in resource utilization by one or more customers; capacity planning must take this into account.

Hardware monitoring

When talking about monitoring cloud environments, it’s almost natural to jump straight to the virtual infrastructure and its resources. After all, cloud computing is all about virtualizing network, compute, and storage resources, right?! While that may be true, it’s important not to forget that underneath all of that virtualized infrastructure lies actual physical hardware that must also be monitored. The same key components (network, compute, disk, and memory) must be monitored in the physical plane, just as they are for virtual performance and capacity monitoring. In many cases, you can use the same processes and even similar tools to monitor physical hardware and virtualized infrastructure. As always, make sure to first check with each hardware vendor for recommended monitoring utilities and best practices.

In addition to the standard monitoring mentioned in the previous paragraph, hardware should also be monitored for things like fan speed, device and surrounding temperatures, and even voltage readings. This type of information can help you track the overall health of your hardware and can help you identify overworked or aging hardware before it fails.

Monitoring vast cloud infrastructures is a monumental task. Due to resource pooling, clustering, and other high availability features, it’s often necessary to consider large sets of hardware rather than each machine individually. Consistent with a major theme of this chapter, automation should be used wherever possible to streamline these efforts. Most integrated hardware monitoring utilities provide dashboards with at-a-glance details of your hardware’s health status and any important metrics or alerts.

Configuring host and guest operating system backup and restore functions

Whether in a traditional data center or a cloud environment, backup and restore are some of the most critical security functions. Ensuring that important data is backed up and easily restored, when needed, supports the core information security principle of availability. For cloud infrastructures, it’s essential to configure physical hosts as well as guest operating systems so that important systems and data are available and functional after an incident or disaster.

You should be familiar with three main types of backup technologies:

  • Snapshots
  • Agent-based
  • Agentless

Snapshots

Snapshots are not a true backup, in the purest sense, but I’ll get to that. Simply put, a snapshot is a copy of a virtual machine, its virtual disks, and any settings and configurations associated with the VM. The snapshot is saved to disk as a simple file.

Snapshots are often performed before deploying software patches or other potentially unsafe operations in a virtual machine. If something should fail, snapshots are a great way to roll back to a specific point in time. Snapshots should be differentiated from backups for a couple of reasons. Snapshots are not complete on their own; they rely on the virtual machine’s parent disk and file system, so if the parent disk is deleted or corrupted, a snapshot cannot help you restore your system. In addition, snapshots are saved on the same storage infrastructure as the VM itself. With true backups, it’s important to keep the backup completely separate from the original data; VM snapshots are neither separate from the underlying physical host nor the virtual infrastructure of the VM.

Agent-based backups

Agent-based backup mechanisms were considered the gold standard when virtualization first became popular during the Stone Age (circa 1990s). Agent-based backup requires installing a component (the agent) on every system in your environment; the agent handles the backup duties for you and sends the data over to your desired backup storage. Being installed directly on a system gives agent-based backups an edge when particular systems or applications require direct access to perform complete backups.

The downside, as you’ve probably figured, is that agent-based backups can become unwieldy and hard to manage in large environments (like a public cloud). Enter agentless backups.

Agentless backups

Agentless backups are not truly agentless; instead, they require that a single agent be installed on any given network. This single agent provides centralized, network-wide backup capabilities without you needing to somehow interact with every endpoint. Generally speaking, the more systems you have to back up, the more attractive agentless solutions become. So, for a cloud environment, agentless backup tends to offer greater control, simplified management, and lower cost than agent-based solutions.

It’s not all rosy though. Agentless backup requires a fully virtualized environment, so networks with both physical and virtualized servers require an agent-based or hybrid approach. Further, agentless backup doesn’t work with every system, and so it’s important that you consider your architecture before choosing one backup solution over the other.

Tip Agentless backups generally interact directly with your hypervisor to snapshot and back up your VMs. To use an agentless solution, first make sure that the backup product supports integration with your particular hypervisor.

Network security controls

Securely managing the logical and physical infrastructure for a cloud environment requires security controls at multiple layers. (You can read about defense-in-depth in Chapter 2.) A layered approach to network security relies on several types of technologies, including firewalls, IDS/IPS, and more.

Firewalls

Firewalls are a core network security control that you’re probably familiar with. It’s important that you not only know what they do, but also how they fit into a cloud’s network security stack. Firewalls are hardware or software systems that monitor and control inbound and outbound network traffic. Firewalls rely on customized rules to allow and deny traffic to, from, and within a network or device. Traditional data centers tend to use hardware firewalls, but the heavy reliance on virtualization in cloud environments makes software firewalls (or a mix of the two) more suitable.

Tip Most modern firewall appliances are considered next-gen firewalls that include IDS/IPS, WAF, or other functionality all-in-one. When crafting your network security architecture, treat each of these components as if they are separate devices connected together (like the good ole days) and focus on each component’s individual capabilities.

Firewalls play two main roles in cloud networks: perimeter control and internal segmentation. Cloud customers are generally exposed to firewalls in two ways: virtual firewalls and network Access Control Lists (ACLs). Virtual firewalls are logical equivalents of their physical brethren. Network ACLs allow you to define rules that control traffic into and out of a network without an actual firewall appliance. Most CSPs offer customers a feature called a virtual private cloud, or VPC, which provides a logically isolated network where you can apply network ACLs and virtual firewall rules. By using VPCs and network ACLs, a cloud customer is able to manage their logical infrastructure much as if it were their own data center.
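
To make network ACLs a bit more concrete, here’s a minimal sketch (using boto3 with a placeholder ACL ID, so treat it as illustrative rather than a recipe) that adds an inbound rule allowing HTTPS traffic into a subnet’s network ACL in an AWS VPC.

  import boto3

  ec2 = boto3.client("ec2")

  # Add an inbound rule to an existing network ACL permitting HTTPS from
  # anywhere. The ACL ID is a placeholder value.
  ec2.create_network_acl_entry(
      NetworkAclId="acl-0123456789abcdef0",
      RuleNumber=100,            # ACL rules are evaluated in ascending order
      Protocol="6",              # protocol number 6 = TCP
      RuleAction="allow",
      Egress=False,              # False means this is an inbound (ingress) rule
      CidrBlock="0.0.0.0/0",
      PortRange={"From": 443, "To": 443},
  )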

Intrusion detection systems (IDS) and intrusion prevention systems (IPS)

An intrusion detection system (IDS) is a hardware appliance or software application that monitors networks and systems and alerts designated personnel of any malicious or unauthorized activity. An intrusion prevention system (IPS) performs very similarly, with one major difference: an IPS is designed to actually block suspected attacks, in addition to alerting on them.

Now you may be wondering why anyone would choose an IDS over an IPS — it’s better to block attacks than merely alert on them, right? Well, one thing to keep in mind is that these devices are known for generating a high number of false positives, especially when first installed. Careful configuration and ongoing tuning help reduce false positives, but an IPS is best used only where it won’t create a denial of service for authorized traffic.

Technical stuff Whereas firewalls operate by filtering traffic based on IP addresses, ports, and protocols, IDS and IPS systems use deep packet analysis to analyze network traffic and compare network packets against known traffic patterns or attack signatures. Some IDS/IPS devices can even operate in signature-less fashion and trigger on anomalous activity (for example, unknown activities that smell fishy). IDS/IPS devices are generally placed after a network firewall and act as a second line of defense.
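
To give you a feel for what signature matching means, here’s a deliberately oversimplified Python sketch that scans packet payloads for known attack patterns. Real IDS/IPS engines use far richer rule languages, reassemble traffic streams, and inspect much more than raw payload bytes; the signatures below are purely illustrative.

  # Toy signature-based inspection: flag payloads containing byte patterns
  # associated with well-known attack techniques.
  SIGNATURES = {
      b"/etc/passwd": "Possible path traversal attempt",
      b"' OR '1'='1": "Possible SQL injection attempt",
  }

  def inspect(payload: bytes) -> list:
      """Return alert messages for any signatures found in the payload."""
      return [alert for pattern, alert in SIGNATURES.items() if pattern in payload]

  print(inspect(b"GET /download?file=../../etc/passwd HTTP/1.1"))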

Warning Because they rely on deep packet analysis, IDS/IPS cannot fully examine encrypted network traffic. As a cloud security professional, you must evaluate the tradeoffs between ubiquitous encryption and intrusion monitoring and consider where and when it makes sense to temporarily decrypt data for inspection.

The two categories of IDS are

  • Host IDS (HIDS): This type of IDS operates on a single host and monitors only network traffic that flows into and out of that host. In addition to monitoring a host’s network traffic, HIDS are often able to monitor critical configurations and files on a host and can be configured to alert on suspicious modifications. Similar to other host-based security controls, HIDS are prone to compromise if an attacker gains root-level access on that host. To combat this, HIDS logs should immediately be sent to a remote system (like your centrally managed SIEM), and HIDS configurations and settings should be locked down and managed from a remote system.

    Tip Consider installing a HIDS on your baseline images for your highly sensitive systems. Configure the HIDS to communicate with your SIEM or other centrally managed alerting dashboard. You can then deploy and manage those distributed HIDS in one fell swoop.

  • Network IDS (NIDS): This type of IDS is installed along the network and is able to monitor network traffic across multiple hosts, instead of just one. NIDS are traditionally installed after firewalls, on a network tap. By having broader network visibility, NIDS can identify suspicious trends across hosts that may be too hard to spot on individual machines. The sheer volume of network traffic and the number of alerts these tools can generate is the primary challenge with NIDS. Instead of using NIDS to monitor entire cloud networks, they should be installed in strategic locations near high-value assets, such as your management plane subnet.

Remember In practice, most environments benefit from some combination of both IDS and IPS; for that reason, and because of their structural similarities, the two are often mentioned together. Despite this, you should have a clear understanding of their differences, and you should know the differences between host-based and network-based IDS.

Honeypots

Here’s a treat for you! A honeypot is a decoy system that mimics a sensitive system in order to lure attackers away from the legitimate target. Honeypots are typically designed to mirror your production environment, but they host dummy data instead of actual sensitive information. It’s also common to leave a few known vulnerabilities in order to make a honeypot easier to find and more attractive to hackers. By using a honeypot, you can trap hackers in a fake environment and watch as they attempt to do damage. In doing so, you’re able to learn the attackers’ origins and techniques, which can help you better protect your actual network and systems. Sounds pretty sweet!

Network security groups

Most cloud providers use the notion of a security group, which is basically a network ACL that operates at the VM level rather than the network level. In many cases, security groups may not be as full-featured as network ACLs, but the core functionality is there. A network security group, a feature popularized by Microsoft Azure, effectively combines the concepts of security groups and network ACLs; network security groups allow you to control traffic to and from either an individual virtual machine’s network interface or an entire subnet.
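
As a quick illustration of security groups in practice, here’s a minimal boto3 sketch (with a placeholder group ID) that opens inbound HTTPS from a specific corporate network range to the instances associated with an AWS security group. Azure network security groups use different syntax, but the rule structure is conceptually similar.

  import boto3

  ec2 = boto3.client("ec2")

  # Permit inbound HTTPS from a specific address range to every instance
  # associated with this security group. The group ID is a placeholder.
  ec2.authorize_security_group_ingress(
      GroupId="sg-0123456789abcdef0",
      IpPermissions=[{
          "IpProtocol": "tcp",
          "FromPort": 443,
          "ToPort": 443,
          "IpRanges": [{"CidrIp": "203.0.113.0/24",
                        "Description": "HTTPS from the corporate network"}],
      }],
  )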

Management plane

Managing the physical and logical infrastructure of a cloud environment requires effective task and operations scheduling, efficient orchestration of activities and resources, and comprehensive, yet minimally disruptive maintenance. All of these activities take place in the management plane.

Scheduling

Cloud infrastructures involve managing a large variety of virtualized resources, and cloud providers employ task scheduling to fulfill and manage customer requests related to those resources. Scheduling and resource allocation go hand in hand: Scheduling is the process of taking customer resource requests and prioritizing those requests (or tasks) in such a way that available resources are assigned and utilized most efficiently. When multiple customers request compute power, for example, the CSP’s task scheduler considers the nature of each request (whether it’s for CPU or GPU, for example), the size of the request (e.g. how much compute power is needed), the length of time the resource is needed, and the availability of suitable resources. Based on all of these factors and other metrics, the scheduler’s algorithms determine which customers to supply with the appropriate resources, at what time, and for how long. Proper scheduling enables efficient resource utilization and keeps request pipelines moving smoothly.
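
Real CSP schedulers are enormously sophisticated, but the following toy Python sketch captures the basic idea: queue incoming requests, prioritize them (here, smallest first), and assign them to whatever capacity is available. All of the numbers and customer names are made up.

  import heapq

  available_units = 100   # total compute units free in this (imaginary) pool
  queue = []              # min-heap of (priority, arrival order, request)

  requests = [
      {"customer": "A", "units": 40, "duration_hrs": 2},
      {"customer": "B", "units": 10, "duration_hrs": 8},
      {"customer": "C", "units": 70, "duration_hrs": 1},
  ]

  # Prioritize smaller requests so the pipeline keeps moving.
  for order, req in enumerate(requests):
      heapq.heappush(queue, (req["units"], order, req))

  while queue:
      _, _, req = heapq.heappop(queue)
      if req["units"] <= available_units:
          available_units -= req["units"]
          print(f"Scheduled customer {req['customer']} for "
                f"{req['duration_hrs']} hours; {available_units} units left")
      else:
          print(f"Deferred customer {req['customer']} "
                f"(needs {req['units']} units)")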

Orchestration

Another essential component of cloud management, orchestration is the organization and integration of automated tasks between systems and services in a cloud environment. Cloud orchestration allows you to create a consolidated workflow from multiple interconnected processes. With orchestration, you can automate complex processes such as resource provisioning, allocation, and scaling. Orchestration is what enables cloud providers to deliver highly available, dynamically scaling services to customers with little to no human interaction.

Warning Orchestration is often confused with automation. The two concepts are very much related, but not the same. While automation refers to a single task or process, orchestration is the coordination and management of these automated tasks into a streamlined, automated workflow.
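
Here’s a tiny Python sketch of the difference: each function below is a single automated task, and orchestrate() strings them together into one workflow. The task names are invented for illustration; real orchestration platforms also handle dependencies, retries, and scaling across many systems.

  def provision_vm(name):
      """A single automated task: create a virtual machine."""
      print(f"Provisioning {name}")
      return name

  def apply_baseline(vm):
      """Another automated task: apply the hardened configuration baseline."""
      print(f"Applying security baseline to {vm}")

  def register_monitoring(vm):
      """Yet another automated task: hook the VM into monitoring and backup."""
      print(f"Registering {vm} with monitoring and backup services")

  def orchestrate(name):
      """Orchestration: coordinate the individual tasks into one workflow."""
      vm = provision_vm(name)
      apply_baseline(vm)
      register_monitoring(vm)

  orchestrate("web-server-01")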

Maintenance

Yet another critical component of managing a cloud environment is conducting routine and emergency maintenance of the physical and logical infrastructure. In traditional data center environments, operators serve very few customers — maybe even a single customer, which may be themselves. This type of independence lets data center operations teams easily find the best time to conduct maintenance simply by coordinating with their users. Public cloud infrastructures that serve thousands of customers simply don’t have this luxury. As such, cloud providers need to carefully consider potential customer impact when invoking maintenance mode (discussed earlier), and they should ideally leverage orchestration tools to minimize or even remove this impact altogether. Whenever possible, customer workloads should be (automatically) routed to stable systems while maintenance occurs on physical or logical resources. Anytime a CSP suspects that maintenance might impact customers’ performance or availability, the provider should communicate this to all potentially impacted customers, as soon as possible, with details of the maintenance window and its expected impact.

Implementing Operational Controls and Standards

Many processes need to be managed when it comes to cloud security operations. Fortunately, many of these processes are similar to traditional IT operations, and well-established standards can guide your understanding and application of those processes. ISO/IEC 20000 is a common international Service Management System (SMS) standard that defines requirements for service providers to plan, develop, operate, and continually improve their services. The standard was most recently updated in late 2018, and you can check out part 1 of it (ISO/IEC 20000-1) by visiting https://www.iso.org/standard/70636.html.

Another well-known standard is ITIL, which used to be an acronym for Information Technology Infrastructure Library. ITIL is a framework that establishes a set of detailed IT Service Management (ITSM) practices. You can find more information on ITIL by visiting https://www.axelos.com/best-practice-solutions/itil. The following list is primarily derived from the ITIL framework and aligns with the core practices identified in the CCSP exam outline:

  • Change management
  • Continuity management
  • Information security management
  • Continual service improvement management
  • Incident management
  • Problem management
  • Release and deployment management
  • Configuration management
  • Service level management
  • Availability management
  • Capacity management

The preceding list includes processes from ITIL’s Service Design, Service Transition, and Service Operations stages, and you can find similar information in ISO 20000-1. You can learn about all 11 processes in the following sections.

Change management

Change management is an IT discipline focused on ensuring that organizations employ standardized processes and procedures to make changes to their systems and services. The overall objective of change management is to allow IT changes to be made in a structured and secure manner, while minimizing negative impacts to users.

You may hear the term change management used in project management circles and assume that it’s strictly up to a PM to handle change management. Unfortunately, you’re not quite off the hook. Changes to an IT environment may potentially impact the security posture of that environment, and so every cloud security professional has a responsibility to partner with their organization’s Program Management Office (PMO) to securely manage these changes.

In a cloud environment, some activities that require change management include

  • Buying, building, or deploying new hardware or software
  • Upgrading or modifying existing hardware or software
  • Decommissioning hardware or software
  • Modifying the operating environment (for example, changing data center humidity levels)
  • Modifying IT documentation (yup, this needs to be considered an extension of the system itself and properly controlled)

“What does an IT change management process look like?” you ask. Well… I’m glad you asked! ITIL, ISO, and other frameworks break change management down into a series of steps. These steps may vary from one framework to the next, but in general the process looks something like this (a simple sketch of tracking a change through these states in code follows the steps):

  1. Request a change.

    During this step, a necessary change is requested or somehow otherwise identified. The person or team requesting the change typically creates an RFC (request for change) that documents the details of the request, including such details as the requestor’s contact info, time requirements, business justification, and more.

  2. Evaluate the requested change.

    The RFC is reviewed for completeness, appropriateness, and feasibility. During this step, you should evaluate the change to assess the benefits and risks associated with it. You want to leave this step with a clear understanding of all pros, cons, and impacts to your systems and business if you make the requested change. Who evaluates an RFC depends on the size and scope of the particular change. For larger or more impactful changes, your organization may rely on a Change Advisory Board (CAB) or Change Control Board (CCB) to evaluate the RFC. A CAB usually consists of stakeholders throughout the organization, while CCBs are most often limited to project-level stakeholders. Having an information security representative (like a CCSP) on both the CAB and CCB is important and often overlooked by organizations.

  3. Approve the change.

    Once the RFC has been thoroughly evaluated, a recommendation to authorize the change is sent to the appropriate authority, depending on the request. Approval can be anything from getting a Senior Sys Admin to sign off on replacing a hard drive to requiring a VP to approve a data center expansion.

  4. Plan the change.

    Planning the change again depends on the size and scope of the effort. Smaller changes may require a quick writeup, whereas larger changes call for full project plans. Identify any resources you need and the steps you must follow to complete the change.

  5. Implement the change.

    With a plan in hand, go forth and make it happen!

  6. Review the change.

    During this final step, test that your change was implemented successfully and validate that no unintended consequences have occurred. For larger changes, a post-implementation review (PIR) should be conducted to document the implementation results and any lessons learned.
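
If you like to see process expressed as code, here’s a small, hypothetical Python sketch of an RFC object that walks a change through the states described above and keeps an audit trail along the way. The field names and states are my own simplification, not part of ITIL or ISO.

  from dataclasses import dataclass, field
  from datetime import date

  # Simplified states loosely mirroring the six steps above.
  STATES = ["requested", "evaluated", "approved", "planned",
            "implemented", "reviewed"]

  @dataclass
  class ChangeRequest:
      summary: str
      requestor: str
      business_justification: str
      needed_by: date
      status: str = "requested"
      history: list = field(default_factory=list)

      def advance(self, note=""):
          """Move the RFC to its next state and record when and why."""
          next_index = STATES.index(self.status) + 1
          if next_index >= len(STATES):
              raise ValueError("This change has already been reviewed and closed")
          self.status = STATES[next_index]
          self.history.append((self.status, date.today().isoformat(), note))

  rfc = ChangeRequest("Replace failed disk in rack 12", "j.smith",
                      "Restore storage redundancy", date(2021, 6, 1))
  rfc.advance("Reviewed by CAB; low risk")
  print(rfc.status, rfc.history)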

Continuity management

Continuity management helps ensure that a CSP is able to recover and continue providing service to its customers, even amidst security incidents or during times of crisis.

Continuity management involves conducting a business impact analysis (BIA) to prioritize resources, evaluating the threats and risks to each of those resources, identifying mitigating controls for each risk, and developing a contingency plan that outlines how to address potential scenarios. See Chapter 3 for full coverage of business continuity planning (BCP).

Information security management

Information security management codifies the protection of your environment’s confidentiality, integrity, and availability as part of your overall IT management objectives. Information security management basically takes everything covered in Chapter 2 (and in many places throughout this book) and makes it an official part of your IT service management plan. This practice involves planning, building, and managing security controls that protect your systems and data against security risks. Having information security specifically called out as a management objective ensures that it’s treated as an integral part of all IT decisions.

Remember ISO/IEC 27001, mentioned a couple times in this book, is one of the leading standards for information security management.

Continual service improvement management

Continual service improvement management, a concept borrowed from ISO 20000-1, is the ongoing practice of improving the performance and effectiveness of IT services by collecting data and learning from the past. The goal of this process is to perpetually search for ways to make your systems and services better than they currently are.

Incident management

Incident management is the process of monitoring for, responding to, and minimizing the impact of incidents. In this case, an incident is any event that can lead to a disruption in a service provider’s ability to provide its services to users or customers. I introduce incident handling (the tactical part of incident management) in Chapter 2.

Problem management

Problem management is the process of managing any and all problems that happen or could happen to your IT service. The objective here is to identify potential issues and put processes and systems in place to prevent them from ever occurring. Achieving this objective requires analyzing historical issues and using that insight to predict potential future problems. Problem management does not always succeed at preventing issues, and so the process also involves putting processes and systems in place to minimize the impact of unavoidable problems.

Release and deployment management

The objective of release and deployment management is to plan, schedule, and manage software releases through different phases, including testing in development environments and deployment to a production environment, while maintaining the integrity and security of the production environment. Successful release and deployment management requires collaboration between developers, project managers, the release team, and those responsible for testing. A release management team is generally constructed during the early planning stages and oversees the release through development, testing, troubleshooting, and deployment.

Configuration management

I discuss application configuration management in Chapter 6, but I’m going to take a broader view for a second. Configuration management is the process of tracking and controlling configuration changes to systems and software. Configurations include information about all physical and logical components within the cloud environment, and proper configuration management encompasses all settings, versions, and patch levels for your hardware and software assets. Your organization should have a configuration management plan that defines how you identify, plan, control, change, and maintain all configurations within your environment. Due to the sprawling and dynamic nature of cloud environments, cloud providers and their customers should seek to use automated tools and techniques to streamline the configuration management process.
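
Here’s a tiny, hypothetical example of what that automation can look like: compare each asset’s live settings against an approved baseline and flag any drift. The baseline keys and values are made up; in practice you’d pull them from your configuration management database and from the assets themselves.

  # Approved baseline settings (hypothetical keys and values).
  BASELINE = {"ssh_root_login": "disabled",
              "tls_min_version": "1.2",
              "patch_level": "2021-05"}

  def detect_drift(asset_name, live_config):
      """Report any settings that differ from the approved baseline."""
      drift = {key: (expected, live_config.get(key))
               for key, expected in BASELINE.items()
               if live_config.get(key) != expected}
      if drift:
          print(f"{asset_name}: configuration drift detected -> {drift}")
      else:
          print(f"{asset_name}: matches baseline")

  detect_drift("app-server-07", {"ssh_root_login": "enabled",
                                 "tls_min_version": "1.2",
                                 "patch_level": "2021-04"})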

Service level management

There are few acronyms I hear more than SLA when talking to cloud customers. It’s discussed all over this book, and for good reason — service level agreements are contracts between a CSP and cloud customer that set the customer’s expectations for the minimum level of service they’ll receive. As such, service level management is a huge part of cloud business, and it involves negotiating, developing, and managing all CSP SLAs.

You can read more about SLAs in Chapter 3.

Availability management

Availability management is the process of ensuring that the appropriate people, processes, and systems are in place in order to sustain sufficient service availability. Availability management involves establishing your system’s availability objectives (for example, SLAs) as well as monitoring, measuring, and evaluating your performance against those objectives. An underrated objective of an availability manager (unless you’re a cloud provider) is achieving all these goals as cost-effectively as possible.

Capacity management

Capacity management is the process of ensuring that the required resource capacity exists, at all times, to meet or exceed business and customer needs, as defined in SLAs. Similar to availability management, cost-efficiency is paramount here. While having insufficient capacity can lead to poor service performance and loss of business, overspending on capacity can lead to less profit, lower margins, and potentially unsustainable business. It’s certainly a fine line between provisioning too little capacity and too much, but it’s a line your organization must spend time carefully defining.

Supporting Digital Forensics

Digital forensics, sometimes called computer forensics, is a branch of forensic science that deals with the recovery, preservation, and analysis of digital evidence associated with cybercrimes and computer incidents. Digital forensics is an intersection between law, information technology, and the scientific method.

While it is a fairly established field, the intersection of digital forensics and cloud computing is still unfolding in some ways. It’s important that you, as a CCSP, recognize and understand how the uniqueness of cloud computing shapes the way you conduct digital forensics.

Collecting, acquiring, and preserving digital evidence

Generally speaking, the core principles of digital forensics are the same in the cloud as they are in traditional data center environments — you still must collect digital evidence and conduct thorough analysis on it while maintaining its chain of custody. Due to the fundamental differences between cloud environments and traditional data centers, some forensic issues are unique to cloud computing. Among the most important concerns are data ownership and shared responsibility, virtualization, multitenancy, and data location. In the following sections, I cover how to address these concerns as you collect and manage forensic data in the cloud.

Data ownership and shared responsibility

One of the biggest hurdles in cloud computing is the concept of data ownership and understanding how the shared responsibility model impacts it. In the traditional data center model, organizations have no question about data ownership — my building, my servers, my data! Data ownership can get a bit fuzzy when moving to the cloud, and it presents probably the biggest issue for digital forensics. A customer’s level of ownership varies based on the cloud service model being leveraged.

For IaaS services, customers have the highest level of ownership and control over their resources and data. As an IaaS customer, you have control over your VMs and everything built on them. Generally speaking, this service model provides a high degree of access to event data and logs to support forensic analysis, but customers are still reliant on the CSP for infrastructure logs and forensic data, if required.

For PaaS and especially SaaS services, customers have noticeably less ownership and control over systems, thus yielding less independent access to data required for forensic investigations. In PaaS and SaaS service models, customers must depend on the provider to capture and appropriately preserve most evidence. Some CSPs have policies that they do not share this type of information with any customers, whereas other providers agree to fully support customers with forensic investigations.

Regardless of your service model, collecting relevant information and maintaining chain of custody requires collaboration between a CSP and their customer. As a CCSP, you should be involved in reviewing all cloud contracts and SLAs to determine whether a given provider can support your digital forensics requirements, if ever needed.

Virtualization

At the heart of cloud computing lies the concept of virtualization — CSPs manage a physical environment that lets customers access and manage logical resources. The use of virtual machines with virtual storage, memory, and networking means that you not only need to conduct standard forensic activities on your physical hardware, but must also treat your logical environment as if it were physical, too. For example, you don’t want to shut down VMs before collecting evidence because any valuable information stored in virtual memory would be wiped out. You must pay attention to preserving evidence in your logical infrastructure in order to support your customers’ needs for forensic data.

Multitenancy

In traditional data centers and on-premise environments, organizations have complete confidence that their data is physically separate from anyone else’s. When conducting forensic activities, they can be certain that evidence is not tainted by another organization’s presence.

Cloud providers, however, colocate multiple customers on the same physical infrastructure, which can cause issues with data collection because the collected data potentially contains information from multiple customers. If customer X has a need to provide forensic records to authorities, it can be a problem if their data is not sufficiently isolated from customers A through W.

To combat this issue, CSPs should enable strong logical controls (like encryption, VPCs, and so on) to isolate tenants from each other. Further, it’s important that providers and customers take steps to accurately document where separation does not exist and what data is out of scope.

Data location and jurisdiction

Public cloud environments have data centers and regions around the world. As I’m writing this book, AWS and Google Cloud both have 22 regions around the globe, and that number is probably even higher as you’re reading this. That’s great for providing high-performing services just about anywhere on Earth (or Mars, just in case a Mars region has launched by the time you read this).

Benefits aside, geographic dispersion can make collecting and managing forensic data a nightmare. When a customer’s virtualized workload is spread across many systems in a data center or even multiple data centers, it can be difficult to thoroughly track down all relevant data during an investigation. Before forensic data collection begins, a CSP must ensure that all locations of data are accounted for.

Tip Data location is not only a challenge when conducting customer-centric forensics, but also when a CSP needs to collect and preserve evidence for their own forensic investigations. Tracking the location of all provider and/or customer data is a massive task that requires providers to build sophisticated tooling and automated processes.

Aside from the technical challenges associated with data location, jurisdiction often comes into play. When a cybercrime occurs, the laws and regulations that govern the affected regions present challenges. A court order issued where the impacted data is located (let’s say Germany) may not be applicable to the same data stored in the United States. Many CSPs now offer customers the ability to select or restrict the region(s) in which their data is located. Cloud customers must understand what laws, regulations, and compliance requirements impact their data and choose CSPs and regions that can appropriately support their needs.

Evidence management

Managing chain of custody from evidence collection to trial is the most important part of any digital forensics investigation. You could have the best tools and the smartest analysts generate the most useful data, but it would all be for naught if you aren’t able to protect and prove the integrity of your evidence throughout its entire lifecycle. I cover chain of custody and non-repudiation in Chapter 4. If you haven’t already, you should really check it out for additional background.

Cloud providers must maintain well-documented policies and procedures for evidence management. Policies should establish what types of data to collect and how to properly engage with customers, authorities, and other stakeholders. It’s important that a provider understands how cloud-specific factors like multitenancy and virtualization impact their collection and preservation of evidence. In some cases, contractual obligations or jurisdictional requirements might mandate that a CSP disclose their collection activities and processes to third parties. In other situations, the investigation may be deemed confidential and require that evidence collection and management processes not be disclosed at all. For either scenario, requirements should be fully documented and understood prior to collecting or handling evidence.
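
One small technical piece of evidence management that translates nicely to code is integrity protection: hash every artifact as you collect it and record who collected it and when. The sketch below (the file path and analyst name are placeholders) uses Python’s standard hashlib; a real chain-of-custody system would also need tamper-evident storage, signatures, and witness records.

  import hashlib
  from datetime import datetime, timezone

  def record_evidence(path, collected_by, custody_log):
      """Hash an evidence file and append a chain-of-custody entry."""
      sha256 = hashlib.sha256()
      with open(path, "rb") as artifact:
          for chunk in iter(lambda: artifact.read(1024 * 1024), b""):
              sha256.update(chunk)
      entry = {
          "file": path,
          "sha256": sha256.hexdigest(),
          "collected_by": collected_by,
          "collected_at": datetime.now(timezone.utc).isoformat(),
      }
      custody_log.append(entry)
      return entry

  custody_log = []
  # Example call (placeholder path):
  # record_evidence("vm-memory.dump", "analyst01", custody_log)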

Managing Communication with Relevant Parties

Effective communication is critical in any business. A service provider must be able to communicate important facts to customers, vendors, and partners accurately, concisely, and in a timely manner to ensure that business operates smoothly and end-users remain informed and satisfied.

For IT services, and especially cloud providers, communication with regulators is another important facet to consider, as regulations impact not only the provider, but potentially thousands of customers and other stakeholders.

Customers

As a cloud service provider, communicating with customers is one of your top priorities. Anytime you provide a service to someone, it’s essential that you understand their needs and wants — an open dialogue between cloud provider and customer is the best way to achieve this goal. Customers are likely to directly engage CSP personnel most during the early stages of procurement, planning, and migration. Depending on the customer, a CSP may designate some combination of sales representatives, technical account managers, and solutions architects to learn a customer’s requirements and share the provider’s capabilities. Contract and SLA negotiations are pivotal during this time, and it’s important that the provider is fully transparent about what they can and cannot do.

Customers have less of a need to communicate directly with a CSP as their cloud journey progresses (at least they hope so!). The bulk of ongoing communications should be updates by the CSP related to service availability, system upgrades, and policy changes that have a customer impact.

Tip Effective communication isn’t solely based on phone calls, meetings, and emails. In fact, both customers and providers benefit from self-service access to information. One great example of this is AWS’ Service Health Dashboard (https://status.aws.amazon.com) that shows the current status for all services in all of their regions. This dashboard is a great example of automated communication that helps providers scale as business grows.

Perhaps the most important ongoing communication between a CSP and its customers involves providing up-to-date documentation. User guides and other product documentation are essential for customers to learn best practices when using their cloud services. Even more essential, CSPs must ensure that they accurately communicate which responsibilities are theirs and which belong to the customer. This information is usually documented in a customer responsibility matrix or similar document. These documents are often associated with some sort of regulation or compliance framework and can be used to communicate which security functions a CSP performs and which functions must be performed by the customer.

Vendors

Vendors (or suppliers) are a major part of a CSP’s ecosystem; I don’t know of a single CSP that doesn’t rely on multiple vendors in some form or fashion. Examples include hardware manufacturers, software vendors, outsourced maintenance professionals, and more. Many of these services are mission critical for a CSP; as such, effective communication with vendors is of the utmost importance. Most important CSP-vendor communication is captured within various contracts and agreements; these agreements should detail a vendor’s commitments to the CSP from onboarding through termination.

Partners

The types of partners that cloud providers rely on are wide-ranging and can include anything from companies that support go-to-market efforts to organizations that build their SaaS solutions on the provider’s IaaS infrastructure. The lines can sometimes blur between partners and vendors or customers, but it’s important to evaluate your partner relationships and determine the best method and frequency for communicating with each of them.

Regulators

Just about every cloud provider and cloud customer has regulations to comply with. As a CCSP, it’s your job to understand these regulations and help your organization craft ways to satisfy them. Communication with regulators should happen early and often. When a cloud provider is first planning to build their infrastructure or making significant changes to existing infrastructure (like constructing a new region), they should maintain communication with regulators to ensure that all requirements are fully understood. Further, cloud providers generally benefit by having security-minded personnel review and comment on upcoming regulations before they’re finalized. Regulatory bodies are often a step or two behind bleeding-edge technology. CSPs that understand the technology and its limitations better than anyone should communicate these factors to regulators in order to help shape realistic policies and requirements.

Other stakeholders

Aside from customers, vendors, partners, and regulators, you may find other stakeholders that need to be involved in your communication plans. This need may be based on a particular project or a request from one of the previously discussed parties. In any case, you must evaluate the communication needs of all stakeholders, and determine what, how, and when to communicate with them.

Managing Security Operations

This chapter has a lot of information related to building and deploying security controls to operate and manage the physical and logical components of a cloud environment. It’s just as important to consider managing and monitoring those controls to ensure that they’re configured properly and functioning as intended. Security operations includes all the tools, processes, and personnel required to accomplish that goal.

Security operations center (SOC)

A security operations center (SOC) is a centralized location where designated information security personnel continuously monitor and analyze an organization’s security posture. Rather than focusing on developing security strategies or implementing security controls, a SOC team is responsible for the ongoing, operational aspects of security. SOCs should be staffed 24/7, and their team members are responsible for detecting, reporting, and responding to security incidents using a combination of technology and well-established processes.

Tip In the olden days, a SOC was strictly considered to be a large room with lots of computers, large monitors, and staff members. That definition still holds true today, but we’re seeing the emergence of virtual SOCs. A virtual SOC allows teams in different locations to share resources and communicate in real-time in order to monitor and respond to security issues. This concept is almost a requirement for globally dispersed cloud providers.

Remember Your organization’s incident response (IR) team is likely different from your SOC team, though there may be overlap. Once the SOC identifies a potential incident, it should quickly trigger incident response procedures and work with the IR team to investigate and resolve the matter.

The scope of a cloud provider’s SOC should include all of its physical and logical assets — all hypervisors, servers, networking devices, storage units, and any other assets under the CSP’s control. While tenants’ guest operating systems are not generally monitored by a CSP, the SOC should still monitor for and identify any malicious activity between a tenant and the CSP or from one tenant to another.

Monitoring of security controls

On a more tactical level, monitoring should be in place to assess and validate the effectiveness of your security controls. If you’ve used a defense-in-depth approach (and I know you have), then you’ll need layers of monitoring in place to continuously ensure that your controls remain in place, unmodified, and fully functional.

Monitoring your security controls should begin with documentation that defines the intent of each control and how it is implemented. You should also have documentation that identifies how to monitor each security control, either manually or by automated means. In some cases, vendors provide best practices on how to monitor their security controls. Other scenarios require that your information security team, led by its fearless CCSP, generate the necessary guidance. Make sure that you also maintain complete documentation of all security configuration baselines. I talk about documenting baselines earlier in this chapter, and that documentation becomes especially important when it’s time to monitor them.
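
As one example of automated control monitoring, the following boto3 sketch checks whether default encryption is enabled on an S3 bucket and raises an alert if it isn’t. The bucket name is a placeholder, and this is just one control; in practice you’d run checks like this across your whole baseline on a schedule.

  import boto3
  from botocore.exceptions import ClientError

  s3 = boto3.client("s3")

  def check_bucket_encryption(bucket):
      """Verify that default encryption is configured on an S3 bucket."""
      try:
          config = s3.get_bucket_encryption(Bucket=bucket)
          rules = config["ServerSideEncryptionConfiguration"]["Rules"]
          print(f"{bucket}: default encryption enabled ({rules})")
      except ClientError as error:
          if error.response["Error"]["Code"] == \
                  "ServerSideEncryptionConfigurationNotFoundError":
              print(f"{bucket}: ALERT - default encryption is NOT enabled")
          else:
              raise

  check_bucket_encryption("example-sensitive-data-bucket")  # placeholder name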

Conducting vulnerability assessments and penetration tests is a great way to indirectly assess the effectiveness of your security controls. These tests expose weaknesses in your environment, and penetration tests can even show you the impact of those weaknesses being exploited. Performing these tests can highlight weak areas in your security architecture and help you identify the need for new or modified security controls.
