DOMAIN 2
Cloud Data Security

RESPONSIBILITY FOR MANY ELEMENTS of cloud security is shared between the cloud service provider (CSP) and the cloud consumer, while each party retains exclusive control over some elements. For example, the CSP is always responsible for the physical security of the infrastructure, while the consumer retains control over the identity and access management concerns of their applications and services. When it comes to securing data stored in the cloud, the most important thing to remember is that the consumer is always ultimately accountable, which means they must not only take steps to secure data but also ensure the CSP is implementing adequate security practices.

DESCRIBE CLOUD DATA CONCEPTS

Data is a crucial element to most modern organizations, so processes to provision and secure the systems that store, process, and transmit that data are essential. The cloud strategy for most organizations will include a variety of personnel in different roles, from the top C-level executives or other executive management all the way to operational personnel responsible for day-to-day functions such as data input and processing. To provide adequate security, it is vital to have a model for understanding how data is created, stored, and used in the cloud environment such as the cloud data lifecycle.

Cloud Data Lifecycle Phases

Unlike valuable physical assets such as precious metals or other objects, information can be hard to definitively identify and secure. Data is constantly being generated, used, stored, transmitted, and, once it is no longer valuable, destroyed. In such a dynamic environment, it can be useful to model the phases that information passes through. This model provides a generic way of identifying the broad categories of risks facing the data and associated protections, rather than trying to identify each individual data element in an organization.

The secure cloud data lifecycle is roughly linear, though some data may not go through all phases, and data may exist in multiple phases simultaneously. Regardless, data should be protected in each phase by controls commensurate with its value. Figure 2.1 illustrates the lifecycle.


FIGURE 2.1 The secure data lifecycle

The cloud data lifecycle is not strictly iterative; that is, it does not necessarily repeat, unlike other lifecycles such as the systems development lifecycle. Data may also exist in one or more of the phases simultaneously; for example, data being created may be shared and stored at the same time if the user saves a file to a shared drive that other users can access. However, data in each phase of the lifecycle faces specific risks, and the cloud security practitioner should thoroughly understand these risks and appropriate mitigations specific to each phase.

  • Create: Data is created when it is first entered into a system or whenever it is modified. Examples include a user writing a document and saving it, a process collecting data through an API and placing the data into a database, or a user opening a shared document and updating its contents. If data is being modified, the data lifecycle iterates, but it is also normal for data to be created once and never be updated.
    • Create phase controls: Data classification is a foundational security control, as it allows the organization to identify data's value and implement appropriate controls. During creation, data should be classified by the creator or owner. In some systems, this might be a manual process such as placing the classification in a document header/footer. In other cases, the classification of data may be done by a system owner. In this case, the users do not make classification decisions, but all data stored in the system will be classified the same and protected by system-level controls.
  • Store: Storage is the act of saving data in a retrievable location, such as a solid-state drive (SSD), application database, or hard-copy records. In many cases, the storage phase may occur simultaneously with creation; for example, a user uploads a new document to a fileshare.
    • Store phase controls: Data being stored may need protection in transit, especially for cloud services where the data must cross public networks to reach the organization's cloud apps. These protections include the use of Transport Layer Security (TLS), a virtual private network (VPN), Secure Shell (SSH), or other secure data in transit controls. The actual act of storing data should be governed by policies and procedures, such as restrictions on where data of differing classification levels may be stored, access control and granting procedures, and technical controls such as encryption implemented by default for highly sensitive data. Once stored, data should be protected using appropriate access controls and encryption management to preserve confidentiality, as well as adequate backups to preserve both integrity and availability.
  • Use: The use phase comprises accessing, viewing, and processing data, sometimes collectively referred to as handling. Data can be handled in a variety of ways, such as accessing a web application, reading and manipulating files, or fetching data via an API.
    • Use phase controls: The use phase is typically the longest lasting phase of the data lifecycle—just as systems spend most of their life in the operations and maintenance phase while being actively used. The number of controls is commensurate, such as managing data flow with data loss prevention (DLP), information rights management (IRM), system access controls such as authorization and access reviews, network monitoring tools, and the like. Accountability controls are also crucial in this phase, which requires adequate logging and monitoring of access. Note that the data states (data in use, data in transit, and data at rest) are not directly related to this data lifecycle phase. Data being actively “used” by a company may be stored on a laptop hard drive (data at rest) or be sent to a web app for processing (data in transit). The use phase is simply when data is actively being accessed or handled.
  • Share: The share phase is relatively self-explanatory: access to data is granted to other users or entities that require access. Not all data will be shared, and not all sharing decisions will be made explicitly. Highly classified data may be severely restricted to access by just a few critical personnel, while less sensitive data may be shared by virtue of being stored in a shared system like a collaboration app.
    • Share phase controls: Access controls will form the majority of security safeguards during the share phase, both proactive and reactive. Proactive access controls include role-based access authorizations and access-granting procedures. Reactive controls such as DLP, IRM, and access reviews can detect when unauthorized sharing occurs and, in some cases, prevent the shared data from leaving organizational control.
  • Archive: Data may reach the end of its useful life but still need to be retained for a variety of reasons, such as ongoing legal action or a compliance requirement. Archiving data is the act of placing it into long-term retrievable storage. This may be required depending on the type of organization and data you are processing, such as financial or healthcare data, which has mandated retention requirements. Archiving is also an operational cost benefit, as it moves data that is not actively being used to cheaper, though possibly slower, storage.
    • Archive phase controls: Data controls in the archive phase are similar to the store phase, since archiving is just a special type of storage. Due to long timeframes for storage in archives, there may be additional concerns related to encryption. Specifically, encryption keys may be rotated on a regular basis. When keys are rotated, data encrypted with them is no longer readable unless you can access that older key; this leads to solutions such as key escrow or the cumbersome process of decrypting old data and re-encrypting it with a new key. Additionally, storage formats may become obsolete, meaning archived data is not readable, or archival media may degrade over time, leading to losses of integrity or availability.
  • Destroy: Data that is no longer useful and that is no longer subject to retention requirements should be securely destroyed. Methods of destruction range from low security such as simply deleting filesystem pointers to data and reusing disks to more secure options such as overwriting disks, physical destruction of storage media, or the use of cryptographic erasure or cryptoshredding, whereby data is encrypted and the keying material is securely erased. Data destruction can be a particular challenge for cloud security practitioners as the CSP has physical control over media, preventing the use of options like physical destruction. Additionally, cloud services such as Platform as a Service (PaaS) and Software as a Service (SaaS) typically do not provide access to underlying storage for tasks like overwriting.
    • Destroy phase controls: Choosing the proper destruction method should balance two main concerns: the value of the data being destroyed and the options available in a cloud service environment. Low-sensitivity data such as public information does not warrant extraordinary destruction methods, while high-value information like personally identifiable information (PII) does. Various resources provide guidance for selecting a destruction method, such as the NIST SP 800-88, Guidelines for Media Sanitization, available here: csrc.nist.gov/publications/detail/sp/800-88/rev-1/final.

Data Dispersion

Data dispersion refers to a technique used in cloud computing environments in which data is broken into smaller chunks and stored across different physical storage devices. It is similar to the concept of striping in a redundant array of independent disks (RAID), where data is broken into smaller segments and written across multiple disks, but cloud-based data dispersion also implements erasure coding to allow for reconstruction of data if some segments are lost.

Erasure coding is similar to the idea of a parity bit calculation in RAID storage. In simple terms, data being written is broken into multiple segments, a mathematical calculation is conducted on the segments, and the result is stored along with the data. In the event that some segments are lost, the parity bit and the remaining segments can be used to reconstruct the lost data, similar to solving for a variable in an algebra equation.
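To make the parity idea concrete, here is a minimal Python sketch (an illustration only, not how any particular CSP implements dispersion) that splits data into segments, computes an XOR parity segment, and then reconstructs a segment that has been lost.

```python
# Minimal illustration of XOR parity, the same idea RAID uses and a
# greatly simplified stand-in for the erasure coding behind data dispersion.

def split_into_segments(data: bytes, count: int) -> list[bytes]:
    """Break data into fixed-size segments, padding the last one with zeros."""
    size = -(-len(data) // count)  # ceiling division
    return [data[i * size:(i + 1) * size].ljust(size, b"\x00") for i in range(count)]

def xor_parity(segments: list[bytes]) -> bytes:
    """Compute a parity segment: the byte-wise XOR of all data segments."""
    parity = bytearray(len(segments[0]))
    for segment in segments:
        for i, byte in enumerate(segment):
            parity[i] ^= byte
    return bytes(parity)

def reconstruct(remaining: list[bytes], parity: bytes) -> bytes:
    """Recover a single lost segment by XORing the parity with the survivors."""
    return xor_parity(remaining + [parity])

segments = split_into_segments(b"customer order history", 4)
parity = xor_parity(segments)

lost = segments.pop(2)                      # simulate losing one storage device
recovered = reconstruct(segments, parity)   # rebuild it from parity + survivors
assert recovered == lost
```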

Data dispersion in cloud environments can have both positive and negative impacts on an organization's security. The benefit of availability is obvious—even if a physical storage device fails, the data on it can be reconstructed, and if a physical device is compromised, it does not contain a complete, usable copy of data. However, dispersing the segments can cause issues. If data is dispersed to countries with different legal or regulatory frameworks, the organization may find itself subject to unexpected laws or other requirements. Most CSPs have implemented geographic restriction capabilities in their services to allow consumers the benefit of dispersion without undue legal/regulatory complexity.

Properly configuring the cloud service is a crucial task for the Certified Cloud Security Professional (CCSP) to meet the organization's compliance objectives. If the organization is subject to European Union (EU) General Data Protection Regulation (GDPR) requirements, it may be preferable to maintain data in the EU rather than dispersing it to all countries the CSP operates in. This may be a simple configuration for a particular service, or may require complex information technology (IT) project planning and system architecture.

One potential downside of data dispersion is latency, due to the additional processing overhead required to perform the erasure coding and reconstruct data. This is similar to the performance considerations in RAID setups that implement a parity bit. The additional time and processing capacity may introduce system latency, which can have negative consequences on system availability. This is especially true for high-volume, transaction-based systems or for systems with highly dynamic data like fileshares.

DESIGN AND IMPLEMENT CLOUD DATA STORAGE ARCHITECTURES

Pooling resources is one of the key elements of cloud services. Virtualized pools of storage are much more flexible than installing and maintaining individual disks or storage networks, but it can be difficult to identify exactly where and how data is being stored in these broad pools. Understanding the storage options available in different cloud service models is essential. The CCSP should be aware of general storage types, threats, and countermeasures associated with each, as well as the specific offerings of their chosen CSP. Traditional decisions such as directly choosing SSDs for high-speed storage needs may be replaced instead by a set of configurable parameters including the quantity and speed of data access.

Storage Types

Each of the three cloud service models offers unique storage types designed to support the needs and use cases of the particular service model. These options are detailed next. It is important for the CCSP to note that CSPs may offer storage of a particular type but with unique branding or naming conventions. There are also novel storage solutions used for specific industries that are outside the scope of the CCSP exam.

IaaS

Infrastructure as a Service (IaaS) typically offers consumers the most flexibility and also requires the most configuration. Types of storage available in IaaS include the following:

  • Ephemeral: Unlike the other storage types discussed in the following sections, ephemeral storage is not designed to provide extended storage of data. Similar to random access memory (RAM) and other volatile memory architectures, ephemeral storage lasts as long as a particular IaaS instance is running, and the data stored in it is lost when the virtual machine (VM) is powered down. Ephemeral storage is often packaged as part of compute capability rather than storage, as modern operating systems (OSs) require temporary storage locations for system files and memory swap files.
  • Raw: Raw device mapping (RDM) is a form of virtualization that allows a particular cloud VM to access a storage logical unit number (LUN). The LUN is a dedicated portion of the overall storage capacity for use by a single VM, and RDM provides the method for a VM to access its assigned LUN.
  • Long-term: As the name implies, long-term storage is durable, persistent storage media that is often designed to meet an organization's records retention or data archiving needs. The storage may offer features such as search and data discovery as well as unalterable or immutable storage for preserving data integrity.
  • Volume: Volume storage behaves like a traditional drive attached to a computer, but in cloud data storage, both the computer and drive are virtualized. Volume storage may store data in blocks of a predetermined size, which can be used for implementing data dispersion. Because the disk is virtualized, the data may actually be stored across multiple physical disks in the form of blocks along with the erasure coding needed to reconstruct the data if some blocks are missing.
  • Object: Object storage is similar to accessing a Unix share or Windows file server on a network. Data is stored and retrieved as objects, often in the form of files, and users are able to interact with the data objects using file browsers.

PaaS

Some PaaS offerings provide the ability to connect to IaaS storage types, such as connecting a volume to a PaaS VM to provide a virtual disk for storing and accessing data. There are storage types unique to PaaS, however, including the following:

  • Disk: This is a virtual disk that can be attached to a PaaS instance and may take the form of a volume or object store, depending on the CSP offering and consumer needs. Many PaaS offerings simply offer parameters for storage connected to the PaaS instance, such as speed and volume of I/O operations or data durability, and the CSP provisions appropriate storage space based on the configurations specified by the consumer.
  • Databases: This is both a storage type and PaaS offering. Platforms that can be delivered as a service include popular database software such as Microsoft SQL Server and Oracle databases, as well as CSP-specific offerings such as Amazon Relational Database Service (RDS) or Microsoft Azure databases. In most cases, these databases will be offered in a multitenant model with logical separation between clients, and data is accessed via API calls to the database.
  • Binary Large Object (blob): Blobs are unstructured data; that is to say, data that does not adhere to a particular data model like the columns in a database. These are often text files, images, or other binary files generated by applications that allow users to generate free-form content, though it is possible to apply some loose organization. This is similar to manually organizing files into folders, such as word processing files by date of writing or photos by vacation destination. Blob storage services such as Amazon Simple Storage Service (S3) and Azure Blob Storage apply these concepts to large volumes of blob data and typically make it available to applications or users via a URL.

Some types of storage platforms or storage types may be specific to a particular CSP's offerings. Examples include blob storage for unstructured data in Microsoft Azure, and a variety of queue services available in AWS that support the short-term storage, queueing, and delivery of messages to users or services.

SaaS

SaaS offerings are the most abstracted service model, with CSPs retaining virtually all control including data storage architecture. In some cases, the data storage type is designed to support a web-based application that permits users to store and retrieve data, while other storage types are actual SaaS offerings themselves, such as the following:

  • Information storage and management: This storage type allows users to enter data and manipulate it via a web GUI. The data is stored in a database managed by the CSP and often exists in a multitenant environment, with all details abstracted from the users.
  • Content and file storage: Data is stored in the SaaS app in the form of files that users can create and manipulate. Examples include filesharing and collaboration apps, as well as custom apps that allow users to upload or attach documents such as ticketing systems.
  • Content delivery network (CDN): CDNs provide geographically dispersed object storage, which allows an organization to store content as close as possible to users. This offers advantages of reducing bandwidth usage and usually delivers lower latency for end users as they can pull from a server physically closer to their location.

One feature of most cloud data storage types is their accessibility via application programming interfaces (APIs). The virtualization technologies in use for cloud services create virtual connections to the pooled storage resources, replacing physical cable connections. These APIs mean the storage types may be accessible from more than one service model; for example, object storage may be accessed from a PaaS or SaaS environment if an appropriate API call is made. Many CSPs have specifically architected their storage APIs for broad use in this manner, like Amazon's Simple Storage Service (S3), which is an object storage type accessible via a REST API. This enables IaaS, PaaS, SaaS, on-premises systems, and even users with a web browser to access it.
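As a brief illustration of API-driven object storage, the following sketch uses the AWS SDK for Python (boto3) to store and retrieve an object in S3. The bucket name and object key are hypothetical, and the snippet assumes AWS credentials are already configured in the environment.

```python
import boto3

# Object storage is reached through API calls rather than a mounted disk,
# so the same bucket can be used from IaaS, PaaS, SaaS integrations, or a laptop.
s3 = boto3.client("s3")

BUCKET = "example-org-reports"  # hypothetical bucket name

# Store an object (roughly an HTTP PUT under the hood).
s3.put_object(Bucket=BUCKET, Key="2024/q1-summary.txt", Body=b"Quarterly summary...")

# Retrieve the same object from anywhere with network access and credentials.
response = s3.get_object(Bucket=BUCKET, Key="2024/q1-summary.txt")
print(response["Body"].read().decode())
```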

Threats to Storage Types

There are universal threats to data at rest (in storage) regardless of the location, including on-premises or legacy environments, local workstation storage, and cloud services. These affect all three elements of the confidentiality, integrity, availability (CIA) triad: unauthorized access is a threat against confidentiality, improper modification represents a threat to integrity, and loss of a storage device or loss of connectivity is a threat against availability. Tools that are appropriate for on-premises environments may not work in a distributed cloud environment, so the CCSP should be aware of how these threats impact cloud storage and appropriate countermeasures.

  • Unauthorized access: Any user accessing data storage without proper authorization presents obvious security concerns. Appropriate access controls are required by the consumer to ensure only properly identified and authorized internal users are able to access data stored in cloud services. Because of the multitenant nature of cloud storage, the CSP must provide adequate logical separation to ensure cloud consumers are not able to access or tamper with data that does not belong to them.
  • Unauthorized provisioning: This is primarily a cost and operational concern. The ease of provisioning cloud services is one of the selling points versus traditional, on-premises infrastructure. This ease of use can lead to unofficial or shadow IT, which drives unrestricted growth in the cloud services and associated costs. Unauthorized storage can also act as a blind spot when it comes to security; if the security team is not aware of where the organization is storing data, the team cannot take appropriate steps to secure that data.
  • Regulatory noncompliance: Certain cloud service offerings may not meet all the organization's compliance requirements, which leads to two security concerns. First are the consequences of noncompliance like fines or suspension of business operations. Second is the foundation of the compliance requirements in the first place—to protect data. Requirements like the use of a specific encryption algorithm are usually driven by a need to protect data; cloud services that do not meet the compliance objectives are also unlikely to offer adequate security for the regulated data being stored.
  • Jurisdictional issues: The major CSPs are global entities and offer highly available services, many of which rely on global redundancy and failover capabilities. Unfortunately, the ability to transfer data between countries can run afoul of legal requirements, particularly privacy legislation that bars the transfer of data to countries without adequate privacy protections. The features in a particular cloud storage service may support global replication by default, so it is incumbent on the CCSP to understand both their organization's legal requirements and the configuration options available in the CSP environment.
  • Denial of service: Cloud services are broadly network accessible, which means they require active network connectivity to be reachable. In the event a network connection is severed anywhere between the user and the CSP, the data in storage is rendered unavailable. Targeted attacks like a distributed denial of service (DDoS) can also pose an issue, though the top CSPs have robust mitigations in place and are usually able to maintain some level of service even during an attack.
  • Data corruption or destruction: This is not a concern unique to cloud data storage. Issues such as human error in data entry, malicious insiders tampering with data, hardware and software failures, or natural disasters can render data or storage media unusable.
  • Theft or media loss: This threat applies more to devices that can be easily accessed and stolen, such as laptops and universal serial bus (USB) drives; however, the risk of theft for cloud data storage assets like hard drives does exist. CSPs retain responsibility for preventing the loss of physical media through appropriate physical security controls. Consumers can mitigate this risk by ensuring adequate encryption is used for all data stored in the cloud, which renders the stolen data useless unless the attacker also has the key.
  • Malware and ransomware: Any location with data storage and processing abilities is at risk from malware, particularly ransomware. Attackers have become more sophisticated when writing ransomware, so it not only encrypts data stored on locally attached drives but also seeks common cloud storage locations like well-known collaboration SaaS apps. Proper access controls and anti-malware tools can prevent or detect malware activities.
  • Improper disposal: Like physical drive loss, the CSP has the majority of responsibility when it comes to disposal of hardware. Ensuring hardware that has reached the end of its life is properly disposed of in such a way that data cannot be recovered must be part of the CSP's services. Consumers can protect data by ensuring it is encrypted before being stored in the cloud service and that the encryption keys are securely stored away from the data.

DESIGN AND APPLY DATA SECURITY TECHNOLOGIES AND STRATEGIES

Data security in the cloud comprises a variety of tools and techniques. According to the shared responsibility model published by the major CSPs, consumers are responsible for securing their own data. Under most privacy legislation, the data owner, who is usually the cloud consumer, is ultimately accountable and legally liable for data breaches. However, adequately securing data requires actions by both the cloud consumer and the CSP to properly secure elements such as the hardware infrastructure and physical facilities.

Encryption and Key Management

Encryption is the process of applying mathematical transformations to data to render it unreadable. It typically requires the use of a key or cryptovariable, which is a string of data used by the cryptographic system to transform the data. The steps taken to achieve the transformation are known as an algorithm.

Modern cryptographic algorithms like Rijndael, which is part of the Advanced Encryption Standard (AES), offer protection for data that could take thousands or millions of years to break. Trying every possible combination of keys and permutation steps in the algorithm would take more time and resources than most attackers have available, but the process of encrypting and decrypting is relatively short if you know the key. Encryption is a foundational element of modern data security, particularly in cloud environments when data is stored outside of the organization's direct control.
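The following minimal sketch shows symmetric encryption and decryption with AES in GCM mode, assuming the pyca/cryptography package is available; note that only the key (and the uniqueness of the nonce) must be protected, not the algorithm itself.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Generate a random 256-bit AES key; in practice this would come from a
# managed key store or HSM rather than being created ad hoc.
key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

plaintext = b"cardholder record 4111-xxxx-xxxx-1111"
nonce = os.urandom(12)  # a unique nonce is required for every encryption

# Encrypt, then decrypt with the same key and nonce; without the key the
# ciphertext is computationally infeasible to recover.
ciphertext = aesgcm.encrypt(nonce, plaintext, None)
assert aesgcm.decrypt(nonce, ciphertext, None) == plaintext
```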

Due to the criticality of encryption, organizations should focus attention on properly implementing and managing cryptographic systems. One particular area of focus is managing encryption keys. Auguste Kerckhoffs, a Dutch cryptographer, defined a simple doctrine that underpins key management: a cryptosystem should be secure even if everything about the system, except the key, is public knowledge. Known as Kerckhoffs's principle, this simple maxim guides security in cryptographic systems by placing the emphasis on protecting keys.

Keys should be classified at the highest data classification level available in the organization and protected as other high value assets would be. It is appropriate to implement controls across all categories including policies for creating and managing keys, operational procedures for securing and using keys, and tools and systems for handling keys. As a data asset, the keys should be protected at each stage of their lifecycle, including the following:

  • Creating strong, random keys using cryptographically sound inputs like random numbers
  • Storing keys in a secure manner, whether encrypted inside a key vault or stored on a physical device, and handling the process of storing copies for retrieval if a key is ever lost (known as key escrow)
  • Using keys securely, primarily focused on access controls and accountability
  • Sharing keys is not as common due to their highly sensitive nature, but facilities should exist for sharing public keys, securely transferring symmetric keys to a communications partner, and distributing keys to the key escrow agent
  • Archiving keys that are no longer needed for routine use but might be needed for previously encrypted data
  • Secure destruction of keys that are no longer needed or that have been compromised

FIPS 140-3 provides a scheme for U.S. government agencies to rely on validated cryptographic modules and systems, though it has become a globally recognized framework as many tools offer Federal Information Processing Standards (FIPS) validated modes for encryption. As of 2020, this standard is being phased in to replace its predecessor, FIPS 140-2. FIPS 140-3 establishes a framework and testing scheme for validating the strength of protection provided by a cryptographic module and defines levels of protection for such modules including physical tamper-evident hardware security modules (HSMs). Details on the standard can be found here: csrc.nist.gov/publications/detail/fips/140/3/final.

Cloud security practitioners will need to understand where encryption can be deployed to protect their organization's data and systems. Many CSPs offer virtualized HSMs that are validated against the FIPS 140-2 standard and can be used to securely generate, store, and control access to cryptographic keys. These virtual HSMs are designed to be accessible only by the consumer and never by the CSP and are usually easy to integrate into other cloud offerings from the same CSP.

If your organization uses multiple cloud providers or needs to retain physical control over key generation, your apps should be architected to allow for a bring-your-own-key strategy. This is more technically challenging for the organization, as hosting any on-premises systems requires more skills and resources, but it offers more control over the configuration and use of encryption, as well as physical control over the HSMs.

Encryption in cloud services may be implemented at a variety of layers, from the user-facing application all the way down to the physical storage devices. The goals of safeguarding data, such as counteracting threats of physical theft or access to data by other tenants of the cloud service, will drive decisions about which types of encryption are appropriate to an organization. Some examples include the following:

  • Storage-level encryption provides encryption of data as it is written to storage, utilizing keys that are controlled by the CSP. It is useful in cases of physical theft as the data should be unreadable to an attacker, but CSP personnel may still be able to view data as they control the keys.
  • Volume-level encryption provides encryption of data written to volumes connected to specific VM instances, utilizing keys controlled by the consumer. It can provide protection in the case of theft and prevents CSP personnel or other tenants from reading data, but it is still vulnerable if an attacker gains access to the instance.
  • Object-level encryption can be done on all objects as they are written into storage, in which case the CSP likely controls the key and could potentially access the data. For high-value data, it is recommended that all objects be encrypted by the consumer with keys they control before being stored.
  • File-level encryption is often implemented in client applications such as word processing or collaboration apps like Microsoft Word and Adobe Acrobat. These apps allow for encryption and decryption of files when they are accessed using keys controlled by the user, which prevents the data from being read by CSP personnel or other cloud tenants. The keys required may be manually managed, such as a password the user must enter, or automated through IRM, which can verify a user's authorization to access a particular file and decrypt it based on the user's provided credentials.
  • Application-level encryption is implemented in an application typically using object storage. Data that is entered or created by a user is encrypted by the app prior to being stored. Many SaaS platforms offer a bring-your-own-key ability, which allows the organization to prevent CSP personnel or other cloud tenants from being able to access data stored in the cloud.
  • Database-level encryption may be performed at a file level by encrypting database files or may utilize transparent encryption, which is a feature provided by the database management system (DBMS) to encrypt specific columns, whole tables, or the entire database. The keys utilized are usually under the control of the consumer even in a PaaS environment, preventing CSP personnel or other tenants from accessing data, and the encrypted data is also secure against physical theft unless the attacker also gains access to the database instance to retrieve the keys.

Hashing

Hashing, sometimes known as one-way encryption, is a tool primarily associated with the integrity principle of the CIA triad. Integrity deals with preventing, detecting, and correcting unintended or unauthorized changes to data, both malicious and accidental. Cryptographic algorithms called hash functions take data of any input length and perform mathematical operations to create a unique hash value. This process can be performed again in the future and the two hash values compared; if the input data has changed, the mismatched hash values are proof that the data has been altered.
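A minimal sketch using Python's standard hashlib module shows how comparing a stored hash with a freshly computed one reveals whether data has changed.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the input as a hex string."""
    return hashlib.sha256(data).hexdigest()

original = b"wire transfer: $1,000 to account 12345"
baseline = sha256_hex(original)             # stored when the data is created

tampered = b"wire transfer: $9,000 to account 12345"
assert sha256_hex(original) == baseline     # unchanged data matches the baseline
assert sha256_hex(tampered) != baseline     # any change produces a different hash
```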

Hashes form an integral part of digital signatures, which provide users the ability to verify both the integrity and the source of a message, file, or other data such as an app. The signing party calculates a hash and encrypts the hash value with their private key. A receiving party, who may be a user receiving a message, downloading an app, or pulling software from a repository, calculates a hash of the received data and then decrypts the sender's digital signature using the sender's public key. If the two hashes match, it can be assumed the data is original, as no other user would be able to change the data, calculate a hash, and use the original sender's private key to create the digital signature. A CCSP should be aware of digital signatures as a method for verifying messages and apps used in cloud environments, especially when third-party software is being integrated.
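The sign-and-verify flow can be sketched as follows using RSA-PSS from the pyca/cryptography package (an illustrative assumption; key distribution, certificates, and trust decisions are omitted).

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# The signer holds the private key; anyone with the public key can verify.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"release artifact v1.2.3"

# Sign: the message is hashed and the hash is signed with the private key
# (both steps are handled internally by sign()).
signature = private_key.sign(
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Verify: raises InvalidSignature if the message or signature was altered.
public_key.verify(
    signature,
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
```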

Hashes can provide multiple security services in cloud environments. They can verify copies of data like backups are accurate and can be used to verify the integrity of messages like email. They are also widely implemented in many security tools as a way to detect changes, which can indicate system compromise. File integrity monitoring is used by some anti-malware and intrusion detection systems to identify changes to key system files, and highly secure systems may create hashes of data about system hardware such as manufacturer, devices connected, or model numbers. In both cases, there is an expectation that these items should not change; comparing a current hash with a previously calculated hash can identify unwanted changes.

When implementing hash functions, it is important to choose a strong function that is collision-resistant. A collision occurs when two different inputs produce the same hash value as an output. In this case, it is impossible to rely on the hash function to prove integrity. As with many aspects of information security, there is a U.S. federal government standard related to hashes. FIPS 180-4, Secure Hash Standard (SHS), provides guidance on the SHA-1 and SHA-2 families of Secure Hash Algorithms (the newer SHA-3 family is specified separately in FIPS 202). As with FIPS 140-2 and 140-3 encryption, many popular tools and platforms provide FIPS-compliant modes for hash algorithms. More details on SHS can be found here: csrc.nist.gov/publications/detail/fips/180/4/final.

Masking

Masking is similar to obfuscation, which is discussed later in the chapter, and both are used to prevent disclosure of sensitive data. Data masking involves hiding specific elements of data for certain use cases, primarily when there is a need for data to be retrievable for some but not all users or processes. As an example, a corporate human resources (HR) system may need to store a user's Social Security number (SSN) for payment and tax purposes. Daily users accessing the HR system do not have a need to see the full SSN, so the system instead displays XXX-XX-1234, as the last four digits are needed to verify a user's identity.
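A minimal sketch of this display-time masking, assuming U.S. SSNs formatted as 123-45-6789:

```python
def mask_ssn(ssn: str) -> str:
    """Show only the last four digits of a U.S. Social Security number."""
    last_four = ssn.replace("-", "")[-4:]
    return f"XXX-XX-{last_four}"

print(mask_ssn("123-45-6789"))  # XXX-XX-6789
```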

Data masking can be useful in preventing unintended disclosures by limiting the amount of data displayed. It is a very granular implementation of minimum necessary access. Although a user may be authorized to view full SSN information, in the daily use case of managing employee records, they do not have a need to see the full information.

Unstructured data can present problems for masking, as well as tokenization, obfuscation, and de-identification. When data is structured in a database, it is easy to identify and apply these techniques. Unstructured data can be stored in files, free-form text or comment fields in databases, or in applications that store data objects without structure. As an example, it would be quite simple to identify and apply masking to a database column labeled Social Security Number, but if some records have that data in a comments field along with other data, those records will obviously not be masked. Data handling and system use policies should dictate the proper use of information, including where to store sensitive data elements. If the organization utilizes unstructured storage or formats, data security tools must also be chosen to deal with unstructured data types.

Tokenization

Tokenization is a process whereby a nonsensitive representation of sensitive data, otherwise known as a token, is created and used. The token is a substitute to be used in place of more sensitive data like a credit card number, often called a primary account number (PAN). Rather than storing and using PANs, which is risky due to the value of a PAN, tokens can be used instead.

Tokens can be traced back to the original information by making a proper request to the tokenization service, which usually implements access controls to verify user identities and authorization to view sensitive data. Tokens are implemented in many online payment systems where credit card numbers appear to be stored; the card numbers are not really stored in the app but are instead tokenized. When the user makes a purchase, the app supplies the token along with user identity information to the tokenization server, which, if it accepts the information provided, accesses the relevant credit card data and supplies it to complete the transaction.

Using tokens instead of the actual data reduces risk by removing sensitive data needed by the application. A database of live credit card PANs would be incredibly valuable to a thief, who could use those cards to make purchases. The same database full of tokens is virtually worthless. Tokenization systems are obviously high-value targets, but due to their specialized function, it is possible to more robustly secure them versus a general-purpose application.

Although tokenization is widely used in credit card processing transactions and is a recommended control for Payment Card Industry Data Security Standard (PCI DSS) compliance, any sensitive data can be tokenized to reduce risk. Implementations vary, but the process of tokenizing data generally follows the same basic steps (a minimal sketch follows the numbered list):

  1. An application collects sensitive information when a user enters it.
  2. The app secures the sensitive data, often using encryption, and sends it to a tokenization service.
  3. Sensitive data is stored in the token database, and a token representing the data is generated and stored in the token database along with the sensitive data.
  4. The token is returned to the original application, which stores it instead of the original sensitive data.
  5. Any time the sensitive data is required, the token and appropriate credentials can be used to access it. Otherwise, the sensitive data is never revealed, as the tokenization service should be tightly access controlled.
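The following toy sketch models these steps with an in-memory token vault; a real tokenization service would add strong authentication and authorization, encryption of the vault itself, and audit logging.

```python
import secrets

class TokenVault:
    """Toy tokenization service: maps random tokens to sensitive values."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}   # token -> sensitive data (step 3)

    def tokenize(self, sensitive_value: str) -> str:
        token = "tok_" + secrets.token_urlsafe(16)   # nonsensitive substitute
        self._store[token] = sensitive_value
        return token                                  # returned to the app (step 4)

    def detokenize(self, token: str) -> str:
        # In a real service this call would be tightly access controlled (step 5).
        return self._store[token]

vault = TokenVault()
token = vault.tokenize("4111111111111111")   # PAN collected by the app (steps 1-2)
print(token)                                  # safe to store in the application DB
print(vault.detokenize(token))                # only the vault can map it back
```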

Data Loss Prevention

DLP, sometimes also known as data leakage prevention, refers to a technology system designed to identify, inventory, and control the use of data that an organization deems sensitive. It spans several categories of controls including detective (identifying where sensitive data is stored and being used), preventative (enforcing policy requirements on the storage and sharing of sensitive data), and corrective (displaying an alert to the user informing them of the policy violation and preventing inappropriate action such as sending sensitive data via email).

Due to the multiuse nature of DLP, many organizations will implement only some of the functions at one time. An organization is unlikely to be successful attempting to simultaneously perform an organization-wide data inventory, deploy new technology in the form of DLP agents and network devices, and manage process changes due to the new DLP functionality. In cloud security environments, particularly when the enterprise architecture combines on-premises and cloud services, DLP can be useful for enforcing policies on the correct use of various systems, such as storing regulated data only in approved on-premises repositories rather than cloud storage.

A typical DLP installation will comprise three major components:

  • Discovery is a function of DLP that allows the organization to identify, categorize, and inventory data assets. In large organizations, this may be the first step of a phased rollout, while in smaller organizations with fewer assets, it may not be required. DLP scanners identify data storage locations belonging to the organization, typically by performing network scans to identify targets such as fileshares, storage area networks (SANs), databases, common collaboration platforms like SharePoint, and cloud services like Google Drive or Dropbox. The scan will likely require as input some details of the organization, such as IP ranges or domains, over which it will perform the scans.
    • Once the tool has created a blueprint of the organization's network and likely storage sources, it will scan the identified targets to identify data based on common formats such as xxx-xx-xxxx, which represents a U.S. Social Security number. Organization-defined sensitive data can also be identified, such as documents that contain strings like “Confidential – for internal use only” in document footers, or by utilizing regular expressions to identify sensitive data the organization generates. Highly privileged credentials will likely be required, so managing access controls for the DLP scanner is a major focus. Most scanners offer the ability to categorize information based on standard categories such as PII, protected health information (PHI), payment information, etc., and some support organization-defined classification levels. The result of the scan is an asset inventory of organization data sets. (A brief sketch of this kind of pattern matching follows the DLP component list.)
  • Monitoring is the most important function of a DLP system, which enables the security team to identify how data is being used and prevent inappropriate use. The choice of DLP tool should be made in light of its capability to monitor the platforms in use at an organization; for example, some legacy tools do not provide monitoring for emerging instant message tools like Slack. Another critical concern for monitoring is the placement of the DLP's monitoring capabilities. A network-based DLP monitoring traffic on an organization LAN is unable to monitor the actions of remote workers who are not always connected to the network and may not provide sufficient coverage to mitigate risk.
    • In-motion data monitoring is typically performed by a network-based DLP solution and is often deployed on a gateway device such as a proxy, firewall, or email server. Some DLP agents deployed on user workstations can perform in-motion monitoring as data leaves the particular machine, such as scanning the contents and attachment of an email before it is sent to identify a policy violation. This type of DLP must be placed in appropriate locations to be able to monitor unencrypted data; otherwise, users could create an encrypted tunnel, and the DLP will not be able to scan the data being sent. Workstation agent- or proxy-based tools can prevent this issue by scanning data before it is sent over an encrypted connection.
    • At-rest monitoring is performed on data in storage and is usually performed by an agent deployed on the storage device, though some network-based DLP can perform scans of storage locations with proper credentials. These can spot policy violations such as sensitive information stored outside of prescribed columns, for example, users entering credit card or PII in unencrypted notes/comments fields rather than fields where encryption, tokenization, or other controls have been applied. Compatibility is a particular concern for agent-based DLP solutions, as the organization's storage solutions must be supported for the DLP to be effective.
    • In-use monitoring is often referred to as endpoint or agent-based DLP, and it relies on software agents deployed on specific network endpoints. These are particularly useful for monitoring users interacting with data on their workstations or other devices and enforcing policy requirements on the use of those endpoints. Compatibility is a major concern for these agents as the DLP must support the devices and OSs in use. Most DLP solutions offer support for popular business operating systems, but support for some platforms such as the macOS and mobile operating systems like iOS and Android may be limited.
  • Enforcement: DLP applies rules based on the results of monitoring to enforce security policies. For example, a DLP agent running on a user's workstation can generate an alert or block the user from saving sensitive information to a removable USB drive, or from attaching the information to an email. A network-based DLP can look for information by analyzing traffic entering or leaving a network (like a virtual private cloud), or monitor information sent to and from a specific host. If the DLP detects sensitive data that is not being handled appropriately, such as unencrypted credit card information, it can either block the traffic or generate an alert. Alerts and enforcement actions taken by the DLP should be monitored and investigated as appropriate, and are a valuable source of security incident detection.
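A simplified sketch of the pattern matching that underpins DLP discovery and enforcement follows; the patterns and the internal-use marking string are illustrative assumptions, and commercial DLP products add validation logic, keyword context, and document fingerprinting to reduce false positives.

```python
import re

# Simplified detection patterns; production DLP adds checksums (e.g., Luhn),
# keyword context, and fingerprinting to reduce false positives.
PATTERNS = {
    "US SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Credit card (16 digit)": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "Internal marking": re.compile(r"Confidential - for internal use only", re.IGNORECASE),
}

def scan_text(text: str) -> list[str]:
    """Return the names of any sensitive-data patterns found in the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

outbound_email = "Hi, my SSN is 123-45-6789 and my card is 4111 1111 1111 1111."
findings = scan_text(outbound_email)
if findings:
    print(f"Policy violation - blocked. Matched: {', '.join(findings)}")
```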

Deploying DLP in a cloud environment can be a particular challenge, especially in SaaS or PaaS service models where the organization lacks the ability to install software or where scanning activities such as DLP discovery are not permitted. There are many cloud-native DLP tools, and the CCSP must ensure the organization's system requirements are elicited clearly, particularly which operating systems and cloud environments must be supported. DLP can create operational overhead due to the time and resources needed to scan network traffic or the resources consumed on endpoints during monitoring. This impact should be considered as part of the cost-benefit analysis associated with the DLP solution deployment.

Data Obfuscation

Obfuscation is similar to data masking but is more often implemented when sensitive data needs to be used in a different situation. For example, obfuscation can remove or replace sensitive data elements when data from a live production system is copied for testing purposes. Testers are likely not authorized to view the data and can perform their jobs using synthetic data, which is similar to live data. Regulations often require the use of obfuscation or de-identification of data prior to its use for purposes outside of normal operations.

There are a number of ways to perform obfuscation on data, outlined next and sketched in code after the list. These methods are often implemented as part of database export functions or in data management programs like spreadsheets, and they may also be implemented in business applications when users need to access or use a data set without requiring access to all sensitive data elements.

  • Substitution works by swapping out some information for other data. This may be done randomly, or it may follow integrity rules if the data requires it. As an example, when substituting names in a data set, the algorithm may have a set of male and female names to choose from. If the data is to be analyzed, this gender information could be important, so gender-appropriate names will be required when substituting.
  • Shuffling involves moving data around. This can be done on an individual column; for example, a name like Chris would become Hrsci, though this is fairly easy to reverse engineer. More robust shuffling can be performed by shuffling individual data points between rows; for example, swapping Mary Jones’s and Bob Smith's purchase history information. Shuffled data still looks highly realistic, which is advantageous for testing but removes identifiability.
  • Value variance applies mathematical changes to primarily numerical data like dates, accounting or finance information, and other measurements. An algorithm applies a variance to each value, such as +/− $1,000. This can be useful for creating realistic-looking test data.
  • Deletion or nullification simply replaces the original data with null values. A similar practice is redaction, where a document's sensitive contents are simply blacked out. The resulting document may not be useful if too much of the data has been redacted; nullified data may be problematic for testing as zero values are often unrealistic.
  • Encryption may be used as a tool for obfuscation, but it is problematic. Encrypted data is not useful for research or testing purposes, since it cannot be read. This challenge has given rise to the field of homomorphic encryption, which is the ability to process encrypted data without first decrypting it. Once returned to the original data set, the processed data can be decrypted and reintegrated; however, homomorphic encryption is largely experimental and therefore not widely used.
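The following sketch applies two of the methods above, shuffling and value variance, to a small illustrative record set; the field names and the +/- $1,000 variance are assumptions for the example.

```python
import random

records = [
    {"name": "Mary Jones", "zip": "30301", "purchase_total": 1250.00},
    {"name": "Bob Smith",  "zip": "10001", "purchase_total": 480.00},
    {"name": "Ana Lopez",  "zip": "60601", "purchase_total": 2210.00},
]

# Shuffling: break the link between identity and behavior by reassigning
# purchase totals across rows.
totals = [r["purchase_total"] for r in records]
random.shuffle(totals)
for record, total in zip(records, totals):
    record["purchase_total"] = total

# Value variance: apply a random +/- $1,000 adjustment so figures look
# realistic without revealing real amounts.
for record in records:
    record["purchase_total"] = round(record["purchase_total"] + random.uniform(-1000, 1000), 2)

print(records)
```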

Obfuscating and de-identifying data often incorporates rules to ensure the output data remains realistic. For example, credit card numbers conform to known patterns—usually 16 digits broken into groups of four—so replacing real credit card information must use only numerals. Some other information may also need to be verifiable against external sources. For example, postal codes and telephone area codes indicate geographic locations, so modifying a data set with address, post code, and telephone numbers may need to conform to the rules of these numbering systems or the relationships that data implies. If your business does not serve customers outside the continental United States, then postal codes for Canada, Alaska, and Hawaii should not be allowed when performing substitutions.

A process known as pseudo-anonymization or pseudonymization (often spelled pseudonymisation due to its presence in the EU GDPR and the use of British English spellings by European translators) obfuscates data with the specific goal of reversing the obfuscation later. This is often done to minimize the risk of a data breach and is performed by the data owner or controller prior to sending data to a processor. For example, if an organization wants to store large volumes of customer purchase orders in a cloud service, where storage is cheaper and more durable, it could remove PII and replace it with an index value prior to upload. When the data is retrieved, the index value can be looked up against an internal database, and the proper PII inserted into the records. The cost of storing that PII index would be small due to the lower volume of data, and the risk of a cloud data breach is also minimized by avoiding storage of PII in the cloud.
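A minimal sketch of this pseudonymization workflow: direct identifiers are swapped for an index value before records leave the organization, and the index-to-PII mapping stays in an internal store.

```python
import uuid

internal_index: dict[str, dict] = {}   # remains under the data controller's control

def pseudonymize(order: dict) -> dict:
    """Strip direct identifiers and replace them with an opaque index value."""
    index = str(uuid.uuid4())
    internal_index[index] = {"name": order["name"], "email": order["email"]}
    return {"customer_ref": index, "items": order["items"], "total": order["total"]}

def reidentify(record: dict) -> dict:
    """Look the index back up internally to restore the PII when needed."""
    return {**record, **internal_index[record["customer_ref"]]}

order = {"name": "Mary Jones", "email": "mary@example.com",
         "items": ["widget"], "total": 19.99}
cloud_copy = pseudonymize(order)     # safe to store in cheaper cloud storage
print(cloud_copy)
print(reidentify(cloud_copy))        # full record reconstructed internally
```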

Data De-identification

De-identifying data is primarily used when the data contains PII, known as direct identifiers, or contains information that could be combined with other data to uniquely identify an individual, known as indirect identifiers. Direct identifiers consist of information such as name and financial account numbers, while indirect identifiers are often demographic information such as age or personal behavior details such as shopping history. Removing these identifiers makes data anonymous rather than identifiable; hence, this process is often known as anonymization. This differs from pseudonymization, where the goal is to allow later re-identification of the data; anonymization is designed to be permanent.

Most privacy regulations require data anonymization or de-identification for any PII use outside of live production environments. For example, U.S. healthcare entities regulated by the Health Insurance Portability and Accountability Act (HIPAA) are required to de-identify medical records information when it is to be used for anything other than patient treatment, such as research studies. Details that could be used to specifically identify a patient must be removed or substituted, such as full name, geographic location information, payment information, dates more specific than the year, email address, health plan information, etc.

Removing indirect identifiers can be more of a challenge, starting with identifying what information could be uniquely identifiable. Trend information such as frequently browsed product categories at an online shopping site can be cross-referenced to a user's social media posts to identify them with relative certainty. If an online retailer removed all direct identifiers and published their sales trend data, it might still be possible to uniquely identify users. Combining multiple obfuscation and anonymization techniques can be useful to combat this threat, such as deleting names, substituting postal codes, and shuffling rows so that purchase history and location are no longer linked.
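A brief sketch of combining de-identification techniques on a hypothetical patient record: direct identifiers are dropped, the date of birth is generalized to a year, and the postal code is truncated. The specific rules an organization must apply depend on the regulation in question.

```python
def deidentify(patient: dict) -> dict:
    """Remove direct identifiers and generalize indirect ones."""
    return {
        # direct identifiers (name, email) are dropped entirely
        "year_of_birth": patient["date_of_birth"][:4],   # keep only the year (assumes YYYY-MM-DD)
        "region": patient["postal_code"][:3] + "XX",     # truncate the postal code
        "diagnosis": patient["diagnosis"],               # retained for research use
    }

patient = {"name": "Bob Smith", "email": "bob@example.com",
           "date_of_birth": "1980-07-14", "postal_code": "30301",
           "diagnosis": "hypertension"}
print(deidentify(patient))
```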

IMPLEMENT DATA DISCOVERY

Data discovery has two primary meanings in the context of information security. Discovery of data stored in your environment is the purview of DLP solutions, which helps build an inventory of critical data assets your organization needs to protect. This is not the same as eDiscovery, which deals with collecting evidence in legal situations, but utilizes many of the same principles.

Discovering trends and valuable intelligence within data is the second meaning, and one that is less a dedicated concern of the security department. Analyzing data to uncover trends or make predictions, such as what products are likely to be in-demand or what a given customer might be interested in shopping for, can drive meaningful business improvements like ensuring adequate inventory of goods. In these cases, security is not a primary concern of data discovery, but the business intelligence (BI) is itself a valuable intellectual property (IP) data asset, which requires security commensurate with its value. Supporting elements such as algorithms or proprietary data models used for analysis may also be valuable IP and should be included in the organization's security risk management program.

Many security tools, especially those monitoring large-scale deployments of user workstations, servers, and cloud applications, make use of data discovery. Analysis tools can be used to drive security operations by identifying suspicious events that require investigation, such as system vulnerabilities, misconfigurations, intrusion attempts, or suspicious network behavior. This data, comprising details of an organization's vulnerabilities, is obviously an asset worth protecting as it would be useful to an attacker. It is important for a security practitioner to understand both the value of these tools in supporting security operations as well as how to adequately protect them.

There are a number of important terms associated with data discovery and BI, including the following:

  • Data lake and data warehouse: These terms are similar but not the same. Both are designed to consolidate large amounts of data, often from disparate sources inside or outside a company, with the goal of supporting BI and analysis efforts. A lake is an unstructured data storage mechanism with data often stored in files or blobs, while a warehouse is structured storage in which data has been normalized to fit a defined data model.
    • Normalization is the process of taking data with different formats—for example, one system that stores MM-DD-YYYY and another that uses YYYY-MM-DD—and converting it to a common format. This is often known as extract, transform, load (ETL), as the data is extracted from sources like databases or apps, transformed to meet the warehouse's data model, and loaded into warehouse storage. Normalizing data improves searchability; a brief sketch of date normalization appears after this list.
  • Data mart: A data mart contains data that has been warehoused, analyzed, and made available for specific use such as by a particular business unit. Data marts typically support a specific business function by proactively gathering data needed and performing analysis and are often used to drive reporting and decision-making.
  • Data mining: Mining data involves discovering, analyzing, and extracting patterns in data. These patterns are valuable in some way, much like minerals mined from the ground; they may support business decision-making or enhance human abilities to identify important trends, patterns, and knowledge from large sets of data. Even small organizations with only a few users and systems generate enormous volumes of security log data, and data mining tools can be useful for isolating suspicious events from normal traffic.
  • Online analytic processing (OLAP): As the name implies, OLAP provides users with analytic processing capabilities for a data source. OLAP consists of consolidation, drill-down, and slice-and-dice functions. Consolidation gathers multidimensional data sets into cubes, such as sales by region, time, and salesperson. Drill-down and slice-and-dice allow users to analyze subsets of the data cube, such as all sales by quarter across all regions or sales of a particular product across all salespeople. Security incidents that require forensic analysis often make use of OLAP to extract relevant information from log files.
  • ML/AI training data: Machine learning (ML) and artificial intelligence (AI) are emerging areas of data science. ML is concerned with improving computer algorithms by experience, such as asking a computer to identify photos of dogs and then having a human verify which photos actually contain dogs and which contain other animals instead. The computer algorithm learns and refines future searches; these algorithms can be used across a wide variety of applications such as filtering out unwanted emails by observing which messages users mark as spam, and they are widely implemented in security tools designed to learn and adapt to an organization's unique environment. AI is a field of computer science with the goal of designing computer systems capable of displaying intelligent thought or problem solving, though the term is often used to describe systems that simply mimic human tasks like playing strategy-based board games or operating autonomous vehicles. Both AI and ML require large sets of data to learn, and there are myriad security as well as privacy concerns, especially when the data sets include personal information like photos used for training ML facial recognition.
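
A minimal sketch of the transform step referenced in the normalization item above, assuming two hypothetical source systems that format dates differently; a production ETL pipeline would use a dedicated tool or framework rather than hand-written parsing.

```python
from datetime import datetime

# Hypothetical extracts from two source systems that format dates differently.
system_a = [{"order_id": 1, "order_date": "03-28-2024"}]   # MM-DD-YYYY
system_b = [{"order_id": 2, "order_date": "2024-03-29"}]   # YYYY-MM-DD

def transform(record, source_format):
    """Transform step: convert the source date format to the warehouse model (ISO 8601)."""
    parsed = datetime.strptime(record["order_date"], source_format)
    return {**record, "order_date": parsed.date().isoformat()}

warehouse = []  # Load step: in practice this would be a bulk insert into warehouse storage.
warehouse += [transform(r, "%m-%d-%Y") for r in system_a]
warehouse += [transform(r, "%Y-%m-%d") for r in system_b]
print(warehouse)  # All dates now share one searchable format: YYYY-MM-DD
```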

As a consumer of cloud services, a CCSP should be aware that their organization retains accountability for protecting data, including data discovery processes to identify and inventory sensitive data. CSPs may be subject to contractual obligations for implementing specific data protections, but they are not the data owners and are therefore not legally liable under most privacy and security laws for data breaches. An adequate inventory of sensitive data is a vital input to security risk assessment and mitigation, so it is essential that a CCSP recognize the capabilities of data discovery and utilize them to support their organization's security risk management process.

Structured Data

Structured data refers to data that has been formatted in a consistent way. This often takes the form of a database where all records conform to a known structure: data is separated into columns, and each row contains the same type of information in the same place. Standalone data may also be structured using a markup or data-interchange format such as XML or JSON, which provides context around data through tags or keys, like <AcctNumber>123456</AcctNumber>.

Data discovery is simplified with structured data, as the process only needs to understand the data's context and attributes to identify where sensitive data exists, such as PII, healthcare data, transaction information, etc. Columns and data attributes are typically named in a self-explanatory way, simplifying the identification of sensitive or useful data. Many security information and event management (SIEM) tools also provide functionality to ingest data from multiple sources and apply structure to the data, which facilitates the process of analysis across disparate sources. As an example, the SIEM tool might normalize log data from multiple operating systems to include human-readable usernames in log files, rather than a numeric global user ID value.

Structured data is often accompanied by a description of its format known as a data model or schema, which is an abstract view of the data's format in a system. Data structured as elements, rows, or tuples (particularly in relational databases) is given context by the model or schema. For example, defining a particular string as a user ID can be achieved using tags defined in a schema. Understanding the relationship of a user belonging to a particular business unit can be achieved with data in a particular column; for example, the user's business unit designation appears in the “Bus. Unit” column. These relationships and context can be used to conduct complex analysis, such as querying to see all failed login attempts for users from a specific business unit.
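
The following sketch illustrates why structure simplifies discovery and analysis: because every hypothetical log record follows the same schema, finding all failed logins for a given business unit is a simple filter. The field names and values are illustrative only.

```python
# Hypothetical, already-structured log records; the schema gives each field its meaning.
logins = [
    {"user_id": "asmith", "bus_unit": "EMEA", "event": "login", "result": "failure"},
    {"user_id": "jdoe",   "bus_unit": "APAC", "event": "login", "result": "failure"},
    {"user_id": "asmith", "bus_unit": "EMEA", "event": "login", "result": "success"},
]

# Because every record shares the same structure, a complex question becomes a simple filter:
failed_emea_logins = [
    r for r in logins
    if r["event"] == "login" and r["result"] == "failure" and r["bus_unit"] == "EMEA"
]
print(failed_emea_logins)
```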

Metadata, or data that describes data, is a critical part of discovery in structured data. Semantics, or the meaning of data, is described in the schema or data model and can be useful when analyzing the relationships expressed in data. A particular user record, for example, may contain a tag <BusUnit>EMEA</BusUnit>, which identifies that the user belongs to the EMEA business unit and might be considered sensitive information as it provides some location information for that user. Similarly, column names can also be used to identify specific types of regulated data, such as credit card numbers, which require specific protections.

Unstructured Data

Structured data simplifies the process of analysis by providing context and semantics, which speed up discovery and analysis. Unfortunately, not all data is structured—human beings tend to create data in a variety of formats like documents or photos containing infinite types and configurations of data. Unstructured data refers to information stored without following a common format. For example, credit card data may be stored in tables or as strings inside a word processing document or in a spreadsheet with user-defined columns like CC, Card Number, PAN, etc. The variety of unstructured data makes it harder to identify and analyze, but it is nonetheless valuable and therefore requires protection.

Applying data labels is one approach to dealing with unstructured data. Labels can identify the classification level of a particular file and, by extension, the protections required. Files or other objects stored in a particular system may have a label applied by virtue of being stored in that system; for example, all documents stored in a “Restricted” fileshare are also given a “Restricted” classification. Labels may also be applied individually via metadata in a file management system or inside documents as a header, footer, or watermark. Security tools such as DLP should be able to detect these unstructured files based on the labels and take appropriate actions, such as blocking files with “Restricted” in the footer from being sent as email attachments. Note that this is an imperfect approach, as users can often delete this data or use incorrect templates and thereby mislabel the data.

Another approach to unstructured data discovery is content analysis, which requires a great deal of resources to parse all data in a storage location and identify sensitive information. Analysis can be performed using one of several methods, such as the following:

  • Pattern matching, which compares data to known formats such as credit card numbers that are 16 numeric digits, or unique organization-defined patterns such as user account information like “j.smith.” Patterns are typically defined using a regular expression, or regex, which allows for more powerful search capabilities by defining not just exact match conditions but flexible conditions as well. For example, searching for the literal string j.smith@example.com would return only exact matches of that email address, while a more flexible pattern such as j*.smith@example.com can be constructed to return both the j.smith and john.smith aliases (a minimal sketch using a true regular expression follows this list).
  • Lexical analysis attempts to understand meaning and context of data to discover sensitive information that may not conform to a specific pattern. This is useful to flag highly unstructured content like email or instant message communications where users may utilize alternate phrasing like “payment details” instead of “card number.” However, it is prone to false positives as linguistic meaning and context are quite complex.
  • Hashing attempts to identify known data such as system files or important organization documents by calculating a hash of files and comparing it to a known set of sensitive file hashes. This can be useful for documents that do not change frequently.
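
The pattern-matching approach described above can be sketched with Python's re module. The 16-digit test number and the example.com addresses are placeholders; real DLP tools layer validation (such as Luhn checks on card numbers) and contextual analysis on top of raw patterns.

```python
import re

# Hypothetical message text; the addresses and card number are made up for illustration.
text = "Contact j.smith@example.com or john.smith@example.com, card 4111111111111111."

# Pattern matching: 16 consecutive digits as a crude primary account number check.
card_pattern = re.compile(r"\b\d{16}\b")

# A flexible condition: match any alias beginning with 'j' and ending in 'smith'.
alias_pattern = re.compile(r"\bj[\w.]*smith@example\.com\b")

print(card_pattern.findall(text))   # ['4111111111111111']
print(alias_pattern.findall(text))  # ['j.smith@example.com', 'john.smith@example.com']
```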

One particular challenge for data discovery is the inclusion of unstructured data inside structured data sets. This often occurs as an unstructured text field in an otherwise structured database, like a free-form notes or comments field into which users can enter any information. Systems that support both types of data are also problematic, like ticketing systems with form fields and file attachments. In both scenarios, users are required to enter information into defined fields, but they may enter or upload anything in free-form text or file attachments. The result is a system with a wide variety of data at differing classification levels, including more sensitive data than originally planned. The organization's data discovery tool must be flexible enough to identify both types of data within the same system, which increases cost and complexity.

IMPLEMENT DATA CLASSIFICATION

Class has many meanings, but as it relates to data, it is a way to identify common attributes that drive protection requirements. These may include regulated categories such as PII, which is covered by a variety of legal and regulatory schemes like privacy legislation, or internal categories such as Business Confidential information, which offers your organization a competitive advantage and therefore should be highly protected.

Classification is the act of forming classes, or groups, by identifying these common attributes. The term categorization is often used, especially when discussing systems or large data sets, and describes the process of determining which class a system or data set belongs to by eliciting requirements for protecting the system's confidentiality, integrity, and availability. Each classification level should have an associated set of control expectations; for example, data classified as “Public” does not require encryption in transit, while data classified as “Internal Use Only” must be encrypted both at rest and in transit. These requirements are mitigations for the risk presented to the organization by the data or system, as described by the operational impact of a loss of confidentiality, integrity, and/or availability.

Data classification is a way for organizations to provide a uniform set of controls and expectations, as well as a method for speeding up security decision-making. Creating a classification scheme, such as Low, Moderate, and High, allows the organization to bundle security control expectations and simplify the process of determining required actions. When a new system is brought online, security practitioners do not need to perform exhaustive research to determine the security requirements they need to meet. The classification scheme provides a clear set of risk-based security controls and expectations designed to speed up the process.

Data classification levels and schemes are driven by multiple aspects of data. They may be prescribed by an outside entity such as a regulator or government agency or may be driven by purely internal risk management requirements. Here are some examples:

  • Data type: Different types of data are regulated by different rules, such as healthcare, sensitive PII, financial, educational, or legal. Data classification schemes based on data type often point to a set of external requirements that must be met if a system or data set includes the given data type.
  • Legal constraints: If data on EU citizens is collected by a company based in another country, that company may have to either implement privacy protection similar to the EU's GDPR or be based in a country with privacy laws recognized by the EU as equivalent to GDPR. Understanding the legal constraints attribute allows the organization to make decisions such as geolocation of application architecture.
  • Ownership: Many organizations utilize data that is shared by business partners, customers, or shared sources, all of which may impose requirements such as not sharing the data with third parties or securely destroying data after a specified retention period.
  • Value/criticality: Some data's value is incredibly context-specific. A database of contact details for restaurant suppliers is of little value to an IT services company, but that same database would be mission-critical to a company operating a chain of restaurants. The data classification scheme must take into account how valuable and critical data is to the organization, often by measuring the impact that a loss of the data would have on operations.

Because of the information provided in a data classification policy, it is often a foundational document for an organization's security program. Rather than specifying a long list of system-specific security requirements for each system individually, such as approved encryption or data retention schedules, a classification label provides a common vocabulary for communicating security needs in a consistent manner. Other information security policies should specify the appropriate controls for data or systems at various classification levels, like approved cryptographic modules, access control procedures, and data retention periods and destruction requirements.

Mapping

Data mapping comprises a number of activities in the overall practice of data science—the application of scientific methods and algorithms to identify knowledge or useful information from data sets. One particular practice related to data mapping is relevant to the role of a security practitioner: identifying the locations of data.

Identifying and mapping the location of data within the organization is a critical inventory task, which is in turn a critical input to risk assessments. Identifying what needs protecting—system and data assets belonging to the organization—and where they exist are crucial to ensure a security program is designed appropriately. Many DLP tools provide this functionality by scanning a network, domain, or other set of organization-controlled resources to identify data storage locations. This mapping may be further extended by identifying metadata such as asset ownership like a person, role, or organizational unit, which provides the organization with key information on responsibility and accountability for security processes.

Labeling

Once a data set or system has been classified, the classification level must be communicated in some way so that users, administrators, and other stakeholders know how to protect it. Again, the use of labels provides simplification. Rather than forcing users to memorize which systems they can print data from and which systems ban printing, users are instead instructed that systems labeled “Internal” allow printing, while systems labeled “Top Secret” do not allow printing.

Labeling data can be tricky, as we typically think of labels in physical terms but obviously are not able to stick a label on a collection of digital 0s and 1s. There are a variety of labeling methods for different types of assets, such as the following:

  • Hard-copy materials, primarily printed information on paper, which can be labeled with a printed watermark, stamps, or a physical container such as a folder or box. Hard-copy materials are the easiest to affix labels to because they are physical objects and do not change often.
  • Physical assets, including servers, workstations, disc drives, optical disks, and removable media, which can be physically labeled with a sticker or badge. These are somewhat tricky to label as the information on these devices can change quite easily. It can also be more challenging to identify and label found physical assets, as the user needs to have appropriate equipment to read the data on the asset; there may also be security issues around plugging in found media due to the possibility of introducing malware.
  • Digital files, which may come from common collaboration tools, databases, or other programs, and can often be labeled with metadata. This may include content inside the document such as a digital watermark or signature, or even text like a document footer with the classification level printed. File metadata such as the filename or document attributes stored in a database can also be used to apply a classification label.
  • Some complex or shared systems and data sets will have subcomponents that can be labeled, but the overall system cannot. In these cases, labeling of components along with supporting procedures, such as training and reference materials for users or a master organization-wide list of systems, can be useful to ensure users are aware of the protection requirements they must meet.

When it comes to labeling data sets, there are a number of best practices that facilitate the use of the classification level to ensure adequate protection is applied. The first is to ensure labels are appropriate to all types of media in use within a system; for example, if both digital files and hard-copy documents are to be used, a digital watermark that also appears when the document is printed helps ensure the data label is visible across all media states. Labels should be informative without disclosing too much—stamping a folder “Top Secret” makes it easy to recognize for both legitimate users and bad actors! Labels may include only an owner or asset number that can be used to determine the sensitivity of the data in the event it is lost. Finally, when media is found, it should be classified at the highest level supported by the organization until an examination proves otherwise, ensuring sensitive data is not disclosed.

The organization's DLP tool may be a crucial consumer of data labels, as many DLP tools allow organization-defined labels to be used when performing data identification and classification. If DLP is being used, labels should be applied in a consistent and accessible manner, such as text in the file identifying the classification or common filename conventions to facilitate the discovery process.
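
A minimal sketch of label-based detection, assuming a hypothetical filename convention and footer text; commercial DLP tools use vendor-specific policies and far richer matching logic.

```python
from pathlib import Path

# Hypothetical label markers; a real deployment would load these from DLP policy.
RESTRICTED_MARKERS = ("restricted", "internal use only")

def is_restricted(path: Path, body_text: str) -> bool:
    """Flag a file if its name or its embedded footer carries a restricted label."""
    name_hit = any(marker in path.name.lower() for marker in RESTRICTED_MARKERS)
    body_hit = any(marker in body_text.lower() for marker in RESTRICTED_MARKERS)
    return name_hit or body_hit

print(is_restricted(Path("2024-roadmap_RESTRICTED.docx"), "Q3 plans..."))          # True
print(is_restricted(Path("lunch-menu.txt"), "Classification: Internal Use Only"))  # True
print(is_restricted(Path("lunch-menu.txt"), "Soup of the day"))                    # False
```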

Sensitive Data

The organization's data classification policy should cover a number of requirements for handling data, many of which will be driven by external laws and regulations. These external obligations will typically provide guidance for handling sensitive classes of information such as the following:

  • Personally identifiable information (PII): Governed globally by privacy laws and often by laws or regulations specific to certain industries covering the collection, use, and handling of PII. Examples include the EU GDPR and Canada's Personal Information Protection and Electronic Documents Act (PIPEDA), which broadly cover PII, and the U.S. Gramm-Leach-Bliley Act (GLBA), which covers banking uses of PII.
  • Protected health information (PHI): Defined and governed primarily by the U.S. HIPAA, though personal health records are considered PII by most global privacy laws such as GDPR.
  • Cardholder data (the systems that store, process, or transmit it make up the cardholder data environment, or CDE): Defined and regulated by PCI DSS, which provides guidance on the handling, processing, and limited allowable storage of information related to credit and debit cards and transactions.

Data protection should be specified for all sensitive data discovered, and may be a mix of requirements defined in the various laws mentioned earlier as well as an organization's own risk-management goals. Some elements appropriate to specify in a data classification policy include the following:

  • Compliance requirements inherent at various classification levels: While this may be too complex for an average user, it ensures vital security requirements are not overlooked. As a best practice, points of contact who are skilled at managing sensitive data should be identified in the policy so users can seek assistance as needed.
  • Data retention and disposal requirements: Many data protection laws specify retention periods, such as requiring that customer records be held for as long as the customer is active and then for five years thereafter. Classification and retention policies and procedures should be tightly aligned and provide guidance on approved disposal or destruction methods for data that has reached the end of its retention period.
  • What is considered sensitive or regulated data: Some regulations include exceptions for a variety of circumstances, and an untrained individual may not fully understand the subtle nuances of when healthcare data is considered PHI or not, for example. The classification policy should provide clear guidance on what data is considered sensitive or regulated and explicitly state any exceptions that may apply.
  • Appropriate or approved uses of data: Many regulations provide explicit guidance on approved use and processing of data, frequently related to the intended purpose or consent given by the data subject. Classification policies must provide guidance on how to identify these approved uses, such as with explicit system instructions or a point of contact who can provide definitive guidance.
  • Access control and authorization: Controlling logical and physical access to assets is, along with encryption, one of the most powerful tools available to security practitioners. Classification can be used to determine access rights; for example, only users in the payments team are allowed to see plaintext payment card data to process customer transactions. This clearly identifies the need for obfuscation, tokenization, or other methods of blocking users on other teams from accessing the data.
  • Encryption needs: Encryption is a multipurpose tool for security and privacy, so the application and configuration of encryption must be clearly documented for users to ensure it is properly applied.

DESIGN AND IMPLEMENT INFORMATION RIGHTS MANAGEMENT

Since data is highly portable and there is great value in collaborating and sharing access, it is often necessary to ensure an organization's security controls can be extended to offer protection for the data wherever it might be. Data that is shared outside the organization will likely end up on information systems and transit networks not controlled by the data owner, so a portable method of enforcing access and use restrictions is needed: information rights management (IRM), sometimes also called digital rights management (DRM). There are two main categories of IRM.

  • Consumer-grade IRM is more frequently known as DRM and usually focuses on controlling the use, copying, and distribution of materials that are subject to copyright. Examples include music, videogame, and application files that may be locked for use by a specific (usually paying) user, and the DRM tool provides copy protections to prevent the user from distributing the material to other, nonpaying users.
  • Enterprise-grade IRM is most often associated with digital files and content such as images and documents. IRM systems enforce copy protection as well as usage restrictions, such as PDFs that can be read but prevent data from being copied or printed, and images that can be accessed for only a certain duration based on the license paid for. IRM can also be a form of access control, whereby users are granted access to a particular document based on their credentials.

IRM is often implemented to control access to data that is designed to be shared but not freely distributed. This can include sensitive business information shared with trusted partners but not the world at large, copyrighted material to which a user has bought access but is not authorized to share, and any information that has been shared under a license that stipulates limitations on the use or dissemination of that information.

Objectives

Most IRM solutions are designed to function using an access control list (ACL) for digital files, which specifies users and authorized actions such as reading, modifying, printing, or even onward sharing. Many popular file sharing SaaS platforms implement these concepts as sharing options, which allow the document owner to specify which users can view, edit, download, share, etc.
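
Conceptually, such sharing options reduce to an ACL mapping users to permitted actions, as in the sketch below; the user names are hypothetical, and real IRM platforms persist and enforce these policies centrally rather than in application code.

```python
# A minimal, illustrative ACL keyed by user; real IRM platforms store far richer policies.
document_acl = {
    "alice@example.com": {"view", "edit", "download", "share"},  # document owner
    "bob@example.com": {"view"},                                 # read-only collaborator
}

def is_authorized(user: str, action: str) -> bool:
    """Check the ACL before allowing an action on the protected document."""
    return action in document_acl.get(user, set())

print(is_authorized("bob@example.com", "view"))      # True
print(is_authorized("bob@example.com", "download"))  # False
print(is_authorized("eve@example.com", "view"))      # False (not on the ACL at all)
```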

IRM systems should ideally possess a number of attributes, including the following:

  • Persistence: The ACL and ability to enforce restrictions must follow the data. Some tools allow users to set a password required to open a document, but the tools also allow other users to disable this password-based access control, which defeats the purpose.
  • Dynamic policy control: The IRM solution must provide a way to update the restrictions, even after a document has been shared. If users no longer require access, the IRM solution must provide a way for the document owner to revoke the permission and enforce it the next time the document is opened regardless of its location. This leads to a key usability challenge, as IRM tools often require users to have an active network connection so the policy can be validated before access is granted.
  • Expiration: IRM tools are often used to enforce time-limited access to data as a form of access control, which reduces the amount of time a bad actor has to exploit a document to which they have gained unauthorized access. While this can be an element of dynamic policy control, which requires the ability to query an IRM server, it may also be done by calculating and monitoring a local time associated with the file. One example is the timer that begins when a user first starts playback of a rented digital movie and restricts the user's ability to play the movie to 24 hours after that initial start time (a minimal sketch of this check follows this list).
  • Continuous audit trail: Access control requires the ability to hold users accountable for access to and use of data. The IRM solution must ensure that protected documents generate an audit trail whenever users interact with them, supporting the access control goal of accountability.
  • Interoperability: Different organizations and users will have a variety of tools, such as email clients and servers, databases, and operating systems. IRM solutions must offer support for users across these different system types. Document-based IRM tools often utilize a local agent to enforce restrictions, so support for specific operating systems or applications is a critical consideration. System-based IRM tools, such as those integrated into document repositories or email systems, are capable of broad support despite the user's OS, but may offer limited support for user applications like browsers or email clients. Lastly, sharing documents across systems can be challenging, especially with outside organizations who may utilize different services such as Microsoft Office and Google Apps for collaboration tools.
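
As a toy illustration of the expiration and audit trail attributes referenced in the list above, the following sketch denies playback once a hypothetical 24-hour rental window has passed and records every access attempt; the user name and window length are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical rental scenario: playback is allowed for 24 hours after first start.
RENTAL_WINDOW = timedelta(hours=24)
audit_trail = []  # continuous audit trail: every access attempt is recorded

def request_playback(user: str, first_started_at: Optional[datetime]) -> bool:
    """Allow playback only inside the rental window, and log the attempt."""
    now = datetime.now(timezone.utc)
    allowed = first_started_at is None or (now - first_started_at) <= RENTAL_WINDOW
    audit_trail.append({"user": user, "time": now.isoformat(), "allowed": allowed})
    return allowed

# A request 25 hours after the first playback started is denied.
first_start = datetime.now(timezone.utc) - timedelta(hours=25)
print(request_playback("carol@example.com", first_start))  # False
print(audit_trail)
```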

IRM restrictions are typically provisioned by a data owner, whose responsibilities will vary depending on the access control model being used. In a discretionary access control (DAC) model, the owner is responsible for defining the restrictions on a per-document or data set basis. This may involve manual configuration of sharing for documents, specifying user authorizations for a database, or defining users and their specific rights for a data set. In nondiscretionary access control models such as mandatory access control (MAC), the owner is responsible for specifying metadata like a classification rating or a user role. The IRM system then utilizes this metadata to enforce access control decisions, such as allowing access to users with the same clearance level or denying users who are not assigned specific job roles.

Appropriate Tools

IRM tools comprise a variety of components necessary to provide policy enforcement and other attributes of the enforcement capability. This includes creation, issuance, storage, and revocation of certificates or tokens, which are used to identify authorized users and actions. This requires a centralized service for identity proofing and certificate issuance, as well as a store of revoked certificates that can be used to identify information access that is no longer authorized. This model is used for software distribution via app stores, where developers digitally sign code and user devices validate the signature each time the app is launched. This can ensure that the device is still authorized and the user is still the authentic license holder, but also offers the ability for the entity controlling the app store to prevent apps from running if their certificates have been revoked. Such a solution obviously requires network connectivity between devices and the centralized management system.

Both centralized and decentralized IRM solutions will require local storage for encryption keys, tokens, or digital certificates used to validate users and access authorizations. This local storage requires protection primarily for the integrity of data to prevent tampering with the material used to enforce IRM. Modifying these access credentials could lead to loss of access control over the IRM-protected data; for example, a user might modify the permissions granted to extend their access beyond what the data owner originally specified.

PLAN AND IMPLEMENT DATA RETENTION, DELETION, AND ARCHIVING POLICIES

Data follows a lifecycle starting with its creation and ending when it is no longer needed by the organization. There are many terms used to describe this end-of-life phase including disposal, retention, archiving, and deletion. Although sometimes used interchangeably, they are unique practices, and a CCSP must be aware of requirements that mandate the use of one specific option.

Data disposal is most often associated with the destruction or deletion of data, though the term may be used to mean disposition, which implies a change of location for data such as moving it from active production to a backup environment. While data is still required for use by the organization or must be held for a set period of time to meet a regulatory or compliance objective, the practice of data retention will be used. Once data is no longer needed by the organization and is not subject to any compliance requirements for retention, it must be deleted using tools and processes commensurate with its value.

Data archiving is a subset of retention typically focused on long-term storage of data not required for active processing or that has historical value and may therefore have high integrity requirements. Data retention, archive, and destruction policies are highly interconnected, and the required practices may even be documented in a single policy or set of procedures governing the use, long-term storage, and secure destruction of data.

Data Retention Policies

Data retention is driven by two primary objectives: operational needs (the data must be available to support the organization's operations) and compliance requirements, which are particularly applicable to sensitive regulated data such as PII, healthcare, and financial information. Many regulatory documents refer to data as records; hence, the term records retention is often used interchangeably. Data retention policies should define a number of key practices for the organization and need to balance organizational needs for availability, compliance, and operational objectives such as cost.

A CCSP should recognize several use cases for cloud services related to backups, which are often a key subset of data retention practices related to availability. Cloud backup can replace legacy backup solutions such as tape drives, which are sent to offsite storage. This offers the benefit of more frequent backup with lower overhead, as sending data over a network is typically cheaper than having a courier pick up physical drives, but the organization's network connection becomes a single point of failure.

Cloud backup can be architected as a mere data storage location, or a full set of cloud services can act as a hot site to take over processing in the event the on-premises environment fails. This scenario may be cost effective for organizations that are unable to use the cloud as a primary processing site but do not want to incur the costs of a full hot site. Temporary use of the cloud as a contingency helps to balance cost and security.

Cloud services, particularly SaaS and PaaS deployments, may offer intrinsic data retention features. Most cloud storage services provide high availability or high durability for data written into the environment, which allows the organization to retain vital data to meet operational needs. Some services also offer compliance-focused retention features designed to identify data or records stored and ensure compliance obligations are met. In all cases, the CCSP needs to be aware of the features available and ensure their organization properly architects or configures the services to meet internal and compliance obligations.

Storage Costs and Access Requirements

Data retention has associated storage costs, which must be balanced against the requirements of speed to access it. In many cases, older data, such as old emails or archived documents, is accessed less frequently. This data may not be accessed at all for routine business matters, but is only needed for exceptional reasons like archival research or a legal action where it is acceptable to wait a few hours for a file to be retrieved. Organizations may also have compliance or regulatory obligations to retain data long after it is no longer useful for daily operations. The costs associated with this storage can be significant, so CSPs offer a variety of storage services that balance cost and retrieval speeds; typically the solutions offer a combination of either low price/retrieval speed or higher price and quick retrieval. As an example, Amazon Simple Storage Service (S3) offers higher-priced S3 Standard, where retrieval is in real time, or lower-priced S3 Glacier, where retrieval time ranges from 1 minute to 12 hours. Similarly, Microsoft's Azure Blob Storage offers Hot, Cool, and Archive tiers, in order of higher cost/retrieval speed to lower cost/speed.

To model this cost-benefit analysis, consider Alice's Blob Cloud (ABC), which offers the following storage service levels (prices are in USD per gigabyte per month):

  • Rabbit Storage: $0.50, real-time access (<50 milliseconds)
    • Storing 5TB of data (5000GB) would cost $2,500/month or $30,000/year.
  • Turtle Storage: $0.005, access times from 1 to 12 hours
    • Storing 5TB of data (5000GB) would cost $25/month or $300/year.

The cost savings of using Turtle are significant: $29,700 per year in this limited example. Most organizations will generate significantly more than 5TB of data. If the data is used infrequently, such as once a quarter, the data retention policy should specify appropriate storage options to balance costs with the access speed requirements. This may be part of the records retention schedule (for example, all data that is older than 12 months is moved to Turtle) or as part of the data classification policy (for example, live production data is encrypted at rest and stored in Rabbit, while archived production data is encrypted and stored in Turtle).
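
The arithmetic above can be reproduced in a few lines of Python, which also makes it easy to re-run the comparison with an organization's own data volumes and tier prices; the figures below are the fictional ABC prices from this example.

```python
# Reproducing the Alice's Blob Cloud comparison; prices are USD per GB per month.
TIERS = {"Rabbit": 0.5, "Turtle": 0.005}
DATA_GB = 5000  # 5 TB

for tier, price_per_gb in TIERS.items():
    monthly = price_per_gb * DATA_GB
    print(f"{tier}: ${monthly:,.2f}/month, ${monthly * 12:,.2f}/year")

yearly_savings = (TIERS["Rabbit"] - TIERS["Turtle"]) * DATA_GB * 12
print(f"Savings from using Turtle: ${yearly_savings:,.2f}/year")  # $29,700.00
```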

Specified Legal and Regulatory Retention Periods

All organizations should retain data for as long as it is functionally useful; otherwise, the organization faces a loss of availability. This may be marked by obvious milestones such as the end of a customer relationship or project, after which the data is no longer valuable to the organization and can be deleted. Any organization dealing with regulated data like PII is also likely to have external obligations for data retention, which are often legally enforceable. Some examples of external regulations include the following:

  • HIPAA: Affects all U.S. residents and specifies a six-year retention period for documents, such as policies and procedures, relevant to the HIPAA compliance program. Retention of patient medical data is not directly mentioned in HIPAA but is specified by many state-level laws that require medical records be retained for as long as a patient is active and then for a set period of time thereafter.
  • EU GDPR: Affects data of EU citizens; it does not set a specific retention period but rather provides for indefinite retention so long as a data subject has given consent and the organization has a legitimate need for the data. If consent is revoked, the organization must act on the revocation by deleting, destroying, or anonymizing the data.

Data Retention Practices

The organization's data retention policy should specify what data is to be retained and why. Procedures should also be documented to specify how those retention goals will be met, including details regarding the following:

  • Schedules: Data and records retention often refers to a schedule of retention, which is the period of time the data must be held for. These are often included in data labels to enable the discovery of data that is no longer required and can be destroyed.
  • Integrity checking: Whenever data is copied, there is a chance something could go wrong, and data stored may fall victim to environmental issues like heat or humidity, which damage the storage media. Integrity checking procedures should be established to verify data when it is written and periodically thereafter to ensure it is readable and complete.
  • Retrieval procedures: Data may have different access requirements across different environments. Typically, there will be more users authorized to access data in a live production environment as it is needed to support operations, while data in an archive may only be needed by a more limited set of users, like the legal department when responding to a lawsuit or auditors reviewing regulatory compliance. Retrieval procedures should include proper authorization artifacts like approved access requests and enforce accountability for data usage.
  • Data formats: The format of data, including programs, apps, and hardware needed to read and write it, requires consideration. Over time, file formats and hardware change, so procedures such as virtualizing legacy environments, purchasing reserves of obsolete equipment, or converting data to new formats may be appropriate.

Data Security and Discovery

Retained data will face unique security challenges, particularly driven by the fact that it is long lived and may be difficult to access or manipulate as threat conditions change over time. For example, data encryption standards evolve quite frequently, so today's state-of-the art cryptography may be trivially easy to crack in a few years. It may not be technically or financially feasible to decrypt data retained in archives and re-encrypt using more robust cryptography. Security practitioners should consider defense-in-depth strategies such as highly secure key storage and tightly limited access control over archival data as a compensating control for weaker legacy encryption standards. Similarly, keys that are no longer actively used for encrypting data in production environments may need to be securely stored to grant access to archival data.

Data retention is often a requirement to support after-the-fact investigations, such as legal action and review by regulators. Data retention methods should support the ability to discover and extract data as needed to support these compliance obligations. For legal actions the requirements of eDiscovery must be considered. eDiscovery is covered in detail in Domain 6: Legal, Risk, and Compliance, but in short, it is the ability for data to be queried for evidence related to a specific legal action, such as all records during a certain time period when fraudulent activity is suspected.

Data Deletion Procedures and Mechanisms

When data is no longer needed for operational needs and has been retained for the mandated compliance period, it can be disposed of. It may, however, still be sensitive, such as the medical records of a patient who is no longer being treated at a particular facility but is still living and is legally entitled to privacy protections for their medical data. In this case, simply disposing of the data by selling old hard drives or dumping paper files into the trash would expose the organization and the patient to risk, so proper controls must be enforced to ensure the confidentiality of information remains intact during destruction.

NIST SP 800-88, Guidelines for Media Sanitization, is a widely available standard for how to securely remove data from information systems when no longer required. It defines three categories of deletion actions for various types of media to achieve defensible destruction—the steps required to prove that adequate care was given to prevent a breach of data confidentiality. These categories, in hierarchical order based on protection they provide, are as follows:

  • Clear: The use of tools to remove or sanitize data from user-addressable storage. Clearing may include standard operating system functions like deleting data from a trash can/recycle bin, which merely renders the data invisible but recoverable using commonly available tools. These options are typically the fastest and lowest cost but are inappropriate for very sensitive data.
  • Purge: The use of specialized tools like overwriting drives with dummy data, physical state changes such as magnetic degaussing, or built-in, hardware-based data sanitization functions designed to provide secure destruction of data. Purged media can typically be reused, which may be a cost factor to consider, but the time required to perform purge actions like writing 35 passes of dummy data over a modern high-capacity hard drive might make it infeasible. Purging may also shorten the lifespan of media to such an extent that its remaining useful life is negligible. Data may be recoverable from purged media only with highly specialized tools and laboratory techniques, so purging may be appropriate for moderate-risk data where no determined attacker with adequate means is expected.
    • Cryptographic erasure, or cryptoshredding, is a form of purging that utilizes encryption and the secure destruction of the cryptographic key to render data unreadable. This is effectively a deliberate, self-inflicted denial of service and is often the only option available for cloud-based environments due to loss of physical control over storage media, the use of SSDs that cannot be reliably overwritten, and the dispersion of data in cloud environments. Organizations utilizing the cloud can encrypt all data using organization-controlled keys, which can be securely destroyed, rendering data stored in the cloud economically infeasible to recover (a minimal sketch follows this list). Modern smartphone, tablet, and workstation operating systems also implement this feature using technologies such as Apple FileVault or Microsoft BitLocker, which saves the cost of purging and extends the useful life of storage media.
  • Destroy: The most drastic option that renders storage media physically unusable and data recovery infeasible using any known methods. Destruction techniques include physical acts like disintegrating, pulverizing, melting, incinerating, and shredding. It is unlikely a CSP will provide the capability for cloud consumers to physically destroy media, but this may be an appropriate control for the CSP to implement for information system components that contain sensitive customer data but that are no longer needed.
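
A minimal sketch of cryptographic erasure using the third-party Python cryptography package: data is encrypted under an organization-controlled key, and destroying that key renders the stored ciphertext unrecoverable. The record content is hypothetical, and a real deployment would manage keys in a key management system or HSM rather than a local variable.

```python
from cryptography.fernet import Fernet, InvalidToken

# Encrypt everything written to cloud storage under an organization-controlled key.
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(b"patient record 1138")  # hypothetical record

# Cryptographic erasure: securely destroy the key rather than the (unreachable) media.
key = None  # in practice, the key is purged from the key management system or HSM

# Without the original key, the stored ciphertext is infeasible to recover.
try:
    Fernet(Fernet.generate_key()).decrypt(ciphertext)  # any other key fails
except InvalidToken:
    print("Data is unrecoverable without the destroyed key.")
```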

The choice of data deletion procedures should be driven by a cost-benefit analysis. Costs, including replacement of the media, potential fines or legal settlements if a data breach occurs, and the actual implementation of destruction, must all be taken into account. As an example, hard drives containing high-sensitivity information may simply be cleared if they are to be reused in the same environment where the risk of a data breach is low, but may be physically destroyed if they are leaving the organization's control. The full NIST SP 800-88 document covering data destruction and associated risk factors can be found here: csrc.nist.gov/publications/detail/sp/800-88/rev-1/final.

Data Archiving Procedures and Mechanisms

Data archiving refers to placing data in long-term storage for a variety of purposes: optimizing storage resources in live production environments and meeting the organization's retention requirements are both examples. The procedures and mechanisms in place need to ensure adequate security controls are in place for data as it moves from live systems to the archive, which may implement significantly different controls for access control and cryptography.

Access controls for production environments are typically more complex than for archive environments, but that does not mean the archive deserves less rigor. Archivists or retention specialists may be the only users authorized to routinely access data in the archive, and procedures should be in place to request, approve, and monitor access to the data. Procedures governing the handoff between production and the archive should be documented as well, to ensure the change of responsibility is well understood by the personnel assigned.

Cryptography serves multiple important roles for archival data just as it does in other environments. Data in transit to an archive will need to be protected for both confidentiality and integrity; for cloud-based systems or backup tools, this will entail encrypting data in transit to preserve confidentiality as the data moves between on-premises and cloud environments, and verifying the integrity of data copied to the cloud environment.

Hashing may be appropriate for data with high integrity requirements. Data can be hashed at a later date when accessed from the archive, and the values compared to an initial hash from when the data was first stored. This will identify if changes have occurred. In some cases, integrity can also be verified by loading backup data into a production-like environment to verify it is readable and conforms to expectation. This would be particularly appropriate for a cloud hot site where failover needs to happen quickly.
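
A minimal sketch of that hash-and-compare approach using Python's hashlib; the filename and contents are stand-ins, and in practice the recorded digests would themselves be stored with integrity protections separate from the archive.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

archive_file = Path("backup-2024-q1.tar")          # hypothetical archived file
archive_file.write_bytes(b"archived records ...")  # stand-in content for the sketch

stored_digest = sha256_of(archive_file)  # recorded when the data enters the archive

# Years later, on retrieval: recompute and compare to detect any change.
if sha256_of(archive_file) == stored_digest:
    print("Integrity verified: archive contents are unchanged.")
else:
    print("Integrity failure: investigate possible tampering or media degradation.")
```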

In addition to mandated retention periods, security practitioners must understand the data format requirements imposed by their legal and regulatory obligations. For high-sensitivity data, particularly in the financial services industry, there may be a requirement for data to be stored immutably; that is, in a format where it cannot be changed. The integrity of the data is required to support investigations of regulated activity such as financial transactions, which will drive decisions on where and how to store the data. Once data is written, it will require adequate physical and environmental protections as well, to prevent theft, tampering, or degradation.

Write once, read many (WORM) media is one example of high-integrity media: data written to the media is physically unalterable, preventing a user from covering up evidence of fraudulent activity. Some cloud storage services implement similar write protections for integrity along with highly durable storage media; a CCSP should ensure proper storage solutions are chosen to meet the organization's need. Blockchain technology is also being used for verifiable integrity, as blockchains rely on hashing to defensibly prove data is legitimate. An organization's storage solution might be integrated with a blockchain ledger to prove data has not been modified since it was written as a way to prove the integrity of the data.

Legal Hold

Legal hold is a simple concept but has important implications for data retention. When data is placed under legal hold, its retention schedule is indefinitely suspended; if it should be retained for seven years but is placed under legal hold, it must be retained until the legal hold is lifted, even if the seven-year retention period passes. Determining legal hold is usually not within the purview of the CCSP, but they should be able to respond to such requests when the organization is involved in legal action such as a lawsuit.

The primary challenges surrounding legal hold include identifying applicable records to be held and implementing a way to exclude records from standard deletion procedures while under hold. Legal requests may be vague or deliberately broad, as they may relate to suspected wrongdoing and seek to obtain evidence; this leads to problems for data archivists and security practitioners when determining which records to place under hold. Internal legal counsel should always be consulted, and in general, it is better to retain more rather than less.

The second challenge of excluding records under legal hold from deletion is a data management problem. Hard-copy records under hold can be easily separated from other records by moving them to a secure facility and ignoring them when performing deletion, but electronic information systems are more complex. Copying records to a separate storage solution may be feasible but introduces challenges of preserving integrity during the process as well as the need to set up an access-controlled legal hold archive location. Another approach is the use of metadata to electronically flag records to exclude them from deletion. Many databases and filesystems support this functionality, but the archivist or security practitioner must also ensure supporting systems such as cryptographic key management are aware of records under legal hold. Otherwise, encrypted records may be retained because they are flagged, but the key needed to decrypt them may be deleted when its retention period expires.

DESIGN AND IMPLEMENT AUDITABILITY, TRACEABILITY, AND ACCOUNTABILITY OF DATA EVENTS

All the topics discussed thus far fall into broad categories of data governance and data management. Just as with any other valuable asset, an organization must identify strategic goals and requirements for managing, using, and protecting data. Despite best efforts to safeguard these assets, there will be times when it is necessary to collect evidence and investigate activity that could support assertions of wrongdoing. Gathering this digital evidence requires the existence of relevant data in logs or other sources, as well as capabilities to identify, collect, preserve, and analyze the data. An organization that has not prepared for these activities will find them difficult or impossible, and security practitioners should ensure their organizations are not put in situations where they are unable to investigate and hold bad actors accountable.

Definition of Event Sources and Requirement of Identity Attribution

There are formal definitions of events used in IT service management as well as security incident response. In general, an event is any observable action that occurs, such as a user logging in to a system or a new virtual server being deployed. Most IT systems support some ability to capture these events as running logs of past activity, though cloud-based systems can present a number of challenges that a CCSP should be aware of.

The primary concern regarding information event sources in cloud services is the accessibility of the data, which will vary by the cloud service model in use. IaaS will obviously offer the most data events as the consumer has access to detailed logs from network, OS, and application data sources, while in SaaS, the consumer may be limited to only data events related to their application front end with no access to infrastructure, OS, or network logs. This loss of control is one cost factor organizations should weigh against benefits of migrating to the cloud and may be partially offset with contract requirements for expanded data access in exceptional circumstances like a data breach.

The Open Web Application Security Project® (OWASP) publishes a series of guides called cheat sheets, including one on logging. The guide covers a variety of scenarios and best practices and also highlights challenges that a CCSP is likely to face when using cloud services, particularly the inability to control, monitor, or extract data from resources outside the organization's direct control. Some of these scenarios include the following:

  • Synchronize time across all servers and devices: The timestamp of events in a log is crucial to establish a chain of activity performed by a particular user. If a cloud service is hosted in, and generates logs using, a time zone different from that of the user's workstation, it will require significant mental overhead to trace events.
  • Differing classification schemes: Different apps and platforms will categorize events with different metadata. For example, one OS may identify user logins as “Informational” events, while apps running on that OS log the same activity as “User Events.” Constructing queries on such varied data will be difficult.
  • Identity attribution: Ultimately logs should be able to answer the basic question “Who did what and when?” Sufficient user ID attribution needs to be easily identifiable; otherwise, it may prove impossible to definitively state that a particular user took a given action. The organization's identity and access management system may offer the ability to utilize a single identity across multiple services, but many cloud services enforce their own user management, complicating the attribution of events to a specific user.
  • Application-specific logs: Apps may offer the ability to generate logs, but unlike widely standardized tools like operating systems or web server software, they may log information in a unique format. SaaS platforms may offer even fewer configuration options as internal functionality such as event logging configuration is not exposed to the end users. Making sense of this data in relation to other sources, such as an app that uses a numeric ID value for users while others use an email address, can be challenging.
  • Integrity of log files: App and OS log files typically reside on the same hosts as the software generating the logs and so are susceptible to tampering by anyone with privileged access to the host. If a user has administrative permissions, they may be able to not only perform unauthorized actions but then cover up evidence by modifying or deleting log files.

The full Logging Cheat Sheet is available here: cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html.

Logging, Storage, and Analysis of Data Events

Logs are made valuable only by review; in other words, they are valuable only if the organization makes use of them to identify activity that is unauthorized or compromising. Due to the sheer volume of log data generated, it is unlikely a human being would be capable of performing log reviews for any computing environment more complex than a single application.

SIEM tools, which are covered in detail in Domain 5: Cloud Security Operations, can help to solve some of these problems by offering these key features:

  • Log centralization and aggregation: Rather than leaving log data scattered around the environment on various hosts, the SIEM platform can gather log data from operating systems, applications, network appliances, user workstations, etc., providing a single location to support investigations. This is often referred to as forwarding logs when looked at from the perspective of individual hosts sending log files to the SIEM.
  • Data integrity: The SIEM platform should be on a separate host with access permissions unique from the rest of the environment, preventing any single user from tampering with data. System administrators may be able to tamper with logs on the systems they are authorized to access but should be denied write access to log data stored on the SIEM platform.
  • Normalization: The same piece of data can often be formatted in a variety of ways, such as dates written YYYY/MM/DD, DD/MM/YYYY, or even the Unix format, which is a count of seconds elapsed since January 1, 1970 (UTC). SIEMs can normalize incoming data to ensure the same data is presented consistently; for example, all timestamps are converted to UTC rather than the time zone of the originating system (a minimal sketch follows this list).
  • Automated or continuous monitoring: Sometimes referred to as correlation, SIEMs use algorithms to evaluate data and identify potential attacks or compromises. These can often be integrated with other security tools such as intrusion detection systems (IDS) and firewalls to correlate information, such as a large number of failed logins by users after the user visited a particular website. This is indicative that users may have fallen for a phish, and the attackers are now trying to use those credentials. Crucially, this monitoring can be automated and is performed continuously, cutting down the time to detect a compromise.
  • Alerting: SIEMs can automatically generate alerts such as emails or tickets when action is required based on analysis of incoming log data. Some also offer more advanced capabilities like IPS, which can take automated actions like suspending user accounts or blocking traffic to/from specific hosts if an attack is detected. Of course, this automated functionality will suffer from false positives, but it performs at much greater speed compared to a human being alerted to and investigating an incident.
  • Investigative monitoring: When manual investigation is required, the SIEM should provide support capabilities such as querying log files, generating reports of historical activity, or even providing concise evidence of particular events that support an assertion of attack or wrongdoing; for example, a particular user's activity can be definitively tracked from logging on to accessing sensitive data to performing unauthorized actions like copying the data and sending it outside the organization.
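To make the normalization point concrete, the following Python sketch converts the three timestamp formats mentioned above into a single UTC representation. It is a simplified illustration of what a SIEM does automatically with built-in parsers; the accepted formats are only the examples discussed here, and naive timestamps are assumed to already be in UTC, whereas a real SIEM would apply the originating system's known time zone offset.

from datetime import datetime, timezone

def normalize_timestamp(value: str) -> str:
    # Unix epoch time: seconds elapsed since January 1, 1970 (UTC)
    if value.isdigit():
        return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()
    # Try the date formats discussed above; assume the source logged in UTC
    for fmt in ("%Y/%m/%d %H:%M:%S", "%d/%m/%Y %H:%M:%S"):
        try:
            dt = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
            return dt.isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {value}")

# All three inputs represent the same moment and normalize identically
print(normalize_timestamp("2024/03/01 13:45:00"))
print(normalize_timestamp("01/03/2024 13:45:00"))
print(normalize_timestamp("1709300700"))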

Chain of Custody and Nonrepudiation

Data collected in support of an investigation has unique integrity requirements, particularly when civil or criminal prosecution may result. It is vital to establish a chain of custody: a defensible record of how evidence was handled and by whom, from its collection to its presentation as evidence. If data has been altered after collection, the defendant might argue that they are innocent because the evidence was altered to make them look guilty.

Chain of custody and evidence integrity do not imply that data has not changed since collection, but instead they provide convincing proof that it was not tampered with in a way that damages its reliability. For example, data on a user's workstation will physically change location when the workstation is collected from the user, and the data may be copied for forensic analysis. These actions, when performed by properly trained professionals, do not change the underlying facts being presented or the believability of evidence. However, if the workstation is left unattended on a desk after it is taken from the user, then the user can claim the data was altered in a way that incriminates them, and thus the evidence is no longer reliable.
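One common way to provide that convincing proof is to compute a cryptographic hash of the evidence at the moment of collection and record it in the chain-of-custody documentation. The Python sketch below shows the idea; the image filename and recorded hash are hypothetical placeholders, and in practice forensic imaging tools generate and document these values as part of an established procedure.

import hashlib

def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    # Read the file in chunks so even large forensic images can be hashed
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hash documented by the examiner when the workstation image was collected
recorded_hash = "..."  # placeholder for the value in the custody record

# Recomputing the hash later; a match shows the copy has not been altered
current_hash = sha256_of_file("workstation_image.dd")
print("Evidence intact" if current_hash == recorded_hash else "Integrity check failed")

A matching hash does not prove who handled the evidence, which is why the hash is recorded alongside the documented chain of custody rather than in place of it.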

Nonrepudiation can be a challenging concept because it involves proving a negative. To repudiate an action means to deny responsibility for it, so nonrepudiation is the property of being able to definitively hold a particular user accountable for a particular action. In nontechnical terms, it can be difficult to hold anyone accountable for drinking the last of the office coffee from a shared coffee maker; everyone can repudiate the assertion that they finished the coffee without brewing more. If users are required to enter a PIN to unlock the coffee maker, then it becomes possible to defensibly prove, when the coffeepot is found empty, that the last person whose code was used is the culprit.

Information systems enforce nonrepudiation partly through the inclusion of sufficient evidence in log files, including unique user identification and timestamps. System architecture and limitations can pose challenges, such as shared accounts that are tied to a group of users rather than a single user, as well as processes that act on behalf of users and thereby obscure their identity. Nonrepudiation is a concern not just for data event logging but also for access control and system architecture. Security practitioners must ensure their access control, logging, and monitoring functions support nonrepudiation; one supporting technique, making individual log records tamper-evident, is sketched below.
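As noted above, one supporting technique is to make individual log records tamper-evident so that a privileged user on the originating host cannot silently alter them. The Python sketch below attaches an HMAC to each record using a key held only by the central log service; the key value and record fields are illustrative assumptions, and key management is deliberately omitted.

import hashlib
import hmac
import json

# Illustrative key; in practice it would be held by the SIEM/log service,
# never by administrators of the hosts generating the logs.
LOG_SERVICE_KEY = b"example-key-held-by-the-log-service"

def sign_record(record: dict) -> dict:
    # Compute an HMAC-SHA256 tag over the canonically serialized record
    payload = json.dumps(record, sort_keys=True).encode()
    record["hmac"] = hmac.new(LOG_SERVICE_KEY, payload, hashlib.sha256).hexdigest()
    return record

def verify_record(record: dict) -> bool:
    # Recompute the tag and compare it to the stored one
    tag = record.pop("hmac")
    payload = json.dumps(record, sort_keys=True).encode()
    expected = hmac.new(LOG_SERVICE_KEY, payload, hashlib.sha256).hexdigest()
    record["hmac"] = tag
    return hmac.compare_digest(tag, expected)

entry = sign_record({"timestamp": "2024-03-01T13:45:00+00:00",
                     "user": "alice@example.com", "event": "file.delete"})
print(verify_record(entry))        # True: record is intact
entry["user"] = "someone-else"     # simulate tampering on the host
print(verify_record(entry))        # False: tampering is detected

Tamper-evident records strengthen nonrepudiation only when combined with unique user identities; an intact record attributed to a shared account still cannot pin an action on one person.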

One final note on chain of custody and nonrepudiation: the process of investigating digital crimes can be quite complex and may best be left to trained professionals with appropriate skills and experience. Simple acts like plugging in a drive can cause irreversible data loss or corruption, and any evidence destroyed in the process may limit the organization's ability to seek legal redress if a crime has occurred. While CCSPs should be familiar with these practices, it is also advisable to know when expert professionals are required.

SUMMARY

Cloud services offer many benefits for accessing, managing, and handling the data that is crucial to modern business operations, but they are not free from risk. The cloud data lifecycle provides a convenient framework for identifying the types of activities, risks, and appropriate security controls required to ensure data remains secure. This, coupled with an understanding of the storage architectures available in the different cloud service models, allows a CCSP to design and apply appropriate safeguards such as data discovery, encryption, and tokenization, as well as countermeasures like DLP. Proper cloud data security requires organizations to know what kind of data they handle and where it is stored, and to deploy adequate policies, procedures, and technical controls to ensure the business benefits of cloud environments are not offset by increased information security risks.
