Safeguarded Copy introduction and concepts
In this chapter, we introduce the Safeguarded Copy technology and its concepts. First we justify the need for logical corruption protection (LCP) and provide information about regulatory requirements. We explain the general concepts of LCP, and the use cases for recovery.
Then we focus on the new Safeguarded Copy (SGC) solution that the IBM DS8000 (starting with the DS8880) storage system provides to implement logical corruption protection for critical data. We list the design objectives and discuss how it fits into the overall data protection solution portfolio of the DS8000. We explain how it works, introduce some new terminology, and describe the backup and recovery processes in general. We finish with some considerations about integration into existing high availability/disaster recovery (HA/DR) solutions, backup isolation, and configuration changes.
This chapter covers the following topics:
 – What is logical corruption?
 – Regulatory requirements
 – Use cases for data protection
 – Requirements for logical corruption protection
 – Previous IBM DS8000 logical corruption protection
 – Objectives for Safeguarded Copy
 – Safeguarded Copy basic operations
 – Managing backups
 – Safeguarded Backup Capacity
 – Safeguarded Copy Backup
 – Safeguarded Copy recovery
 – Further considerations
1.1 The need to protect data
For many years, and with exponential growth, data has become one of the most important assets for most companies across all industries. Organizations are hugely affected if their data is lost or compromised. Data now is as important as any natural resource, which means that efficiently protecting data is essential for many businesses.
1.1.1 What is logical corruption?
Protection against all forms of corruption is becoming increasingly important because, besides hardware or software failures, corruption can be caused by inadvertent user error, malicious intent, or, for systems connected to the internet, cyber-attacks.
Recent incidents show that cyber-attacks are rapidly growing in number and sophistication. Every few months, there are headlines in the news or on the internet about attacks on enterprise data from ransomware, malware, insider threats, or other destruction of data. Two notable examples are the WannaCry ransomware attack that struck over 100 countries within 48 hours, and the Petya cyber-attack that spread rapidly and impacted systems in 65 countries. This is a significant concern for many organizations, and to the associated industry regulatory bodies.
Over the last decades, most organizations have concentrated on developing and implementing high availability (HA) and disaster recovery (DR) solutions to protect their enterprise data against hardware and software failures, or data center outages. Today, however, companies are increasingly concerned about accidental or intentional logical corruption.
In this context, logical means that all hardware components are working as expected, but data is destroyed or corrupted at the content level. This form of corruption can range from deletion to encryption to selective manipulation.
Logical corruption cannot be prevented with traditional HA/DR solutions, which are not content-aware. In fact, continuous replication solutions, such as DS8000 Metro Mirror or Global Mirror, which are often used for DR, would quickly propagate any content level corruption to all copies, because for the storage system, it would just be another I/O.
We need a paradigm shift from a pure availability mind-set to cyber resilience (CR). Cyber resilience is the ability of an organization to continue to operate with the least amount of disruption despite cyber attacks and outages. It expands the scope of protection, covering both cyber security and business continuity. A significant part of cyber resilience is the ability to recover from a logical data corruption event.
1.1.2 Regulatory requirements
For some industries, such as finance or health care, data must be protected in ways that conform to increasingly strict regulations.
The US Federal Financial Institutions Examination Council (FFIEC), for example, published a revised Business Continuity Planning Booklet, which is part of the FFIEC’s Information Technology Examination Handbook for the US financial industry.
In Appendix J, the FFIEC provides the following guidance:
“The financial institution should take steps to ensure that replicated backup data cannot be destroyed or corrupted in an attack on production data.”1
“...air-gapped data backup architecture limits exposure to a cyber-attack and allows for restoration of data to a point in time before the attack began.”1
Similar statements are made by the US National Association of Insurance Commissioners (NAIC) and from the European Banking Authority (EBA).
The NAIC states:
... It is vital for state insurance regulators to provide effective cyber-security guidance regarding the protection of the insurance sector’s data security and infrastructure.2
In 2017, the EBA wrote in the Guidelines on ICT Risk Assessment under the Supervisory Review and Evaluation process (SREP):
… solutions to protect critical internet activities or services (e.g. e-banking services), where necessary and appropriate, against denial of service and other cyber-attacks from the internet, aimed at preventing or disturbing access to these activities and services.3
For organizations in affected industries, such statements increase the demand to implement solutions that protect against logical corruption. The IT industry is called upon to help its clients design and implement solutions that meet these requirements.
1.1.3 Use cases for data protection
A logical corruption protection solution must be able to cover different use cases after a logical corruption has occurred, because the corruption event can be triggered by a wide range of causes: from application corruption, to inadvertent or malicious destruction of data by users, to ransomware attacks in which data is encrypted by an attacker.
Typical use cases for protection copies include the following options:
Validation
Regular data analysis enables early detection of a problem or reassurance that a given protection copy is uncorrupted. Performing corruption detection and data validation processes against a copy of data might be more practical than doing this in the live production environment.
Forensic analysis
If data corruption is detected, but systems are still operational, the first step is a forensic analysis. You determine what data is corrupted, when the corruption occurred, and which of the available protection copies is the last good one. You also decide whether you can fix the corruption from within the production environment or whether one of the following recovery methods is required.
Surgical recovery
You perform a surgical recovery if you recover only certain parts of the production data from a backup copy. It can be a fast and safe method if only a small portion of the production data is corrupted, and if consistency between current production data and the restored parts can be re-established. Another case for this kind of recovery might occur if the last known good backup copy is too old to restore the complete environment. It might then be desirable to leave most of the production volumes in their present state, and copy only replacement data to correct the corrupted parts.
Catastrophic recovery
If the corruption is extensive, or if the latest known good protection copy is recent enough, the easiest approach might be to restore the entire environment to a point in time that is known to be unaffected by the corruption.
Offline backup
Performing an offline backup of data from a consistent point-in-time copy can be used to build a second line of defense, providing a greater retention period and increased isolation and security.
 
Note: Because the offline backup use case is similar to regular backup methods, it is out of the scope of this book.
1.1.4 Requirements for logical corruption protection
As already stated, traditional HA/DR solutions cannot provide complete protection against content-level destruction of data. Different approaches to data protection are required to provide this kind of protection. The major design requirements for logical corruption protection include the following characteristics:
Granularity: We must be able to create many protection copies to minimize data loss in case of a corruption incident.
Isolation: The protection copies must be isolated from the active production data so that they cannot be corrupted by a compromised host system (also known as an air gap).
Immutability: The copies must be protected against unauthorized manipulation.
1.2 IBM DS8000 Safeguarded Copy
The DS8000 family of data storage systems provides a wide range of data protection capabilities, mostly based on the proven IBM FlashCopy® and Peer-to-Peer Remote Copy (PPRC) technologies. These technologies, however, were not designed for today’s logical corruption protection demands. The new Safeguarded Copy function was introduced to fill this gap.
1.2.1 Previous IBM DS8000 logical corruption protection capabilities
The DS8000 family of storage systems provided capabilities for logical corruption protection even before the introduction of Safeguarded Copy. In most cases where clients created LCP solutions, the FlashCopy technology was used to provide the protection copies.
Figure 1-1 shows a model configuration that can be used to explain the previous concepts of DS8000 Logical Corruption Protection.
Figure 1-1 Model configuration for DS8000 logical corruption protection
The active production data that needs to be protected resides on volumes H1. Multiple FlashCopies (F1a, F1b, ...) can be created for H1 to provide the recovery points. The following actions can be accomplished for each of the recovery points:
Restore to the original volumes H1 by using the FlashCopy reverse function.
Recover to recovery volumes R1 by creating a new, forward cascaded, FlashCopy relation.
 
Important: Theoretically, the FlashCopy targets that provide the recovery points could also be used directly by a host system. This, however, would potentially contradict the goal of protecting the protection copies from being corrupted themselves.
Restoring to the original volume H1 provides a fast way to restore to a certain point in time. In many cases, however, a more differentiated approach is necessary:
You need to determine which part and how much of your data became corrupted.
You have to find out which point in time is the latest one that is not corrupted yet.
You might have to develop a strategy to merge as much as possible of the current, uncorrupted data with the older backup data that replaces the corrupted parts, while maintaining consistency.
For these analytical steps, a recovery volume and an independent recovery system become useful. They allow a differentiated approach that minimizes the impact of the recovery (see “Use cases for data protection” on page 3 for more detail):
Validation Use the recovery volumes to perform regular analysis on the copies to provide early detection of a problem or reassurance that the copy is a good copy prior to further action.
Forensic analysis Use the recovery volumes to investigate the problem and determine what the recovery action is.
Surgical recovery Extract data from the copy on the recovery volumes and logically restore back to the production environment.
Catastrophic recovery Recover the entire environment to the point in time of the backup.
 
Tip: Starting with release 4.1, IBM Geographically Dispersed Parallel Sysplex™ (IBM GDPS) provides an integrated logical corruption protection function that is based on FlashCopy, using the concept provided above.
Because FlashCopy was originally designed with objectives other than logical corruption protection, it has some characteristics that are not ideal for this purpose:
It is limited to 12 relations per source volume. This allows for a maximum of 10 targets (point-in-time copies), if a concept (as shown in Figure 1-1 on page 5) is used.
FlashCopy targets are regular volumes. Therefore they can be accessed by hosts and potentially be modified. They also use DS8000 device numbers, and therefore reduce the number of possible host volumes.
FlashCopy relations and targets can be deleted by the DS8000 administrator.
Source volumes with multiple targets can suffer from a write performance impact, because all targets are maintained individually.
To overcome those limitations, a new feature called Safeguarded Copy (SGC) was introduced with IBM DS8880 release 8.5, specifically targeted at the requirements for logical corruption protection.
1.2.2 Objectives for Safeguarded Copy
Safeguarded Copy has the following objectives:
Allow creation of many recovery copies across multiple volumes or storage systems with optimized capacity usage and minimum performance impact (the current limit is 500).
Enable any previous recovery point to be made available on a set of recovery volumes while the production environment continues to run.
Secure the data for the Safeguarded Copies to prevent it from being accidentally or deliberately compromised.
Do not consume DS8000 device numbers or host device addresses (UCBs in mainframe environments) for the backup copies.
Safeguarded Copy does not replace FlashCopy, and both technologies remain relevant in logical corruption protection scenarios. FlashCopy provides an instantly accessible copy of a production volume, and with multiple FlashCopies, each copy is independent of the others from a data perspective.
1.2.3 Safeguarded Copy basic operations
Safeguarded Copy provides functionality to create multiple recovery points for a production volume (in this publication also referred to as Safeguarded source or just source). These recovery points are called Safeguarded Backups (also referred to as SG backups, or simply backups).
Unlike with FlashCopy, the recovery data is not stored in separate regular volumes, but in a storage space that is called Safeguarded Backup Capacity (SGBC). The backups are not directly accessible by a host. The data can only be used after a backup is recovered to a separate recovery volume.
Figure 1-2 illustrates the basic relations within a Safeguarded Backup configuration.
Figure 1-2 Safeguarded Copy basic relations
When a point in time has been recovered to the recovery volume, you can access it by using your recovery system. This system might or might not be identical to your production system, depending on your security requirements. If you want to restore the data from the recovery volume to a production volume, you can use Global Copy. The production volume can be located in the same DS8000 as the recovery volume or in a different one.
A production environment can consist of hundreds or thousands of volumes across one or more storage systems. One of the most important aspects of logical corruption protection is to provide recovery points that are consistent across all volumes that are part of a backup. We call such a recovery point a consistency group (CG).
 
Note: Because we create consistency within the storage systems, which are not content-aware, the maximum level we can achieve is crash consistency. This means that in case of a recovery, you might have to perform operating system or application recovery.
1.2.4 Managing backups
Safeguarded Copy Backups must be protected against unintentional or intentional tampering. Therefore, you cannot create, delete, or recover backups manually by using the DS8000 management interfaces. You need an instance of IBM Copy Services Manager (CSM) to perform these tasks. See “Hardware and software prerequisites” on page 16 for details about the management software requirements.
CSM uses a session concept to manage complete consistency groups. Figure 1-3 shows the basic constructs of a CSM Safeguarded Copy session.
Figure 1-3 A Safeguarded Copy session in CSM
Each session consists of a number of copy sets. There is one copy set for each production volume that you want to back up. Each copy set consists of the production or source volume itself, with its associated Safeguarded Backup Capacity, and a recovery volume. CSM always performs actions, such as backup and recovery, on the complete session. You can have CSM run these operations automatically by using the built-in scheduler.
CSM also manages the lifecycle of your backups. You can specify the retention period for backups, and CSM automatically expires (removes) those that are not required anymore. This makes the management easy and secure, and also ensures that consistency is maintained across the session. The various session operations are explained in detail in Chapter 3, “Safeguarded Copy implementation and management” on page 39.
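To make the session concept more tangible, the following Python sketch models the constructs that are described above: a session holds one copy set per production volume, and backup and expiration always act on the complete session. It is purely conceptual; the class names, attributes, and methods are assumptions for illustration and do not represent the CSM programming interface.

from dataclasses import dataclass, field
from typing import List

@dataclass
class CopySet:
    source_volume: str      # production (Safeguarded source) volume
    backup_capacity: str    # associated Safeguarded Backup Capacity
    recovery_volume: str    # volume that a backup can be recovered to

@dataclass
class SafeguardedSession:
    name: str
    retention_hours: int
    copy_sets: List[CopySet] = field(default_factory=list)

    def backup(self):
        # A backup always covers the complete session and forms one
        # consistency group across all copy sets.
        print(f"Backing up {len(self.copy_sets)} copy sets in session '{self.name}'")

    def expire_old_backups(self):
        # Backups older than the retention period are removed automatically.
        print(f"Expiring backups older than {self.retention_hours} hours")

session = SafeguardedSession("SGC_PROD", retention_hours=48)
session.copy_sets.append(CopySet("H1_1000", "SGBC_1000", "R1_2000"))
session.backup()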
Separation of the DS8000 administrator, who is able to change the logical configuration, and a backup administrator, who manages the backups with CSM, can help to further improve security.
1.2.5 Safeguarded Backup Capacity
Safeguarded Backup Capacity is always thin provisioned. It is a preferred practice to use small extents for best efficiency. Without any existing Safeguarded Backups, the Safeguarded Backup Capacity is purely virtual capacity that is associated with the source volume. Physical storage space is allocated as you create backups, and data that is overwritten in the original volume is saved in the backup. Backup data is saved with track granularity (64 KiB for fixed block, about 55.3 KiB for CKD storage), which leads to better space efficiency than with FlashCopy.
 
Note: Safeguarded Backup Capacity is allocated in extents, but data is stored with track granularity. At least one extent is allocated for each Safeguarded Backup.
You specify the maximum amount of Safeguarded Backup Capacity for each volume that you want to back up. When this limit is reached, the oldest backups are removed automatically to free up space. As long as any Safeguarded Backups exist for a given volume, you cannot delete its associated Safeguarded Backup Capacity. If a storage pool runs short on physical space, regardless of whether it is used for backup or production data, the DS8000 sends notifications according to the pool settings. See “Safeguarded Backup Capacity sizing” on page 17 for more detail about capacity management for Safeguarded Copy.
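To illustrate the kind of arithmetic involved, the following Python sketch gives a rough estimate of the Safeguarded Backup Capacity for a single volume, based on its change rate, backup frequency, and retention period. It is a simplified, hedged approximation; the function name and the per-backup overhead model are illustrative assumptions (the 16 MiB value is the fixed-block small-extent size), so use the sizing guidance referenced above for real capacity planning.

SMALL_EXTENT_MIB = 16   # fixed-block small-extent size

def estimate_sgbc_gib(volume_gib, daily_change_rate, backups_per_day, retention_days):
    """Rough upper-bound estimate of the backup capacity for one volume (GiB)."""
    # Data that is overwritten between two backups is saved once per backup interval.
    changed_gib_per_backup = volume_gib * daily_change_rate / backups_per_day
    backups_retained = backups_per_day * retention_days
    data_gib = changed_gib_per_backup * backups_retained
    # Each backup allocates at least one extent, even if little data has changed.
    overhead_gib = backups_retained * SMALL_EXTENT_MIB / 1024
    return data_gib + overhead_gib

# Example: 100 GiB volume, 10% daily change rate, 4 backups per day, 7-day retention
print(round(estimate_sgbc_gib(100, 0.10, 4, 7), 1), "GiB")   # about 70.4 GiB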
Extents that are allocated as Safeguarded Backup Capacity are not monitored by IBM Easy Tier®. After being written, Safeguarded Backup data normally is not accessed at all.
1.2.6 Safeguarded Copy Backup
When you initiate a Safeguarded Copy Backup, the DS8000 creates a consistency group. It sets up metadata and bitmaps to track updates to the production volume. After the backup is set up, the DS8000 copies data that is about to be overwritten by host I/O from the production volume to a consistency group log within the Safeguarded Backup Capacity.
When you initiate the next backup, the DS8000 closes the previous one and creates a new consistency group. Therefore, it does not have to maintain each backup individually. To restore to a certain recovery point, the DS8000 needs the backup to be recovered and all backups that are younger than it. Figure 1-4 illustrates the way Safeguarded Copy saves backup data.
Figure 1-4 Safeguarded Copy backups using consistency group logs
To minimize the impact of creating a consistency group, the Safeguarded Copy Backup process consists of three steps:
1. Reservation: In this step, the DS8000 prepares to create a new Safeguarded Backup. It sets up the required bitmaps and prepares the metadata in the Safeguarded Backup Capacity. It also makes sure that all changed data from the previous backup is stored in its consistency group log. After all preparations are done, the actual consistency group formation can take place.
2. Check in: To create a consistency group, the DS8000 has to stop all updates for all volumes within the CG for a short period. It does this by presenting an Extended Long Busy (ELB) state. When the data in cache is consistent, the previous consistency group logs of all affected volumes are closed, and therefore are also consistent. From this point on, the DS8000 writes further backup data into the consistency group logs of the new backup.
3. Completion: The DS8000 lifts the ELB, and write operations can continue.
IBM Copy Services Manager coordinates and performs these steps automatically and with minimal impact on host operations.
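The following Python sketch summarizes the three-step sequence in pseudo-code form. It is an illustration only: the class and method names are assumptions, and the real processing runs inside the DS8000 microcode under the control of CSM.

class Volume:
    """Stand-in for a DS8000 volume that has Safeguarded Backup Capacity."""
    def __init__(self, name):
        self.name = name

    def prepare_bitmaps(self):
        print(f"{self.name}: bitmaps and metadata prepared")

    def drain_previous_cg_log(self):
        print(f"{self.name}: changed data from previous backup stored in its CG log")

    def hold_writes(self):
        print(f"{self.name}: Extended Long Busy raised, writes held")

    def close_previous_cg_log(self):
        print(f"{self.name}: previous CG log closed (consistent)")

    def open_new_cg_log(self):
        print(f"{self.name}: new CG log opened")

    def release_writes(self):
        print(f"{self.name}: ELB lifted, writes resume")

def safeguarded_backup(volumes):
    # Step 1 - Reservation: prepare bitmaps and metadata, and make sure the
    # previous consistency group log is complete.
    for v in volumes:
        v.prepare_bitmaps()
        v.drain_previous_cg_log()
    # Step 2 - Check in: briefly hold write I/O on all volumes so that the
    # previous CG logs can be closed at a single, consistent point in time.
    for v in volumes:
        v.hold_writes()
    for v in volumes:
        v.close_previous_cg_log()
        v.open_new_cg_log()
    # Step 3 - Completion: lift the ELB; further host writes are logged into
    # the new consistency group log.
    for v in volumes:
        v.release_writes()

safeguarded_backup([Volume("H1-0001"), Volume("H1-0002")])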
 
Note: Expect the impact of the ELB on host writes to be about the same as (or less than) when using FlashCopy with the consistency group option.
1.2.7 Safeguarded Copy recovery
You can recover any recovery point to a separate recovery volume. This volume must have at least the same capacity as the production volume, and it can be thin provisioned. You can perform the recovery with or without background copy. Typically, you specify nocopy if you need the recovered data only for a limited period of time, and copy if you intend to use it for longer. You can initiate a Safeguarded Copy recovery only through CSM.
We use an example, illustrated in Figure 1-5, to explain how the recovery process works.
Figure 1-5 Example to explain Safeguarded Copy recovery
We have a Safeguarded Copy configuration with the production volume H1, the recovery volume R1, and four Safeguarded Backups (t1 - t4, with t4 being the most recent one) representing four recovery points. We want to recover to point in time t2, with the nocopy option. The recovery consists of two steps and is illustrated in Figure 1-6.
1. The DS8000 establishes a FlashCopy from H1 to R1. This makes R1 identical to H1.
2. It then creates a recovery bitmap that indicates all data that was changed since t2 and must be referenced from the consistency group logs t4, t3, and t2, rather than from H1.
Figure 1-6 Safeguarded Copy recovery data flow
From this point, you have read and write access to R1. If the recovery system reads data from R1, the DS8000 examines the recovery bitmap and decides whether it must fetch the requested data from H1 or from one of the consistency group logs. If the same track appears in more than one backup, the DS8000 uses the “oldest” instance (the one closest to recovery point t2). If the recovery system writes to R1, we distinguish two cases:
Full track write: The DS8000 can write directly to R1 without considering existing data.
Partial track write: The DS8000 has to fetch the existing data first, according to the rules above, and can then apply the update.
If you perform the recovery with background copy, the DS8000 copies all data from H1 and the consistency groups to R1 in the background, following the same rules. You can access R1 at any time while the background copy is still running.
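The read resolution rules described above can be modeled in a few lines of Python. This is a conceptual illustration, not DS8000 microcode; the data structures and function names are assumptions.

from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Backup:
    timestamp: int                                        # t2 < t3 < t4 ...
    log: Dict[int, bytes] = field(default_factory=dict)   # consistency group log

def recover_read(track_id: int, h1: Dict[int, bytes], backups: List[Backup],
                 recovery_bitmap: Set[int], recovery_point: int) -> bytes:
    """Return the content of a track as it looked at the recovery point."""
    if track_id not in recovery_bitmap:
        # Track not changed since the recovery point: production data is valid.
        return h1[track_id]
    # Track changed since the recovery point: scan the CG logs from the recovery
    # point forward, oldest first, and use the first (oldest) instance found.
    for backup in sorted(backups, key=lambda b: b.timestamp):
        if backup.timestamp >= recovery_point and track_id in backup.log:
            return backup.log[track_id]
    raise RuntimeError("track flagged as changed, but not found in any CG log")

# Example: track 7 changed between t3 and t4, so its t2 image is in the t3 log.
backups = [Backup(2), Backup(3, {7: b"image as of t2"}), Backup(4)]
print(recover_read(7, {7: b"current data"}, backups, {7}, recovery_point=2))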
 
Note: You can continue to create new backups, even while a recovery is active and in use.
In cases where you have to restore data back to the original production volume H1, you have several choices:
Full volume restore: You can use Global Copy to replicate the data from the recovery volume to a production volume. The production volume can be on the same DS8000 as the recovery volume or on a different one.
Selective restore: You either have to make the production volume available to the recovery system, or the recovery volume available to the production system. Then you can use standard operating system or application methods to copy the data that you need from the recovery volume to the production volume.
1.2.8 Further considerations
The following sections describe further considerations.
Safeguarded Copy and existing HA/DR solutions
You can combine Safeguarded Copy with any supported HA/DR solution, be it 2-site, 3-site, or 4-site, with or without IBM HyperSwap®. However, there are some important considerations:
Consistency: Whenever you want to create a new Safeguarded Copy Backup, you have to make sure that the set of original volumes is consistent. This might require additional steps before you can create the actual backup.
Management: You can continue to use your existing management solution for your HA/DR configuration, even though you must use IBM Copy Services Manager to manage the Safeguarded Copy configuration and operations. If you have an HA/DR management solution other than CSM, you must make sure that the operations of the two tools do not interfere with each other.
See “DR, HA, and Safeguarded Copy topologies” on page 31 for more details and examples of combining HA / DR with Safeguarded Copy.
Virtual and physical isolation
Just as we have a range of different topologies for high availability and disaster recovery, depending on the protection required, we also can consider different topologies for cyber resilience solutions. The first decision for many organizations is whether they create an environment with physical isolation from production for their protection copies, or whether virtual isolation on existing storage systems is considered sufficient.
For virtual isolation, you create the protection copies on one or more of the storage systems in your existing high availability and disaster recovery topology. The example in Figure 1-7 shows synchronous replication being used for HA and DR with the protection copies being created on one of the production storage systems.
Figure 1-7 Example for Safeguarded Copy with virtual isolation
For physical isolation, you need additional storage systems for the protection copies. These storage systems are typically not on the same SAN or IP network as the production environment. The systems have restricted access, perhaps even with different administrators to provide separation of duties. Figure 1-8 shows an example of such an environment.
Figure 1-8 Example for Safeguarded Copy with physical isolation
We use the same HA/DR configuration as in the previous example. The protection copies, however, are placed outside of this configuration. Data is replicated to the isolated storage system by using Global Copy.
 
Note: To create consistent backups, H3 must be consistent. The Global Copy secondary H3 in our example is an inconsistent copy by design. Every time you create a Safeguarded Backup from H3, you have to make it consistent first. In this example, this means freezing and suspending Metro Mirror between H1 and H2, letting Global Copy synchronize, and then performing the backup. You can automate such a sequence with the CSM scheduler.
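The sequence in the note above can be outlined as a small orchestration sketch. The following Python code shows the ordering of the steps only; it assumes that the Global Copy to the isolated H3 is cascaded from the Metro Mirror secondary H2, and the class and method names are placeholders, not CSM or DS8000 commands.

class ReplicationLeg:
    """Stand-in for a CSM-managed replication or Safeguarded Copy session."""
    def __init__(self, name):
        self.name = name

    def freeze_and_suspend(self):
        print(f"{self.name}: frozen and suspended at a consistent point")

    def wait_until_synchronized(self):
        print(f"{self.name}: out-of-sync tracks drained")

    def backup(self):
        print(f"{self.name}: Safeguarded Backup created")

    def resume(self):
        print(f"{self.name}: replication resumed")

def take_isolated_backup(metro_mirror, global_copy, safeguarded):
    metro_mirror.freeze_and_suspend()      # make the Metro Mirror secondary consistent
    global_copy.wait_until_synchronized()  # drain remaining tracks to the isolated H3
    safeguarded.backup()                   # H3 is now consistent: take the backup
    metro_mirror.resume()                  # restart the HA replication leg

take_isolated_backup(ReplicationLeg("H1->H2 Metro Mirror"),
                     ReplicationLeg("Global Copy to H3"),
                     ReplicationLeg("H3 Safeguarded Copy"))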
Managing configuration changes
You must take certain considerations into account when you change the Safeguarded Copy configuration by adding or removing volumes from the backup set. As with all Safeguarded Copy operations, this is done at the CSM session level, by adding or removing copy sets. See 2.6, “Configuration changes considerations” on page 35 and 3.2.5, “Other Copy Services Manager operations related to Safeguarded Copy” on page 64 for a more detailed description of configuration changes.
Adding copy sets
If you add volumes to your production environment, for example because you need more capacity, you can add those new volumes to the Safeguarded Copy configuration by adding copy sets to the Safeguarded Copy CSM session. From this point in time, your Safeguarded Copy consistency groups will be larger than before. However, the already existing backups still consist of fewer volumes than the ones that you create after the configuration change. If you have to recover to an older backup, the content of the newly added volumes is invalid.
By adding new production volumes, you also change the hardware configuration of your production environment so that the newly added volumes are recognized and can be used. If you have to recover your Safeguarded Copy environment to a point in time where the new volumes contain invalid data, you have to reverse the hardware configuration changes in such a way that these volumes are not accessed anymore.
 
Note: If you add volumes to an IBM z/OS® environment, you make these new volumes known to the systems by adding them to the input/output definition file (IODF). If you need to recover to a backup that does not contain the new volumes, a possible way to avoid these volumes being accessed is to undo the IODF changes.
Removing copy sets
There might be reasons that require you to reduce the number of volumes in your Safeguarded Copy configuration. You can do this by removing copy sets from the Safeguarded Copy CSM session. If you remove copy sets from the session, they are removed completely, and if existing backups still depend on these volumes, those backups might be incomplete and unusable.
 
Important: It is a preferred practice to delay the removal of copy sets until all Safeguarded Copy Backups that rely on them have expired.
 

1 FFIEC “Business Continuity Planning Booklet Appendix J” https://www.ffiec.gov/press/pdf/ffiec_appendix_j.pdf
2 NAIC “Principles for Effective Cybersecurity: Insurance Regulatory Guidance” http://www.naic.org/documents/committees_ex_cybersecurity_tf_final_principles_for_cybersecurity_guidance.pdf