Backup and Recovery Solutions

The basics of any high-availability configuration, regardless of its uptime commitments and availability ratings, include a solid backup and recovery solution. More than any other precaution, being able to recover data from a loss is critical to the survival of a business. Recovering from a disaster is of even greater importance than automated failover solutions. Very few companies survive long-term if unprepared when a real disaster strikes. For this reason, the backup and recovery concepts are presented first in this chapter.

There are several ways to store data for recovery purposes. They all involve making a copy of the production SAP database server's disk volumes, and they all complement each other:

  • Back up data to and recover data from tape (CD or disks), which SAP recommends as a standard administrative practice.

  • Mirror the database server's disk volumes on a separate disk volume, either with software tools or with the use of enterprise storage systems (data replication with point-in-time copies).

  • Use a separate database recovery server, preferably offsite, with log shipping (also referred to as shadow-database server).

The first approach, traditional backup and restore, is presented in the next section. The second, data replication within enterprise storage systems, was presented in Chapter 4; making remote copies, however, is covered in the failover solutions section of this chapter. The third, using a standby recovery server, is also covered in the failover section of this chapter.

Backup and Restore Configurations

There are several configuration options for backups and restoration. These include local backups and backups with a dedicated backup server. The most important server to back up is the SAP database server.

The easiest way to configure a backup solution is to perform a local backup, with the tape devices attached directly to the same server as the disk storage system. A dedicated backup server can either be attached to an enterprise storage system or be made available across a separate high-speed backup network. An alternative local backup configuration takes advantage of the new Storage Area Network (SAN) environment.

Local Backup

The simplest backup and restore solution is to directly attach tape devices (single devices, tape libraries, or tape racks) to the SCSI or Fibre Channel cards within each server, as shown in Figure 5-1. In this case, the backup software manages the data flow from the disks, through the server's memory subsystem, out to the I/O channels to which the tape devices are attached. A local backup is the easiest configuration to restore.

Figure 5-1. Local Backup Configurations


A local backup can also be configured in a high-availability cluster scenario. In this case, each server can back up its disk volumes to a shared tape backup system, whether a library or simply a tape-rack. The backup job is typically configured to run on only one of the cluster nodes at a time, although if each server is directly attached to its own tape devices in a tape library, then the backup jobs can be run in parallel. Backup software, such as HP's OmniBack II or Veritas' NetBackup, can share the tape devices among the cluster nodes, integrate with the clustering software, and handle failover events. More about clustering is described in the second half of this chapter.

Impact of Backup Jobs

A local backup job can be performed either online while the database is still running, or offline, with the database shut down. Performing a full offline backup allows for a faster recovery because only one backup set needs to be restored. A full offline backup also means application downtime, so alternatives are needed. One alternative solution is called Zero Downtime Backup, discussed later, and the remaining alternative is to make online backups.

In a 24×7 operations environment for mySAP.com, no timeframe is available for full offline backups. Thus performing online backups can help keep the application available, but this has a few implications:

  • The SAP system performance is impacted due to the backup running on the database server;

  • Restores take longer depending on how much changed since the beginning of the backup; and

  • The status of the database data files or tablespaces is usually changed to read-only while their backup takes place. If a failover or restart of the database occurs during this time, the database must be brought up in recovery mode either manually or by the failover software.

How much the system performance is impacted depends on the disk and tape I/O configuration, as well as on the utilization of the system for online transaction-processing.

TIP

Online Backups—Server Configuration

One way to reduce the impact of online backup jobs is to increase the amount of memory in the server and to consider reserving one CPU for each streaming tape device in addition to those CPUs needed for the concurrent transaction-processing load.


Application Servers and Backup/Restore

Although the SAP application servers also need to be backed up, this is only required when configuration changes are made to their boot disks (new kernel patches, network and environment settings, registry changes, etc.). The application servers can either be backed up and restored via a network backup configuration, or a local DAT/DLT tape device can be used on each. The local tape solution is the easiest to administer for a restore because no network environment must first be in place. In addition, the local tape can support the One-Button-Disaster-Recovery feature (explained later), making recovery simple.

Local Backup with Backup Server

The advantage of a local backup is in its speed of backup and restore, as well as its simplicity. However, the investment in a backup solution may require a large tape library, which is typically leveraged across many database servers in order to be cost effective. In this scenario, the backup server with the tape library is attached to the same disk system as the SAP database servers. The backup server does not run any SAP software because it is dedicated to managing the backup and restore jobs (see Figure 5-2).

Figure 5-2. Local Backup with a Backup Server


For this solution to work, however, the backup server must get access to (mount) the disk volumes to be backed up. One method is to unmount the disk volumes from the database server and allow them to be mounted by the backup server during the backup job. However, this means SAP system downtime for the entire period of the backup job.

Zero Downtime Backup

An alternative method is to use a business copy or point-in-time copy volume that can be split off for a backup, and remirrored after the backup is finished. This solution is used with online backups in high-end, 24 x 7 mission-critical configurations with constant use of the database. It is also possible to use this solution for making offline backups.

For an online backup job, the status of the data files or tablespaces residing on the disk volume(s) to be split is set to read-only mode. Then the storage system's cache is flushed to synchronize the disks for a consistent point-in-time copy (a few moments of sequential I/O). Once both the production and business copy volumes are physically identical, the split is made. After the few seconds it takes to make the split, the data files or tablespaces on production disk volumes are set back to read-write status and the database resumes normal activity. Transactions that happened during this short time are recorded in separate log files, which also must be backed up because they are needed for recovery. The database is online the entire time, therefore users experience zero downtime. Making offline backups using this process is also possible, but it requires a graceful shutdown of the database, which is no longer zero downtime or zero impact.
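The order of operations in this split-mirror procedure can be sketched as follows. This is only an illustration of the sequence described above: the class name `SplitMirrorBackup` and its methods are hypothetical placeholders that record each step, and a real implementation would invoke the database and storage array tools at each point instead.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SplitMirrorBackup:
    """Records the step order of a split-mirror online backup.

    Hypothetical placeholder: nothing here talks to a real database
    or storage array; each step is only appended to a log.
    """
    log: List[str] = field(default_factory=list)

    def _step(self, description: str) -> None:
        self.log.append(description)

    def run(self) -> List[str]:
        # 1-2: make the volumes consistent before the split
        self._step("set data files or tablespaces to read-only status")
        self._step("flush storage cache to synchronize the mirror")
        # 3-4: split takes a few seconds, then normal activity resumes
        self._step("split business copy from production volumes")
        self._step("set data files or tablespaces back to read-write")
        # 5-6: backup runs on the split copy, then the mirror is rejoined
        self._step("back up split copy and log files from backup server")
        self._step("resynchronize business copy with production volumes")
        return self.log


if __name__ == "__main__":
    for number, step in enumerate(SplitMirrorBackup().run(), start=1):
        print(f"{number}. {step}")
```

The point of the ordering is that the database is only in its restricted state between the first and fourth steps; the long-running tape backup happens afterward against the split copy.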

A dedicated backup server can perform the backup job once the split is made. At the conclusion of the backup job, the split copy volume is merged with the production disk volume(s) for resynchronization. How long this takes depends on how much data was changed since the last split, as well as on how fast the disks and internal data buses are. This could be a matter of minutes to an hour or two. Most of the large enterprise storage systems support the split mirror feature with their primary disk volumes (P-Vol) and business copy disk volumes (point-in-time copies, etc.).

Various backup software packages support this environment, including HP OpenView OmniBack II and Veritas' NetBackup for HP SureStore E Disk Array XP series or EMC Symmetrix. These backup solutions integrate into SAP's BACKINT interface and provide the zero-impact online backup feature described previously with script control of the database and the disk volume split. Solutions that integrate with the SAP R/3 software, such as its BACKINT interface, require certification from SAP.

The main advantage of this solution is that there is no significant performance impact on the SAP system or significant downtime. In addition, this solution is fully automated because the backup software manages the complete cycle with the use of control scripts. It is the ideal backup and recovery solution for true 24 x 7 mission-critical operation environments.

Network Backup

An additional way to leverage the investment in a high-end backup solution across multiple database servers is to perform the backups via the network. The benefits of a network backup solution include centralized management, shared use of the tape library, support for heterogeneous database servers, and potential leverage of the existing network infrastructure.

Because of the high load on the network, however, it is recommended to use a separate backup network, preferably at Gigabit speeds. Otherwise, the backups and restores may take longer due to the network bottleneck. In Figure 5-3, only the separate backup network is shown. The additional server network for the application servers, the public network for the client PCs, and the cluster network are purposely left out for simplicity. This does mean, however, that quite a few network I/O cards are needed in each server, especially if redundant links are used. In addition, the OS and backup software must be properly configured and maintained on each server to send data over the correct network, resulting in increased complexity.

Figure 5-3. Dedicated Backup Network Configuration


There is a performance impact on the database server being backed up because it must retrieve data from its disks and send it out across the network. This requires planning the backup jobs during lower processing periods. In cases where there is no dedicated network for backups (i.e., the client/server network between the SAP database and application servers is used instead), there will be a performance impact on the entire SAP system. This is especially a problem if a restore happens during online business hours.

SAN Backup

The main benefits of a network backup solution include a centralized backup process that shares the tape library and can operate in a heterogeneous environment. With the emergence of storage area networks (SANs), these benefits can also be realized without the network configuration complexity. SANs also enable virtual tape servers, or serverless backup devices.

Figure 5-4 shows a common disk and tape SAN configuration. Because SANs are still a maturing technology, however, some incompatibilities may exist, so only integrated and tested solutions should be considered. Specific OS releases and drivers are needed, along with very specific types of FC adapters and FC switches for SAN Fabric Logon. Thus, the configuration shown in Figure 5-4 may not apply across all server and storage systems and may have limited OS support. The figure shows the traditional tape library, which requires a backup server or backup process on each server, plus a virtual tape server, which does not require a separate backup server.

Figure 5-4. Common Tape and Disk SAN Backup Configuration


A simple way to avoid the compatibility issues with a common tape and disk SAN is to use a SAN dedicated to tape devices. This still has the benefit of eliminating the network configuration effort, but it does require a separate SAN infrastructure. A dedicated SAN for tape backups in a heterogeneous environment is generally available at this time.

Influences on Backup and Restore Time

No matter what backup and recovery solution is used, when an SAP system needs to be recovered, the time it takes to do so becomes critical to supporting the business operation's continuity.

Backups to tape can usually be performed online, during noncritical business hours, or with a zero downtime backup solution (discussed earlier). In comparison, recovery must usually happen much more quickly after a failure event, because users can't work with the system or the database until it is recovered. Thus the recovery time is the most critical and is influenced by the performance of all the components in a solution.

The components in the backup and restore solution include the tape devices, the I/O channels, the disk mechanisms, the network (if used), and the server's system bus. Backup and restore times are usually measured in minutes or hours. However, the components' individual performance is usually measured in MB per second.

Tape Devices

The tape devices are at the heart of most backup solutions. Ideally, the backup or restore solution should be designed so that the tape devices are the slowest component in the chain, with every other component able to keep up with them. Table 5-4 lists some of the common backup devices and their native data transfer rates (i.e., no compression).

Table 5-4. Restore Transfer Rates of Common Backup Devices

Tape Device                    Capacity, GB (native)   MB/s (native)[a]   GB/hr[b]
DAT8 / DDS-2                   4                       0.5                1.8
DAT24 / DDS-3                  12                      1.0                3.6
DAT40 / DDS-4                  20                      3.0                10.8
DLT40 or 4000                  20                      1.5                5.4
DLT70 or 7000                  35                      5.0                18
DLT80 or 8000                  40                      6.0                21.6
StorageTek 9840                20                      10                 36
SDLT                                                   10                 36
Linear Tape Open (1st Gen.)    100                     15                 54
Optical Disk (R/W)             1.3                     0.8                2.9

[a] Estimated values.

[b] One Gigabyte = 1000 Megabytes.
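The GB/hr column follows directly from the MB/s column, and the same conversion gives a rough lower bound on restore time. A minimal sketch, assuming a single device streaming at its native rate with no other bottleneck; the function names are illustrative, not from any backup product:

```python
def mb_per_s_to_gb_per_hr(rate_mb_s: float) -> float:
    """Convert MB/s to GB/hr using 1 GB = 1000 MB, as in footnote [b]."""
    return rate_mb_s * 3600 / 1000


def restore_hours(data_gb: float, rate_mb_s: float) -> float:
    """Hours to move data_gb at a sustained rate of rate_mb_s."""
    return data_gb / mb_per_s_to_gb_per_hr(rate_mb_s)


if __name__ == "__main__":
    # DLT8000 at 6 MB/s native, per Table 5-4
    print(f"{mb_per_s_to_gb_per_hr(6.0):.1f} GB/hr")
    # Illustrative 100GB database restored from one DLT8000
    print(f"{restore_hours(100, 6.0):.1f} hours")
```

For an illustrative 100GB database on one DLT8000, this works out to roughly 4.6 hours, before any log recovery is applied.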

Parallel Tape Backup

Multiple tape devices can be combined in a tape rack or tape library (robot) solution. If the data can be backed up in parallel from independent data sources, then the overall transfer rate will increase linearly by the number of tape devices in the tape rack or library. The same applies to the restore. For example, an SAP database server usually has multiple logical disk volumes to store the various database tablespaces or files. Each one can be treated as an independent data source for parallel backup purposes. If a tape library is used with four DLT8000 devices, then the combined native transfer rate is 24MB/s (6MB/s × 4 devices, assuming the disk system can keep up).
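The linear scaling, and the caveat that the disk system must keep up, can be expressed in a few lines. This is a sketch under the assumption that each tape device is fed by an independent data source; the `disk_limit_mb_s` parameter is a hypothetical cap introduced here for illustration:

```python
def parallel_rate_mb_s(per_device_mb_s: float, devices: int,
                       disk_limit_mb_s: float = float("inf")) -> float:
    """Combined tape rate, capped by what the disk system can deliver."""
    return min(per_device_mb_s * devices, disk_limit_mb_s)


if __name__ == "__main__":
    # Four DLT8000 devices at 6 MB/s native each
    print(parallel_rate_mb_s(6.0, 4))
    # Same library, but disks that can only sustain 20 MB/s (assumed value)
    print(parallel_rate_mb_s(6.0, 4, disk_limit_mb_s=20.0))
```

With an unlimited disk system the four devices reach the 24MB/s quoted above; with the assumed 20MB/s disk cap, adding the fourth device no longer helps.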

Redundant Array of Independent Tapes—RAIT

RAIT is essentially the disk RAID technology applied to tapes to increase throughput speeds and to provide redundancy. RAIT 1 is mirrored tape devices and is recommended for mission-critical SAP database servers. Having two identical tape sets allows one to be stored securely offsite while the other is kept locally for quick recovery purposes, which is important for disaster recovery.

RAIT 5 employs a distributed parity scheme, just like RAID 5 for disks. However, all of the tapes are necessary for the complete volume. Although there are some throughput and parity protection benefits over non-RAIT tape sets, using RAIT 5 does not allow one set of tapes to be removed for offsite storage, making its usage limited in mission-critical environments.

Data Compression

If the database data files contain a lot of unused but allocated space, initialized with zeros, then it is possible to compress this during the backup process, allowing for higher transfer rates and larger capacities. For example, a 100GB database may only contain 50GB of actual used space. Tape vendors often quote their tape transfer rates and maximum capacities at 2:1 compression ratios. However, as the SAP database fills up, less compressible space is available. Because the backup and restore windows usually stay fixed and must be met, it is recommended to assume a compression ratio of 1:1.4 or below for SAP database servers.

One Button Disaster Recovery

The normal method of building up a system from scratch, such as after a disaster, requires several steps. First, the hardware must be secured and configured. Next, the disaster recovery (DR) operating system must be installed from installation CDs, as the basis for the restore of the predisaster OS configuration. Then the database can be recovered, depending on the availability of logs and offline or online tape backups. This doesn't include any special steps needed to rebuild a local cluster. In any case, it takes a long time to fully recover up to the point in time of the disaster and even longer to recover the transactions that happened during the time recovery took place. The minimum downtime is typically two days for a relatively quick recovery, and weeks if things go wrong. Sometimes, a full recovery is not possible.

To accelerate the restore of the OS, Hewlett-Packard's One-Button-Disaster-Recovery (OBDR) feature can be considered. This allows the step of installing the DR operating system from CDs to be skipped in the recovery process. The tape mechanisms from HP have special firmware built in that allows them to emulate a bootable CD-ROM drive if the correct button on the tape mechanism is pushed. This allows a quick restore of the boot disk. The code loaded into memory is sufficient to boot the computer and to permit loading of the operating system and restore software, so that a full restoration can take place. The server boots from a virtual CD image, written by backup software aware of this feature, which contains bootable code to reload the original operating system, including drivers for the main peripherals.

This OBDR feature was created primarily for Intel-based PCs that have BIOS support for bootable CD-ROM drives (Unix servers and workstations have had the capability to boot from tape devices for years). For Intel-based SAP application servers, it makes sense to use this feature. Specifically for the database servers, this can be used for the OS boot disk while another backup solution is used for restoring the data files.

Disk Mechanisms and SCSI Channels

For a fast backup or restore, the tapes must be kept streaming (continuously transferring data). However, the disks and SCSI channels must be able to keep up with the tape transfer requirements. Otherwise, the tapes will be forced to pause for synchronization with the slower devices, causing unnecessary delay. Ideally, the disk mechanisms and SCSI channels chosen should be significantly faster than the tape transfer rates.

Some of the common data-transfer rates for sequential write-I/O of disk mechanisms are anywhere from 5MB/s to more than 15MB/s, depending on the rotational speed (RPMs) and data density (capacity) of the disks. Disk mechanism details were discussed in Chapter 4.

Number of Disks

For the database server's data volumes, there are usually enough disks configured to outperform the total number of parallel tape devices used for backup or restore. However, the log volumes typically have fewer disks configured, making them a potential bottleneck during backups or restores. The number of disks used for the log volumes should be increased to the point where they are at least 20% faster than the tape device(s) used for backup or restore.

For example, if the restore of the logs is coming from one DLT8000 device with a 1:1.4 compression, then the write-I/O throughput to disk needed is 8.4MB/s (1 tape × 6MB/s × 1.4). Because an older 10K rpm disk can only handle ~6MB/s with sequential write-I/Os, two disks would be needed to be faster than the tape, assuming the disks are not performing any other activity. Because the disk log volumes are usually in RAID 1, four disks are required in total.
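The disk count in this example can be reproduced with a short calculation. This is a minimal sketch: the function name and the `margin` parameter (for the 20% headroom recommended above) are introduced here for illustration, and it assumes the disks do nothing else during the restore.

```python
import math


def log_disks_needed(tapes: int, tape_mb_s: float, compression: float,
                     disk_mb_s: float, margin: float = 0.2,
                     mirrored: bool = True) -> int:
    """Disks needed so the log volume outruns the tape stream.

    compression is the expansion factor (1.4 for a 1:1.4 ratio);
    margin is the extra headroom over the tape rate (0.2 = 20%).
    """
    required_mb_s = tapes * tape_mb_s * compression * (1 + margin)
    data_disks = math.ceil(required_mb_s / disk_mb_s)
    # RAID 1 doubles the physical disk count
    return data_disks * 2 if mirrored else data_disks


if __name__ == "__main__":
    # One DLT8000 (6 MB/s), 1:1.4 compression, ~6 MB/s per disk
    print(log_disks_needed(1, 6.0, 1.4, 6.0))
```

For one DLT8000 the tape stream is about 8.4MB/s, or about 10.1MB/s with the 20% margin; either way two 6MB/s disks are enough, and mirroring brings the total to four.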

Number of Tapes per SCSI Channel

The SCSI channel should not be a performance bottleneck for the tape transfers. It's important to consider the sustained transfer rates of the SCSI channels as well as the data compression. Table 5-5 lists transfer rates of some of the common SCSI channels.

Table 5-5. Transfer Rates of SCSI Channels

SCSI Channel                      MB/s Sustained[a]   GB/hr Sustained[b]
SCSI Fast/Wide                    15 (20 max.)        54
UltraSCSI                         30 (40 max.)        108
UltraSCSI-2                       60 (80 max.)        216
Fibre Channel or SAN FC Switch    90 (100 max.)       324

[a] Sustained is typically three-quarters of theoretical maximum for parallel SCSI.

[b] One Gigabyte = 1000 Megabytes.

With a data compression of 1:1.4, two DLT7000 tapes can transfer ~14MB/s (2 tapes × 5MB/s × 1.4). A typical fast/wide SCSI bus can only sustain ~15MB/s, thus the limit of two tapes per bus with no other devices attached. With faster tape mechanisms, faster I/O channels are needed, such as Fibre Channel or UltraSCSI-2. Sizing the I/O channel must be considered whether using individual tape devices or tape libraries with multiple devices inside.
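The same sizing rule can be applied to any channel and tape combination from Tables 5-4 and 5-5. A minimal sketch, assuming no other devices share the channel; the function name is illustrative:

```python
import math


def max_tapes_per_channel(channel_sustained_mb_s: float,
                          tape_mb_s: float,
                          compression: float = 1.4) -> int:
    """Largest number of streaming tapes one channel can sustain.

    compression is the expansion factor (1.4 for a 1:1.4 ratio).
    """
    per_tape = tape_mb_s * compression
    return math.floor(channel_sustained_mb_s / per_tape)


if __name__ == "__main__":
    # DLT7000 (5 MB/s native) on Fast/Wide SCSI (15 MB/s sustained)
    print(max_tapes_per_channel(15, 5.0))
    # DLT8000 (6 MB/s native) on UltraSCSI-2 (60 MB/s sustained)
    print(max_tapes_per_channel(60, 6.0))
```

The first case gives the two-tape limit described above (each DLT7000 needs about 7MB/s, and 15MB/s leaves no room for a third); the second shows that an UltraSCSI-2 channel could carry seven DLT8000 devices under the same assumptions.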

The newer tape backup solutions employ SCSI over native Fibre Channel. This allows more tape devices to be placed on one I/O channel. However, because many of the older tape device designs are still based on Fast/Wide SCSI, FC-to-SCSI bridges are needed.

Network Throughput

It is also possible to perform backups and restores over the network, usually with a dedicated backup network and a central backup server. This only makes sense to use with SAP database servers within the data center environment where high-speed LANs exist (see Table 5-6). Before the emergence of SANs, this was the traditional way to leverage a centralized backup system with a tape library. When using a network backup solution, making sure the high-volume backup traffic actually flows over the separate and dedicated backup network is the biggest challenge.

Table 5-6. Transfer Rates of Networks

Network Type                              MB/s (max. theoretical)   GB/hr[a]
Ethernet                                  1.25                      4.5
100Base-T Ethernet or FDDI                12.5                      45
Gigabit Ethernet (1000Mbit per second)    125                       450

Sustained network throughput will be lower, depending on the topology and utilization.

[a] One Gigabyte = 1000 Megabytes.
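Because every component in the chain is rated in MB/s, the slowest one determines the end-to-end backup or restore rate. The sketch below picks out that bottleneck for a network backup. The tape figure is four DLT8000 devices at 1:1.4 compression (33.6MB/s), the network figures are the theoretical maximums from Table 5-6, and the 60MB/s disk figure is an assumed value for illustration only.

```python
from typing import Dict, Tuple


def bottleneck(components: Dict[str, float]) -> Tuple[str, float]:
    """Return the name and MB/s rate of the slowest component."""
    name = min(components, key=components.get)
    return name, components[name]


if __name__ == "__main__":
    chain = {
        "tape devices": 33.6,   # 4 x DLT8000 x 1.4 compression
        "disks": 60.0,          # assumed value, not from the tables
        "network": 12.5,        # 100Base-T, theoretical maximum
    }
    print(bottleneck(chain))

    chain["network"] = 125.0    # Gigabit Ethernet, theoretical maximum
    print(bottleneck(chain))
```

Over 100Base-T the network limits the job to 12.5MB/s at best, so the tapes cannot stream; over Gigabit Ethernet the tape devices become the slowest component, which is the intended design.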

Server Memory and PCI I/O Buses

If enough SCSI channels are used at their maximum transfer rates, at some point, the internal I/O and system buses of the server can become a limiting factor. If the server is to be used for online backups for significant periods, then lots of system and I/O bus bandwidth is needed to keep enough headroom for other tasks online. The MB/s throughput capacity needed for the tape backup job should not exceed half of available I/O bandwidth during offline backups, and significantly less during online backups.

The PCI I/O bus can handle anywhere from 80MB/s to more than 200MB/s. This depends on the chip set in the server, on the number of PCI slots per bus, and on the PCI card type (MHz, 32-bit or 64-bit, etc.). A server's system bus (or memory bus) can usually handle 1GB/s or more, depending on the type. Very high-end servers can exceed 10GB/s system bus bandwidth, so they are able to handle even the largest backup jobs while online.

For example, a very large backup job, with 10 LTO tape devices at 15MB/s each, with 1:1.4 compression, could require up to 210MB/s I/O bandwidth (10 devices × 15MB/s × 1.4). Because the disks would need to keep up with this transfer rate for tape streaming, they also need 210MB/s I/O bus bandwidth. Because both tapes and disks are attached locally to the same server, the minimum system bus or memory bandwidth would need to be 420MB/s. If other tasks are performed during the backup job, more bandwidth will be needed.
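This arithmetic can be checked with a short calculation. It is a sketch under the stated assumption that tapes and disks are attached to the same server, so the data crosses the system bus once on the way in from disk and once on the way out to tape; the function name is illustrative.

```python
from typing import Tuple


def bus_bandwidth_needed(devices: int, tape_mb_s: float,
                         compression: float = 1.4) -> Tuple[float, float]:
    """Return (I/O bandwidth per side, total system bus bandwidth) in MB/s.

    compression is the expansion factor (1.4 for a 1:1.4 ratio).
    """
    tape_side = devices * tape_mb_s * compression
    disk_side = tape_side  # disks must match the tape stream
    return tape_side, tape_side + disk_side


if __name__ == "__main__":
    io_side, system_bus = bus_bandwidth_needed(10, 15.0)
    print(f"I/O per side: {io_side:.0f} MB/s, system bus: {system_bus:.0f} MB/s")
```

For ten LTO devices this yields 210MB/s on each side and 420MB/s across the system bus, matching the figures above.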

Restore Type

The type of recovery also impacts the restore time. For example, if only a full restore is needed (from a full offline backup), then only one copy job from tape to disk is required. In many cases, however, incremental or differential backup jobs need to be applied as well, increasing the overall time before operations can continue.

An additional way to minimize restore time is to regularly archive data from the SAP database, keeping the size of the production database at a reasonable level.
