Chapter 19. Backing Up Your Data

When all else fails . . . restore from backup!

Keeping good backups is frequently the only way to recover from some accidents, disasters, or break-ins. Linux has a number of methods for archiving your data, ranging from tools like tar or cpio that come with Linux to feature-rich, sophisticated, commercial packages.

There are a number of ways to keep backups, ranging from simply tarring your entire filesystem to a tape, to using dump and a well-designed backup schedule, to running high-end commercial backup and restore programs.

Of course, you also have a variety of backup media to choose from: floppies (if you are truly desperate), tape drives (from old nine-track units up to DAT), write-once, read-many (WORM) drives, magneto-optical floppies (written magnetically, read optically), and ZIP and JAZ drives. Of these, only the newer tape options and JAZ drives have gigabyte capacity, and only Exabyte and DAT can handle more than a couple of gigabytes. Unfortunately, tapes are in many ways the most cumbersome to work with because of their sequential-access nature.

With the new proposed standards for CD-ROM- (and thus WORM-) based filesystems, this media will soon sport multiple gigabyte capacity. Though they will still be write once, the easier access (random instead of sequential) will likely make them popular once their price drops into the affordable range. In actuality, the technology for this has existed for some time, but the popularity of the current standard has kept it from being replaced. However, the need to store large amounts of data reliably has grown tremendously in recent years, and CD-ROMs are a big win over tape in both ease of access and longevity (decades versus years).

The rest of this chapter assumes you have some form of large capacity media on which to back up your data, though many of the guidelines are useful regardless of the media being used or its capacity.

In general, you will want to run your backup during times of low activity for two reasons. First, it is not a good idea to have lots of files open or changing while the backup is running. Some programs will lock a file, and the backup will, at best, be able to skip it or, at worst, exit or crash. Second, backing up your system tends to consume a lot of system resources, and if your server is fairly loaded already, it will likely be bogged down considerably if you try to run the backup during "normal" business hours (whatever those might be).

tar and mt

The name "tar" stands for tape archive. Before the more recent versions of GNU tar, it was not really suitable for use as a backup tool: it had no easy way of performing incremental backups or of properly handling files with holes (so-called sparse files).

At the base level, you can create a cron job that will compress and archive all of your filesystems and network volumes to a tape drive every night or every few nights. There are a couple of drawbacks to this method, however.

Typically, only a small fraction of the filesystem changes over the course of even a week. Making a copy of the whole filesystem every other night is overkill. It would be better if incremental backups could be made.

It's very possible that you could have more stuff to back up than you have capacity on your particular combination of tape and drive. In this case, you would have to break up the backup and change tapes at some point to complete the backup. Once again, incremental backups would alleviate this.

If your backup, or sections thereof, is small enough, you may want to put more than one on a single tape, and then you may need some way to access a particular archive at some later point. You could use tar's append feature, but then the ability to maintain different versions is lost. When the archive is unpacked, the appended files will overwrite ones with the same names that were extracted earlier.

Using tar

The tar command has three major functions: create a tar file, list the contents of a tar file, and extract files from an existing tar file. The corresponding switches to tar are c, t, and x, respectively. These switches are combined with others to create an efficient means of accomplishing backup and recovery.

Other commonly used switches include f for filename, v for verbose, and z to enable gzip (de)compression. The f option requires a filename, or a hyphen to indicate standard input or standard output. If you use several switches that require arguments, you can either group the switches together and follow them with their arguments in the same order, or give each switch immediately followed by its own argument. So, it is perfectly legal to run either of these commands:

tar -cvf foo.tar file1 file2 file3
tar -c file1 file2 file3 -vf foo.tar

In fact, these two commands do the exact same thing.

If the -z switch were added, tar would also compress the archive with gzip. Typically, compressed tar files are named foo.tgz or foo.tar.gz.

You can extract a compressed archive with the command:

# tar zxvf foo.tar.gz

or read a compressed archive from a SCSI tape device with:

# tar zxvf /dev/nst0
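The three major functions can be tried together without a tape drive. The following is a minimal sketch, assuming a tar that supports the z switch and the -C (change directory) option, which the text above does not cover; the directory and file names are made up for the illustration.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)                        # scratch directory for the demonstration
cd "$work"

mkdir project
echo "first"  > project/file1
echo "second" > project/file2

tar -czf project.tar.gz project          # c: create a gzip-compressed archive
listing=$(tar -tzf project.tar.gz)       # t: list its contents
mkdir restored
tar -xzf project.tar.gz -C restored      # x: extract into another directory

echo "$listing"
restored_text=$(cat restored/project/file2)
echo "restored file2 contains: $restored_text"
# rm -rf "$work"                         # clean up when finished
```

Listing the archive before extracting it is a cheap way to confirm that the backup contains what you expect.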

When using tar, you must restore files with the same block size that was used to back them up. Get the block size wrong, and you won't be able to read the files you backed up. You may also want to change the block size to suit a slower or faster tape drive, since the default on some machines is rather small (the Linux version of tar uses 20 × 512 bytes, or 10KB). Increasing the block size reduces the overhead of writing to the tape, which can improve performance and shorten the write time. The block size is set with the -b <blocksize> option to tar, where <blocksize> is multiplied by 512 to get the block size in bytes.
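The arithmetic can be checked against an ordinary file before committing to a tape. This sketch assumes GNU tar, where -b sets the blocking factor and the archive is padded to a whole number of records; the factor of 64 is only an illustrative value, not a recommendation.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"
mkdir data
echo "payload" > data/file

factor=64                                  # hypothetical blocking factor
record_bytes=$((factor * 512))             # bytes written per record

tar -b "$factor" -cf data.tar data         # write using the larger records
archive_bytes=$(wc -c < data.tar)
archive_bytes=$((archive_bytes))           # normalize any padding from wc

echo "record size: $record_bytes bytes, archive size: $archive_bytes bytes"

mkdir restored
tar -b "$factor" -xf data.tar -C restored  # read back with the same factor
```

On a real tape the same -b value would be given for both the backup and the restore.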

Now that we have beaten up on tar quite a bit, let's point out a few things that tar is good at doing. tar can archive files that are not currently needed or take a snapshot of a project. If one of your users has some data they still need but will not be using for a while, you can tar the data to a tape and remove it from the server. Also, you may need to keep snapshots of projects, but not necessarily online.

As mentioned earlier, tar is commonly used to create packages or distributions of documentation, source code, or precompiled software.

Finally, tar can be used to transfer directory trees around in your filesystem. Its many command line options let you control preservation of permissions, modification dates, and the following of symbolic links. This allows you to re-create a directory tree on another partition with more control than cp -a. An example illustrates this nicely:

# cd /newhome
# (cd /home ; tar cf - caryc) | tar xf -

This will re-create the directory /home/caryc as a subdirectory of /newhome. It will preserve the permissions and ownership of the files and recreate (but not follow) symbolic links (assuming the links are not stale).

In general, it is not a good idea to follow symbolic links unless you know where all the links in the tree are pointing. If a link points back down to a directory that is within the archive, you will start recursively re-creating the directory tree within the subdirectory and very quickly eat up a large amount of disk space.
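The tar-to-tar pipe can be rehearsed safely in a scratch directory. This is a sketch under stated assumptions: the directory layout is invented, the p switch (preserve permissions) is added on the extracting side, and && replaces the semicolon so that nothing is archived if the cd fails.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
mkdir -p "$work/home/caryc" "$work/newhome"
echo "notes" > "$work/home/caryc/notes.txt"
ln -s notes.txt "$work/home/caryc/latest"     # a relative symbolic link

# Archive to standard output in one shell, extract from standard input in another.
cd "$work/newhome"
(cd "$work/home" && tar cf - caryc) | tar xpf -

copied_text=$(cat "$work/newhome/caryc/notes.txt")
link_target=$(readlink "$work/newhome/caryc/latest")
echo "copied: $copied_text, link points to: $link_target"
```

The link arrives as a link rather than as a second copy of the file, which is the behavior described above when links are not followed.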

Using mt

To create different sections of tape, you need to use both tar and mt (for magnetic tape). The mt program is used to control tape devices. Since each tar archive starts at the beginning of a block and is followed by an EOF (end-of-file) mark, you can use the mt -f <device> fsf <count> command to skip forward over sections of tape. Here <device> is the tape device (see the note on device names below) and <count> is the number of times to perform the operation.

For example, if you had five tar files on a tape and wanted to add a sixth, you would first advance the tape with mt -f /dev/nst0 fsf 5 to get to the next block after the fifth end-of-file marker. Once you had arrived at the last EOF mark on the tape, you would then use tar -cvf to write new information. If you needed to recover data from a particular section, you could use the mt command to advance the tape to the section you needed, then use tar -xvf to restore the data from tape. You would want to use this if you were backing up multiple filesystems to a tape. You could then have each filesystem use a different section of tape and restore only the section you needed.
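The sequence of mt and tar calls can be wrapped in two small functions. The sketch below is a rehearsal, not a tested backup script: DRY_RUN, run, append_archive, and restore_archive are hypothetical names, and with DRY_RUN left at 1 the commands are only printed, so no tape drive is needed.

```shell
#!/bin/sh
set -eu

TAPE=${TAPE:-/dev/nst0}     # non-rewinding device, as used in the text
DRY_RUN=${DRY_RUN:-1}       # hypothetical switch: 1 prints commands, 0 runs them

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Write a new archive after the ones already on the tape.
# $1 = number of existing archives, remaining arguments = paths to back up
append_archive() {
    existing=$1
    shift
    run mt -f "$TAPE" rewind
    if [ "$existing" -gt 0 ]; then
        run mt -f "$TAPE" fsf "$existing"
    fi
    run tar -cvf "$TAPE" "$@"
}

# Restore archive number $1 (counting from 1) into the current directory.
restore_archive() {
    run mt -f "$TAPE" rewind
    if [ "$1" -gt 1 ]; then
        run mt -f "$TAPE" fsf $(( $1 - 1 ))
    fi
    run tar -xvf "$TAPE"
}

plan=$(append_archive 5 /home; restore_archive 3)
echo "$plan"
```

Rewinding first means the skip count never depends on where the tape was left by an earlier command.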

Tape Operations

Another feature of tar is its use of different device names. Based on the device name that tar uses, it can make the tape rewind to the beginning or remain in place after completing. This is important, since if tar backs up a directory, rewinds, and backs up another directory, the first backup is overwritten. Using mt to advance the tape is time-consuming and increases the amount of coding you need to do to calculate how far to advance the tape.

Using the /dev/nst0 device name tells the tape device not to rewind after completing. Note that you'll need to use this when advancing the tape with mt, or else the tape will rewind after being advanced! The regular /dev/st0 device rewinds after the command is completed, and can be used after a restore to automatically rewind the tape. The mt command also supports an explicit rewind and, if the tape device supports it, can eject the tape. These commands are mt -f /dev/st0 rewind and mt -f /dev/st0 offline, respectively. Other mt commands include retension (forward to the end of the tape, then rewind), status (to give the status of the tape drive), erase (to erase the tape), and a number of SCSI-only options relating to hardware compression, buffering, density, and others. Check the man page for mt for a complete list of the available options.

cpio

In many ways, cpio is similar to tar. It supports more formats (including the format used by tar) and can also deal with archives from machines with different byte orders. Like tar, cpio can read from or write to network devices as well as local ones. Finally, cpio can be used in pass-through mode to copy directory trees.

The choice between using cpio or tar to perform backups is largely a matter of preference. One place you will meet cpio is in Red Hat's package tools: Red Hat systems (and the RPM install file) include a program called rpm2cpio, which converts an RPM file to a cpio archive so that it can be extracted and installed on systems that do not have Red Hat installed.

The cpio program has (at last count) a billion available switches to it. Much like tar, these can be shortened (by us anyway) down to a few major ones you need to know about. These include: -F <file> to output to a file instead of STDOUT, -i for extract, and -o for create. These seem to make less sense than the tar commands, except for the fact that cpio thinks of the extract command as "copy in" and the create command as "copy out".

A few examples should help clarify things. One of the simplest things you can do with cpio is archive the files in a directory (ignoring subdirectories):

# ls | cpio -o > /tmp/stuff.cpio

To include directories, you do this:

# find . -depth -print | cpio -o > ~/morestuff.cpio

To create the file on another machine (assuming you have permission to, usually via an entry in ~/.rhosts):

# find . -depth -print | cpio -o -F cary@loki:/archive/stuff.cpio

To extract the files from the archive:

# cpio -i < stuff.cpio

Note that cpio will not re-create the directory structure on extraction unless explicitly told to with the -d switch, as follows:

# cpio -id < ~/morestuff.cpio

dump and restore

dump is probably the best free alternative for performing backups. It makes a fairly low-level copy of the filesystem. Because of this, any type of file (including sockets and block and character devices) can be archived and files that have empty blocks in them are properly saved. Additionally, it can perform incremental backups and archives can span multiple tapes. One final nicety is that dump has no limit on the length of file- and pathnames.

Incremental backups are controlled by assigning a dump level to a particular backup. Dump levels range from 0 to 9, and when a dump of a certain level N is performed, all files that have changed since the last dump of level N-1 or lower are sent to the tape. As you might guess, a level 0 dump will dump the entire filesystem.
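The rule is easier to see on a concrete schedule. The sketch below does not run dump at all; it takes a hypothetical week of dump levels and works out, for each night, which earlier dump serves as the baseline (the most recent one with a lower level). The variable names are illustrative.

```shell
#!/bin/sh
set -eu

levels="0 3 2 5 4 7 6"   # hypothetical schedule, one dump per night
taken=""                 # levels of earlier dumps, newest first
baselines=""
night=0

for level in $levels; do
    base="none"          # a level 0 dump has no baseline: it saves everything
    pos=$night
    for prev in $taken; do
        pos=$((pos - 1))
        if [ "$prev" -lt "$level" ]; then
            base=$pos    # most recent earlier dump with a lower level
            break
        fi
    done
    echo "night $night (level $level): changes since dump $base"
    baselines="$baselines $base"
    taken="$level $taken"
    night=$((night + 1))
done
baselines=${baselines# }
```

With this staggered sequence, restoring the whole week needs only the level 0 dump plus a few of the later ones, rather than every nightly tape.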

The dump program requires you to know the length and density of the media being used for the backup. In the case of devices using data compression, there is a virtual length associated with the device, which is simply the compression ratio times the actual tape length. It's usually best to be conservative here; the compression ratio is really an average—not all files compress equally. If you over-estimate the amount of compression, dump will try to write after the tape has run out and the dump will be ruined.
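As a worked example of the virtual-length arithmetic, the sketch below compares an optimistic and a conservative estimate. All of the numbers are hypothetical, the ratios are held in tenths because shell arithmetic is integer-only, and the final line only prints a dump command (using its -s length-in-feet option and omitting the density) rather than running it.

```shell
#!/bin/sh
set -eu

tape_feet=400            # hypothetical physical tape length
advertised_x10=20        # an advertised 2.0:1 compression ratio, in tenths
conservative_x10=15      # a more cautious 1.5:1 estimate, in tenths

optimistic_feet=$((tape_feet * advertised_x10 / 10))
virtual_feet=$((tape_feet * conservative_x10 / 10))

echo "optimistic virtual length:   $optimistic_feet feet"
echo "conservative virtual length: $virtual_feet feet"

# The command this would feed, printed rather than executed:
echo "dump -0u -s $virtual_feet -f /dev/nst0 /home"
```

Telling dump the smaller figure means it asks for a new tape earlier than strictly necessary, which wastes a little tape but avoids running off the end.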

dump is one of the lowest-level methods for backing up your system, but it is also one of the least fault-tolerant. Unless you understand it and your backup device very well, you may want to choose another method of backing up your system.

Commercial Backup Products

BRU

The Backup and Restore Utility (BRU) is strongly based on tar but adds many more features. It runs a daemon that manages the backup schedule. It comes in two versions, the more expensive of which supports backing up NFS disks; the less expensive version works only on local disks.

BRU also comes with a menu-driven interface in both X and ASCII. The concept of backup levels is supported, and backup targets can be any character device.

Other features include the following:

  • Keeps track of the number of uses of a tape to help you decide when to throw out older tapes.

  • Contains powerful features for recovering data from corrupt archives.

  • Has support for NIS (Network Information Services, a networked system for passwd, group, and host files, and much more).

  • Has support for SMB (Samba) and Netware volumes.

  • Supports archives that span more than one backup device.

BRU is available for Linux on the x86, Alpha, and PowerPC, and a lite version is included with the commercial version of Red Hat. See http://www.estinc.com/ for more information.

PerfectBACKUP+

PerfectBACKUP+ is widely acclaimed as the fastest backup and restore program. Previously sold as FASTBACK PLUS for UNIX, it began life as the UNIX version of DOS FASTBACK PLUS. It comes with menu-driven ASCII and Motif interfaces, networking support, compression, verification and recovery, and scheduling, and it is compatible with both tar and cpio. It will also back up a variety of network drives, including NFS, Windows, Netware, and Windows NT.

Other features include:

  • Compression.

  • Network backup devices.

  • Multiple backup devices—when one is full, it will move to the next device in the list.

  • Locking files during backup to allow safer backups in multiuser mode.

  • Support for all file types, including sockets, pipes, and so forth.

More information on PerfectBACKUP+ is available from http://home.xl.ca/perfectBackup/.

BACKUP/9000

Though BACKUP/9000 is available only in beta for Linux as of this writing, it has one feature that prompted us to include it: it is designed to work with Oracle to allow live, safe backups of Oracle's tablespaces. If you are running Oracle (for SCO using the iBCS emulation; there is no native Linux version), you will undoubtedly find this feature of great use. Beyond this, BACKUP/9000 supports what you would expect of a commercial backup and restore program:

  • Nice user interface.

  • Scheduling tool.

  • Support for local and network backups.

  • Uses tar and cpio formats.

  • Encryption.

  • Parallel backups to multiple local and/or network tape drives.

  • Backup of raw partitions and FIFO streams, in addition to normal files.

  • Multiple backups on one tape.

Backup Strategies

How often you make what kind of backup depends on several factors, including:

  • The capacity and speed of your backup device, which are particularly important characteristics for unattended backup.

  • How active your filesystems are, that is, the number of files that change per day or week. Most likely, this will vary from partition to partition.

  • Whether or not you can make live backups.

  • If you can't make live backups, how much downtime is acceptable and when it is least inconvenient.

In general, if more than twenty percent of a particular filesystem is changing daily, you should perform incremental backups every day or two, a lower-level backup weekly, and a full backup monthly. Lower activity levels mean you can space this out more, though there's really no excuse for not performing a full backup once a month.
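One way to turn that schedule into something a nightly cron job could call is to pick the dump level from the calendar. This is only a sketch: pick_level is a hypothetical helper, and the levels 0, 1, and 5 are one reasonable choice among many.

```shell
#!/bin/sh
set -eu

# Choose a dump level from the calendar.
# $1 = day of month (1-31), $2 = day of week (0 = Sunday)
pick_level() {
    if [ "$1" -eq 1 ]; then
        echo 0          # first of the month: full backup
    elif [ "$2" -eq 0 ]; then
        echo 1          # Sundays: lower-level backup
    else
        echo 5          # every other night: incremental
    fi
}

dom=$(date +%d)
dom=${dom#0}            # drop the leading zero so 08 and 09 compare cleanly
dow=$(date +%w)
echo "tonight's dump level: $(pick_level "$dom" "$dow")"
```

Each level 5 dump then holds everything changed since the most recent Sunday or first-of-month dump, so a restore needs at most three tapes.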

Another method (if you have enough tape drives) is to perform a full backup every night. Keep 28 tapes so that you have a total of four weeks' worth of data at any one time. In the event of a bug that corrupts data going back a few days (or a particular backup getting corrupted), you can restore from a tape made a few days earlier. Heavy use causes tapes to deteriorate, so cycle in a fresh set every six months or so, well before the old tapes wear out; the retired tapes can then be kept, which lets you store information for an indefinite amount of time. This comes at a higher cost, since you have to buy a lot of tapes and a single tape may not hold a full backup, but it's one of the best and easiest ways to back up and restore your data.
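Knowing which of the 28 tapes to load on a given night is simple modular arithmetic. The sketch below assumes the tapes are labeled 1 through 28; tape_for_day is a hypothetical helper name.

```shell
#!/bin/sh
set -eu

tapes=28     # four weeks of nightly full backups

# Map a day count to a tape label between 1 and 28.
tape_for_day() {
    echo $(( $1 % tapes + 1 ))
}

today=$(( $(date +%s) / 86400 ))      # whole days since the Unix epoch
echo "load tape $(tape_for_day "$today") tonight"
```

Any given tape is overwritten exactly 28 days after it was last written, so the oldest backup on hand is always four weeks old.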

Buy lots of whatever your backup media is; it's cheaper than having to pay the office for a day of doing work a second time.

For very active systems which more or less cannot be taken off-line except at 3 AM on a Sunday, it would be wise to invest in a backup utility that can perform safe live backups during periods of low traffic. You should then be able to perform unattended, incremental backups, and depending on your capacity and amount of data, unattended full backups.

More complicated databases are typically problematic when it comes to live backups. Usually, your backup utility cannot make a proper snapshot from the database's point of view. Fortunately, most databases can create a snapshot of themselves, and that can be placed in your archive. Failing this, the database will have to be taken off-line before being backed up.

You will probably want to keep some backups off-site in case of a big disaster (flood, UFO crashing into your office, dinosaur rampage, etc.). You may have to buy a new system, but at least you'll have what is likely hundreds, if not thousands, of hours of work saved.

RAID and Disk Mirroring

These are other techniques for protecting the integrity of your data. RAID stands for Redundant Array of Inexpensive Disks. It is a fairly new concept, having been introduced in 1987 at the University of California, Berkeley.

The basic idea behind RAID is that by using multiple small disks (as opposed to a few large disks) and possibly some additional "parity" data, you will be able to reconstruct data lost when one of the disks fails instead of losing a few gigabytes (albeit temporarily, assuming that you have a backup).

There are several different levels of RAID, cleverly numbered from 0 to 5. Their features are given below. Only levels 1, 4, and 5 are available for Linux, and only in software (as opposed to a hardware-level implementation).

  • 0 Data striping—This doesn't actually provide any protection against data loss, but does enhance performance. Requires a controller that supports synchronized disks; none is available for Linux.

  • 1 Disk mirroring— A second set of disks is used to provide a complete copy of your data. An expensive option, since you need twice as many disks.

  • 2 Level 0 with a check disk for storing error correction information. Poor performance has made this option unpopular.

  • 3 Level 0 with a separate disk for byte-level parity information to help reconstruct lost data. If the parity disk is lost, the data itself is intact but is no longer protected against a further disk failure. Synchronized disks help boost performance despite the bottleneck of the single-parity disk.

  • 4 Level 3 with block-level parity information stored on a separate disk. Better read performance, but worse write performance than level 3. Since synchronized disks aren't used, it can be implemented in software.

  • 5 Level 4 with the parity information spread over the disks as well. Higher performance, since a separate parity disk is no longer a potential bottleneck. Write performance is still not as good as for level 3.
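The parity used by levels 3 through 5 is an exclusive-OR (XOR) across the data disks, which is what makes reconstruction possible. The following sketch shows the idea on a single byte from each of three hypothetical disks; real implementations do the same thing over whole blocks.

```shell
#!/bin/sh
set -eu

# One byte from the same position on each of three hypothetical data disks.
d0=165   # 0xA5
d1=60    # 0x3C
d2=15    # 0x0F

parity=$(( d0 ^ d1 ^ d2 ))            # what the parity disk would store

# Suppose the disk holding d1 fails: XOR the survivors with the parity byte.
rebuilt=$(( d0 ^ d2 ^ parity ))

echo "parity byte: $parity, rebuilt byte: $rebuilt"
```

The same calculation recovers any one missing disk, but not two, which is why a second failure before the array is rebuilt loses data.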

To implement RAID on your Linux box, you need a kernel patch from http://www.linuxhq.com/patch/20-p0632.html and a 2.0.30 kernel. The patch is still in beta and supports RAID levels 1, 4, and 5.

Separately available, raidtools-0.3 can be used to create or repair a set of RAID disks.

Summary

A variety of backup and restore tools were discussed in this chapter, ranging from simple ones that ship with every Linux distribution to commercial ones with enhanced functionality and interfaces.

In general, for sites where short amounts of down-time are acceptable, the built-in tools are fine. For "hot" backups, you will probably have to resort to commercial means.
