Chapter 2. Backing It All Up

Now that the philosophy lessons of Chapter 1 are over, it’s time to look at some of the important concepts behind backup and recovery, such as what to include, when to back it up, and more.

Don’t Skip This Chapter!

The casual reader might assume that this chapter is an introduction to basic backup concepts. While that is, in fact, the purpose of this chapter, it is also true that many seasoned administrators are unfamiliar with the ideas presented here. One reason for this is that administrators find themselves constantly being pulled away from “mundane” activities like backups for things that are thought to be more “important,” such as installing new servers and figuring out why systems are running slowly. Also, administrators may go several years without ever needing to perform a restore. The need to use your backups on a regular basis would undoubtedly change your ideas about their importance.

I wrote this book because backup and recovery has been my primary area of emphasis for several years, and I would like to share the lessons I’ve learned from this focused activity. This chapter provides an overview of how your backups should work. It also explains many basic yet extremely important concepts upon which any good backup plan should be based, and upon which all implementations discussed in this book are based.

The Impossible Job That No One Wants

Would anyone reading this book say that losing data is OK? I don’t believe so. Then why do we treat backups so lightly? Sometimes I feel like Rodney Dangerfield when I’m arguing for better backups—“I tell ya, I don’t get no respect, no respect.” Backups often aren’t considered during systems design. When a new server is purchased, does anyone ask for the impact on the current backup methodology? Some IT departments do not even have control over the purchase of new systems, because they are sometimes bought by other cost centers. Have you ever tried to explain to another department manager why his terabyte-sized database server isn’t going to get backed up to the standalone, gigabyte-sized tape drive that came with it?

Another often-overlooked issue is backup personnel. Have you ever tried to find the person in charge of backups? It’s often an extra duty that gets passed around, in a manner similar to the way my sister and brother and I argued over whose turn it was to wash the dishes. If you are lucky enough to have a dedicated person, it’s usually the most junior person in the company. I know, because that’s how I got my first job. In fact, that’s how many people get their first jobs. How can we give such low priority to something so important? Perhaps we should change that. Will one book change this long-standing hiring tradition? Probably not, but maybe it will help. At the very least, if the person in charge of backups has this book, that person has a complete guide to accomplishing the immense task that lies ahead.

What’s the big deal, you say? With modern computer systems and reliable disk drives, why are backups still so important? Because computers still go down, that’s why. Also, companies are placing more reliance than ever on computers functioning reliably. I don’t care how good your Unix vendor is or how reliable your disk drives are or even if you have Dogbert himself as your network administrator, systems go down. Murphy’s Law thrives in computer systems. Not only will your computer systems go down occasionally, but they will do so at the time most inconvenient to you and your customers. At that moment, and that moment will come, it is the job of the backup person to replace the data on the disk or disks that have stopped the show. “How long will it take?” is a typical question. The only acceptable response is “it’s already done.”

Who wants to be the person who messed up the restore and caused the customer database to be offline for three extra hours? Who wants to be the person who has to send a memo to the entire company saying that any purchase orders entered in the last two days have to be reentered? Who wants to be the person who has that in mind every day as they are checking the results of last night’s backups? If you do your job well, and no data is lost, you are just doing what you’re supposed to do. If you mess up, you’re in big trouble. Who wants that job? No one, that’s who.

You’re reading this book because you’ve got the impossible job that nobody wants. Whether you’ve been doing it for a while or have just started down the backup road, you can see that the task that lies ahead is immense. The volume of data is tremendous, the nature of the data changes constantly, and the utilities at your disposal never seem to be up to the job. I know because I’ve been there. I’ve spent months trying to implement “solutions” from operating systems and database products that weren’t ready. I’ve seen companies spend money on expensive commercial utilities, only to buy the wrong utility for their application. I’ve watched newer and bigger servers roll in the door without a single backup drive among them. I’ve also spent long nights and weekends in computer rooms trying to recover data in a “reasonable” amount of time. Unfortunately, “reasonable” is defined by the end user who has no idea how difficult this job is.

There are now solutions to almost every backup problem out there. If you run a small shop with just a few systems, all of which run the same operating system, there’s a solution for you. If you work in a huge shop with hundreds of boxes in the various flavors of Unix, Linux, Windows, and Mac OS, or just a few multiterabyte databases, there’s a solution for you. The biggest part of the problem is misinformation. Most people simply do not know what is available, so they either suffer without a solution or settle for an inferior one—usually the one with the best salesperson. The six important questions that you have to continually ask yourself and others are why, what, when, where, who, and how:

Why?

Why are you protecting yourself against disaster? Does it really matter if you lose data? What will the losses be? What different types of data do you have, and what is the value of each type?

What?

What are you going to back up, the entire box or just selected drives or filesystems? What operating systems are you going to back up? What else, besides normal drives or filesystems, should be included in a backup?

When?

When is the best time to back up your system? How often should you do a full backup? When should you do an incremental backup?

Where?

Where will the backup occur? Where is the best place to store the backup volumes?

Who?

Who is going to provide the hardware, software, and installation services to put this system together?

How?

How are you going to accomplish it? There are a number of different ways to protect yourself against loss. Investigate the different methods, such as off-site storage, replication, mirroring, RAID, and the various levels of protection each provides. (Each of these topics is covered in detail in later sections of this book.)

Deciding Why You Are Backing Up

If you can’t answer this question, there’s really no point in moving forward. The good thing is that it is a really easy question to answer. Just think about all of the various things that can happen to your data, and then look at all of the types of data that you have. You should be familiar with each business unit that creates data, and how that business unit would be affected if that data was lost or damaged. All of this becomes your business justification for moving forward.

Deciding What to Back Up

Experience shows that one of the most common causes of data loss is that the lost data was never configured to be backed up. The decision of what to back up is an important one.

Plan for the Worst

When trying to decide what files to include in your backups, take the most pessimistic technical person in your company out to lunch. In fact, get a few of them together. Ask them to come up with scenarios that you should protect against. Use these scenarios in deciding what should be included, and they will help you plan the “how” section as well. Ask your guests: “What are the absolute worst scenarios that could cause data loss?” Here are some possible answers:

  • An entire system catches fire and melts to the ground, leaving an unrecognizable mass of molten metal and blackened, smoking plastic.

  • Since this machine was so important, you had it replicated to another node right next to it. Of course, that machine catches fire right along with this one.

  • You have a centralized server that controls all backups and keeps a record of backup volume locations and what files are on what volumes, and so on. The server that blew up sits right next to this “backup server,” and the intense heat takes this system with it.

  • The disastrous chain reaction continues, taking out your DHCP and Active Directory servers, the NIS master server, the NFS and CIFS home directory servers, and the database server where you house the inventory of all your backup volumes with their respective locations. This computer also holds the telephone database listing all service agreements, vendor telephone numbers, and escalation procedures.

  • You haven’t memorized the number to your new off-site storage vendor yet, so it’s taped to the wall next to your backup server. You realize, of course, that the flames just burned that paper beyond recognition.

  • The flames set off the sprinkler system, and water pours all over your backup volumes. Man, are you having a bad day....

What do you do if one of these scenarios actually happens? Do you even know where to start? Do you know:

  • What volume contains last night’s backup?

  • Where you stored it?

  • How to get in touch with the off-site storage vendor to retrieve the copies of your backup volumes? And once you find that out, whether your server and network equipment will be available to recover?

  • Who to call to get replacement equipment at 2:00 a.m. on a Saturday?

  • What the network looked like before all the wires melted?

First, you need to recover your backup server, because it has all the information you need. OK, so now you found the backup company’s card in your wallet, and you’ve pulled back every volume they had. Since your media database is lost, how will you know which one has last night’s backup on it? Time is wasting....

All right, you’ve combed through all the volumes, and you’ve found the one you need to restore the backup server (easier said than done!). Through your skill, cunning, and plenty of help from tech support, you restore the thing. It’s up and running. Now, how many disks were on the systems that blew up? What models were they? How were they partitioned? Weren’t some of them striped together into bigger volumes, and weren’t some of them mirroring one another? Where’s that information stored? Do you even know how big the drives or filesystems were? Man, this is getting complicated....

Didn’t you just install that big jumbo kernel patch last week on three of these systems? (You know, the one that stopped all those network broadcast storms that kept bringing your network down in the middle of the day.) You did make a backup of the kernel after you did that, didn’t you? Of course, the patch also updated files all over the OS drive. You made a full backup, didn’t you? How will you restore the operating system drive, anyway? Are you really going to go through the process of reinstalling the operating system just so you can run the restore command and overwrite it again?

Filesystems aren’t picky about size, as long as you make them big enough to hold the data that you restore to them, so it’s not too hard to get those filesystems up and running. But what about the database? It was using raw partitions. You know it’s going to be much pickier. It’s going to want /dev/rdsk/c7t3d0s7, /dev/dsk/c8t3d0s7, and /dev/dsk/c8t4d0s7 right where they were and partitioned just as they were before the disaster. They also need to be owned by the database user. Do you know which drives were owned by that user before the crash? Which disks were those again?

It could happen.

Tip

Part IV covers these Catch-22 situations.

Take an Inventory

Make sure you can access essential information in the event of a disaster:

Backups for your backups

Many companies have begun to centralize control of their backups, which I think is a good thing. However, once you centralize storage of all your backup information, you have a single point of failure for your entire backup plan. You can’t restore the backup server because you don’t have the database of your backups. You don’t have the database of your backups because you need to restore your backup server. Restoring this server would be the first step in recovering from any multisystem outage. For things like media inventory, don’t underestimate the value of an inventory printed on paper and stored off-site. That paper may just get you out of this Catch-22. Given the single-point-of-failure factor, the recovery of your backup server should be the easiest and best-documented recovery that you have. You may even want to investigate creating a special tar, ntbackup, or rsync backup of that data to make it even easier to recover during a disaster.
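
A minimal sketch of such a special backup is a nightly cron-driven script like the one below. The catalog path and destination host are hypothetical placeholders; substitute your backup product’s actual database location (and its documented export command, if it has one):

    #!/bin/sh
    # save-catalog.sh -- a minimal sketch; paths and hosts are hypothetical.
    # Copy the backup server's media database to another host every night,
    # keeping one copy per day of the week.
    CATALOG=/opt/backupdb            # assumption: your product's catalog dir
    DEST=admin@othersite             # assumption: a host in another building
    DAY=`date +%a`                   # Mon, Tue, ...
    tar cf - $CATALOG | ssh $DEST "cat > /catalog-copies/backupdb.$DAY.tar"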

What peripheral devices did you have?

Assuming you back up your disk drive configuration on a regular basis, you might have a list of all the disk drives, but do you know what models they are? If you have all Brand-X 500 GB drives, you have no problem, but many servers have a mixture of drives that were installed over time. You may have a collection of 40 GB, 100 GB, and 500 GB drives, all on the same system. Make sure that you are recording this in some way. Unix and Mac OS systems record this information in the messages file, and Windows stores it in the registry, so hopefully you’re backing those up.
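
If you want to automate the inventory, a rough sketch follows; the commands are platform-specific, so treat these as examples rather than a recipe:

    #!/bin/sh
    # disk-inventory.sh -- a rough sketch; commands vary by platform.
    # Record each drive's vendor, model, and size, then get that file
    # off the system (print it, or copy it off-site).
    OUT=/var/adm/disk-inventory.txt
    case `uname -s` in
    SunOS) iostat -En > $OUT ;;                 # vendor/product/size
    Linux) lsblk -o NAME,MODEL,SIZE > $OUT ;;   # newer Linux systems
    esac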

How were they partitioned?

This one can really get you, especially if you have to restore the operating system drive or a database drive. Both drives are typically partitioned with custom partitions that must be repartitioned exactly the same as before for a proper restore to occur. Typically, this partition information is not saved anywhere on the system, so you must do something special to record it. On a Solaris system, for example, you can run a prtvtoc on each drive and save that to a file. Search on the Internet for scripts for capturing this information; a number of such free utilities exist.
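
A minimal version of such a script for Solaris might look like this (it assumes the conventional slice 2, which covers the whole disk, exists on each drive):

    #!/bin/sh
    # save-vtocs.sh -- Solaris sketch: save every drive's partition table.
    mkdir -p /var/adm/vtocs
    for disk in /dev/rdsk/c*t*d*s2
    do
        prtvtoc $disk > /var/adm/vtocs/`basename $disk` 2>/dev/null
    done
    # Now copy /var/adm/vtocs somewhere that survives the loss of this host.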

How were your volume managers configured?

A number of operating system-specific volume managers are available, including Veritas Volume Manager, Windows Dynamic Drives, Solstice (Online) Disk Suite, and HP’s Logical Volume Manager. How is yours configured? What devices are mirrored to what? How are your multidisk devices set up? Unbelievably, this information is not always captured by normal backup utilities. In fact, I used Logical Volume Manager for months before hearing about the lvmcfgbackup command (it backs up the LVM’s configuration information). Sometimes if you have this properly documented, you may not need to restore at all. For example, if the operating system disk crashes, simply put the disks back the way they were and then rebuild the stripe in the same order, and the data should be intact. I’ve done this several times.
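
Whichever volume manager you run, capture its configuration in a human-readable form and keep a copy off the system. Here is a hedged sketch; the commands and paths differ per product, so adjust for yours:

    #!/bin/sh
    # save-vm-config.sh -- a hedged sketch; commands differ per volume manager.
    OUT=/var/adm/volume-config.txt
    if [ -x /usr/sbin/vxprint ]; then        # Veritas Volume Manager
        vxprint -ht > $OUT
    elif [ -x /usr/sbin/vgdisplay ]; then    # HP-UX or Linux LVM
        vgdisplay -v > $OUT
    fi
    # Print this file and store a copy off-site with your other records.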

How are your databases set up?

I have seen many database outages. When I ask a database administrator (DBA) how her database is set up, the answer is almost always, “I’m not sure....” Find out this information, and record it up front.

Did you document how you set up DHCP, Active Directory, NFS, and CIFS?

Document, document, document! There are a hundred reasons to properly document things like this, and recovery from a disaster is one of them. Good documentation is definitely part of the backup plan. It should be regularly updated and available. No one should be standing around saying “I haven’t set up NIS/AD/NFS from scratch in years. How do you do that again? Has anyone seen my copy of O’Reilly’s book?” Actually, the best way to do this is to automate the creation of new servers. If your operating system supports it, take the time to write scripts that automatically install various services, and configure them for your environment. Put these together in a toolkit that is run every time you create a new server. Better yet, see if your OS vendor has any products that automate new server installations, such as Sun’s Jumpstart, HP’s Ignite-UX, Linux Kickstart, and Mac OS cloning features.

Do you have a plan for this?

The reason for describing the earlier horrible scenarios is so you can start planning for them now. Don’t wait until there’s 20 feet of snow in your front yard before you start shopping for a snow shovel! It’s going to snow; it’s only a question of when. Take those pessimists out to lunch, let them dream of the worst things that could happen, and then plan for them. Have a fully documented, step-by-step plan for the end of the computer world as you know it. Even if the plan needs a little modification when you actually have to use it, you will be glad you have a starting point. It’s a whole lot better than standing around saying, “What do we do now? Has anyone seen my résumé?” (You did keep a hardcopy of it, right?)

Know what’s on your boxes!

The best insurance against almost any kind of loss is for the backup/recovery person to be familiar with the systems he is protecting. If a particular server goes down, you should know immediately that it contains an Oracle or SQL Server database and that special backups should be running for those volumes. That way, the moment the server is ready for a restore, so are you. Become very involved in the installation of any new system or database. You should know what database platforms you are using and how they are set up. You should know about any new drives, filesystems, databases, or systems. You need to be very familiar with every box, what it does, and what’s on it. This information is vital so that you can include any special backups for that type of system.

Are You Backing Up What You Think You’re Backing Up?

I remember an administrator at one of my previous employers who used to say, “Are we getting this on tape?” He always said it with his trademark smirk, and it was his way of saying “Hi” to the backup guy. His question makes a point. There are some global ways that you can approach backups that may drastically improve their effectiveness. Before we examine whether to back up part or all of the system, let us examine the common practice of using include lists and why they are dangerous. Also, let’s consider some of the ways that you can avoid using include lists.

What are include and exclude lists? Generically speaking, there are two ways to back up a system:

  • You can tell your backup system to back up everything, except what is in an exclude list. For example:

    For Unix, Linux, and Mac OS servers:

        Include: *
        Exclude: /tmp, /junk1, /junk2

    For Windows servers:

        Include: *
        Exclude: *.tmp, *Temporary Internet Files*, ~*.*, *.mp3

  • You can tell your backup system to back up what is in an include list. For example:

    For Unix, Linux, and Mac OS servers:

        Include: /data1, /data2, /data3

    For Windows servers:

        Include: D:, E:

Looking at these examples, ask yourself what happens when you create /data4 or the F: drive? Someone has to remember to add it to the include list, or it will not be backed up. This is a recipe for disaster. Unless you’re the only one who adds drives or filesystems and you have perfect memory, there will always be a forgotten drive or filesystem. As long as there are other administrators and there is gray matter in your head, something will be left out.

However, unless your backup utility supports automated drive or filesystem discovery, it takes a little effort to say, “Back up everything.” How do you make the list of what systems, drives, filesystems, and databases to back up? What you need to do is look at files such as /etc/vfstab or the Windows registry and parse out a list of drives or filesystems to back up. You can then use exclude lists to exclude any drives or filesystems you don’t want backed up.
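
As a sketch of the idea, the following parses a Solaris-style /etc/vfstab (where field 3 is the mount point and field 4 the filesystem type) and removes anything listed in a hypothetical one-entry-per-line exclude file:

    #!/bin/sh
    # list-filesystems.sh -- sketch for a Solaris-style /etc/vfstab.
    EXCLUDES=/etc/backup.excludes        # hypothetical exclude file
    [ -f $EXCLUDES ] || touch $EXCLUDES
    # Print every local ufs mount point, then drop the excluded ones.
    awk '$4 == "ufs" {print $3}' /etc/vfstab | grep -v -x -f $EXCLUDES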

Oracle has a similar file in Unix, called oratab, which can be used to list all Oracle instances on your server.[1] Windows stores this information in the registry, of course. You can use oratab to list all instances that need backing up. Unfortunately, Informix and Sybase databases have no such file unless you manually make one. I do recommend making such a file for many reasons. It is much easier to standardize system startup and backups when you have such a file. If you design your startup scripts so that a database does not get started unless it is in this file, you can be reasonably sure that any databases that anyone cares about will be in this file. This means, of course, that any important databases are backed up without any manual intervention from you. It also means that you can use the same Informix and Sybase startup scripts on every system, instead of having to hardcode each database’s name into the startup scripts.
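
The oratab format is SID:ORACLE_HOME:Y|N, where the third field says whether the instance is started at boot, so listing the instances to back up can be as simple as this sketch:

    #!/bin/sh
    # list-oracle-instances.sh -- print every instance marked "Y" in oratab.
    ORATAB=/etc/oratab                   # /var/opt/oracle/oratab on Solaris
    grep -v '^#' $ORATAB | awk -F: '$3 == "Y" {print $1}'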

How do you know what systems to back up? Although I never got around to it, one of the scripts I always wanted to write was a script that monitored the various host databases, looking for new systems. I wanted to get a complete list of all hosts from Domain Name System (DNS) and compare it against a master list. Once I found a new IP address, I would try to determine if the new IP address was alive. If it was alive, that would mean that there was a new host that possibly needed backing up. This would be an invaluable script; it would ensure there aren’t any new systems on the network that the backups don’t know about. Once you found a new IP address, you could use nmap to find out what type of system it is. nmap sends a series of crafted (and deliberately malformed) packets to the IP address, and the responses to those probes reveal which operating system the host is running.
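
If you want to experiment with the idea, here is a very rough sketch. It assumes your name server allows zone transfers from this host; the zone name, name server, and master-list path are all hypothetical:

    #!/bin/sh
    # find-new-hosts.sh -- a rough sketch of the idea, not a polished tool.
    ZONE=example.com                     # hypothetical zone
    NS=ns1.example.com                   # hypothetical name server
    MASTER=/etc/backup.hosts             # known hosts, one FQDN per line
                                         # (trailing dot, to match dig output)
    sort -u $MASTER > /tmp/known.$$
    # Pull every A record from DNS and keep just the hostnames.
    dig @$NS $ZONE axfr | awk '$4 == "A" {print $1}' | sort -u > /tmp/dns.$$
    # Anything in DNS but not in the master list is a candidate new host.
    for host in `comm -23 /tmp/dns.$$ /tmp/known.$$`
    do
        echo "New host: $host -- probing..."
        nmap -O $host                    # OS detection requires root
    done
    rm -f /tmp/dns.$$ /tmp/known.$$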

Tip

Some commercial data protection management software packages now support this functionality.

Back Up All or Part of the System?

Assuming you’ve taken care of the things that normal system backups don’t cover, you are now in a position to decide whether you are going to back up your entire systems or just selected drives or filesystems from each system. These are definitely two different schools of thought. As far as I’m concerned, there are too many gotchas in the selected-filesystem option. Backing up everything is easier and safer than backing up from a list. You will find that most books stop right there and say “It’s best to back up everything, but most people do something else.” You will not see those words here. I think that not backing up everything is very dangerous. Consider the following comparison between the two methods.

Backing up only selected drives or filesystems

Here are the arguments for and against selective backups.

Save media space and network traffic.

The first argument that is typically stated as a plus to the selected-filesystem method is that you back up less data. People of this school recommend having two groups of backups: operating system data and regular data. The idea is that the operating system backups would be performed less often. Some would even recommend that they be performed only when you have a significant change, such as Windows security patches, an operating system upgrade, a patch installation, or a kernel rebuild. You would then back up your “regular” data daily.

The first problem with this argument is that it is outdated; just look at the size of the typical modern system. The operating system/data ratio is now significantly heavier on the data side. You won’t be saving much space or network traffic by not backing up the OS even on your full backups. When you consider incremental backups, the ratio gets even smaller. Operating system partitions have almost nothing of size that would be included in an incremental backup, unless it’s something important that should be backed up! This includes Unix, Linux, and Mac OS files such as /etc/passwd, /etc/hosts, syslog, /var/adm/messages, and any other files that would be helpful if you lost the operating system. It also includes the Windows registry. Filesystem swap is arguably the only completely worthless information that could be included on the OS disk, and it can be excluded with proper use of an exclude list.

Harder to administer.

Proponents of piecemeal backup would say that you can include important files such as the preceding ones in a special backup. The problem with that is it is so much more difficult than backing up everything. Assuming you exclude configuration files from most backups, you have to remember to do manual backups every time you change a configuration file or database. That means you have to do something special when you make a change. Special is bad. If you just back up everything, you can administer systems as you need to, without having to remember to back up before you change something.

Easier to split up between volumes.

One of the very few things that could be considered a plus is that if you split up your drives or filesystems into multiple backups, it is easier to split them between multiple volumes. If a backup of your system does not fit on one volume, it is easier to automate it by splitting it into two different include lists. However, in order to take advantage of this, you have to use include lists rather than exclude lists, and then you are subject to the limitations discussed earlier. You should investigate whether your backup utility has a better way to solve this problem.

Easier to write a script to do it than to parse out the fstab, oratab, or Windows registry.

This one is hard to argue against. However, if you do take the time to do it right the first time, you never need to mess with include lists again. This reminds me of another favorite phrase of mine: “Never time to do it right, always time to do it over.” Take the time to do it right the first time.

The worst that happens? You overlook something!

In this scenario, the biggest benefits are that you save some time spent scripting up front, as well as a few bytes of network traffic. The worst possible side effect is that you overlook the drive or filesystem with your boss’s budget that just got deleted.

Backing up the entire system

The pros for backing up the entire system are briefer yet far more compelling:

Complete automation.

Once you go through the trouble of creating a script or program that works, you just need to monitor its logs. You can rest easy at night knowing that all your data is being backed up.

The worst that happens? You lose a friend in the network department.

You may increase your network traffic by a few percentage points, and the people looking after the wires might not like that. (That is, of course, until you restore the server where they keep their DNS source database.)

Backing up selected drives or filesystems is one of the most common mistakes that I find when evaluating a backup configuration. It is a very easy trap to fall into because of the time it saves you up front. Until you’ve been bitten, though, you may not know how much danger you are in. If your backup setup uses include lists, I hope that this discussion convinces you to rethink that decision.

Deciding When to Back Up

This might appear to be the most straightforward topic. Everybody backs up their system every night, right? What’s the big deal? Actually, this could more aptly be titled “What levels do I run when?” It’s always a big question. How often do you run a full backup? How often do you run incremental backups? Do you run various levels of incrementals that back up just today’s changes, or cumulative incremental backups that back up everything since the last full backup? Everyone has her own answers to these questions. The only thing that is a definite is that there should be at least some level of backup every night. Before any further discussion on the topic, let’s define some terms.

Backup Levels

The following are various backup levels. These terms are not used the same way by everyone.

Full/Level 0

A full backup.

Level 1

An incremental backup that backs up everything that has changed since the last level 0 backup. Repeated level 1 backups still back up everything since the last full/level 0 backup.

Levels 2–9

Each level backs up whatever has changed since the last backup of the next-lowest level. That is, a level 2 backs up everything that changed since a level 1, or since a level 0, if there is no level 1. With some products, repeated level 9 backups back up only things that have changed since the last level 9 backup, but this is far from universal.

Incremental

Usually, an incremental backup backs up anything that has changed since the last backup of any type.

Differential

Most people refer to a differential as a backup that backs up everything that has changed since the last full backup, but this is not universal. In Windows, a differential is a backup that does not clear the archive bit. Therefore, if you run a full backup followed by several differential backups, they act like differential backups in the traditional sense. However, if you run even one incremental backup in Windows, it clears the archive bit, and the next differential backup backs up only those files that have changed since the last incremental backup. That’s why a differential backup is not synonymous with a level 1 backup.

Cumulative incremental

I prefer this term to differential, and it refers to a backup that backs up all files that have changed since the last full backup.
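
To make the levels concrete, here is how a level 0 and a level 1 look with the classic Unix dump utility (called ufsdump on Solaris); the u option records the backup in /etc/dumpdates so later levels know what to reference:

    # Sunday: level 0 (full) backup of /home to tape
    dump 0uf /dev/rmt/0 /home

    # Weeknights: level 1 backs up everything changed since the last
    # level 0, no matter how many times you repeat it
    dump 1uf /dev/rmt/0 /home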

Tip

Backup products and backup administrators do not agree on these definitions. Make sure you know what your product means when it uses one of these terms!

A question that I am often asked is, “You want me to back up every night?” What the question really means is, “Even on the weekend?” Nobody’s working on the weekend, right? Right...except for your noisiest customer last weekend. You know the customer I’m talking about: the one who calls your boss instead of the help desk when there’s a problem. And if your boss isn’t in or doesn’t fix the problem fast enough, this customer will call your boss’s boss. Well, last weekend this customer was really behind, so she spent the entire weekend at work, working around the clock on next year’s budget. She finally got it straightened out at about 1:00 a.m. Monday. At around 4:00 a.m., the disk where her home directory resides stopped working. (Everything dies Monday morning, doesn’t it?) You haven’t run a backup since Friday night. Your phone is ringing, and it’s your boss. Any guesses as to what he wants to talk to you about? Do you want to be the one to tell this customer that you could have saved her file, but you don’t run backups on the weekend?

Which Levels Do You Run and When?

There are several schools of thought on this question. The following are some suggested backup schedules.

Weekly schedule: All full/level 0 backups

Table 2-1 contains a backup schedule for the paranoid (not that paranoid is a bad thing). It performs a level 0 backup every day onto a separate volume. (Please don’t overwrite yesterday’s good level 0 backup with today’s possibly corrupt level 0 backup!) If your system is really small, this schedule might work for you. If you have systems of any reasonable size, though, this schedule is not very scalable. It’s also really not that necessary with today’s commercial backup software systems.

Table 2-1. All full backups
Sunday    Monday    Tuesday   Wednesday  Thursday  Friday    Saturday
Full/0    Full/0    Full/0    Full/0     Full/0    Full/0    Full/0

Weekly schedule: Weekly full, daily differentials/level 1s

The advantage to the schedule in Table 2-2 is that throughout most of the week, you would only need to restore from two volumes—the level 0 and the most recent differential/level 1. This is because each differential/level 1 backs up all changes since the full backup on Sunday. Another advantage of this type of setup is that you get multiple copies of files that are changed early in the week. This is probably the best schedule to use if you are using simple utilities such as dump, tar, or cpio, because they require you to do all the volume management. A two-volume restore is much easier than a six-volume restore—trust me!

Table 2-2. Weekly full backups, daily differentials/level 1s
Sunday    Monday    Tuesday   Wednesday  Thursday  Friday    Saturday
Full/0    Diff/1    Diff/1    Diff/1     Diff/1    Diff/1    Diff/1

Weekly schedule: Weekly full, daily leveled backups

If your backup product supports multiple levels, you can use the schedule shown in Table 2-3. The advantage to this schedule is that it takes less time and uses less media than the preceding schedule. There are two disadvantages to this plan. First, each changed file gets backed up only once, which leaves you very susceptible to data loss if you have any media failures. Second, you would need six volumes to do a full restore on Friday. The latter is really not a problem, though, if you’re using a good open-source or commercial backup utility, because these utilities do all the volume management for you, including swapping tapes with an auto-changer.

Table 2-3. Weekly full backups, daily leveled backups
Sunday    Monday    Tuesday   Wednesday  Thursday  Friday    Saturday
Full/0    1         2         3          4         5         6
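
As a sketch, the schedule in Table 2-3 could be driven by crontab entries like these, using dump-style levels (the device name is an example; a commercial utility would have its own scheduler):

    # Table 2-3 as a crontab: one level per night at 23:00.
    # Day-of-week field: 0 is Sunday.
    0 23 * * 0 /usr/sbin/ufsdump 0uf /dev/rmt/0n /home
    0 23 * * 1 /usr/sbin/ufsdump 1uf /dev/rmt/0n /home
    0 23 * * 2 /usr/sbin/ufsdump 2uf /dev/rmt/0n /home
    0 23 * * 3 /usr/sbin/ufsdump 3uf /dev/rmt/0n /home
    0 23 * * 4 /usr/sbin/ufsdump 4uf /dev/rmt/0n /home
    0 23 * * 5 /usr/sbin/ufsdump 5uf /dev/rmt/0n /home
    0 23 * * 6 /usr/sbin/ufsdump 6uf /dev/rmt/0n /home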

Monthly schedule: Monthly full, daily Tower of Hanoi incrementals

One of the most interesting ideas that I’ve seen is called the Tower of Hanoi (TOH) backup plan. It’s based on an ancient mathematical progression puzzle by the same name. The game consists of three pegs and a number of different-sized rings inserted onto those pegs. A ring may not be placed on top of a ring with a smaller radius. The goal of the game is to move all of the rings from the first peg to the third peg, using the second peg for temporary storage when needed.[2]

A goal of most backup schedules is to put changed files on more than one volume while reducing total volume usage. The TOH accomplishes this better than any other schedule. If you use a TOH progression for your backup levels, most changed files are backed up twice—but only twice. Here are two different versions of the progression (they’re related to the number of rings on the three pegs, by the way):

0 3 2 5 4 7 6 9 8 9
0 3 2 4 3 5 4 6 5 7 6 8 7 9 8

These mathematical progressions are actually pretty easy. Each consists of two interleaved series of numbers (e.g., 2 3 4 5 6 7 8 9 interleaved with 3 4 5 6 7 8 9). Table 2-4 uses a schedule to illustrate how this works.

Table 2-4. Basic Tower of Hanoi schedule
Sunday    Monday    Tuesday   Wednesday  Thursday  Friday    Saturday
0         3         2         5          4         7         6

It starts with a level 0 (full) backup on Sunday. Suppose that a file is changed on Monday. The level 3 on Monday would back up everything since the level 0, so that changed file would be included on Monday’s backup. Suppose that on Tuesday we change another file. Then on Tuesday night, the level 2 backup must look for a level that is lower, right? The level 3 on Monday is not lower, so it references the level 0 also. So the file that was changed on Monday, as well as the file that was changed on Tuesday, is backed up again. On Wednesday, the level 5 backs up only what changed that day, because it references the level 2 on Tuesday. But on Thursday, the level 4 does not reference the level 5 on Wednesday; it references the level 2 on Tuesday.

Note that the file that changed on Tuesday was backed up only once. To get around this problem, we use a modified TOH progression, dropping down to a level 1 backup each week, as shown in Table 2-5.

Table 2-5. Monthly Tower of Hanoi schedule
Day of the week   Week one   Week two   Week three   Week four
Sunday            0          1          1            1
Monday            3          3          3            3
Tuesday           2          2          2            2
Wednesday         5          5          5            5
Thursday          4          4          4            4
Friday            7          7          7            7
Saturday          6          6          6            6

If it doesn’t confuse you and your backup methodology,[3] and if your backup system supports it, I recommend the schedule depicted in Table 2-5. Each Sunday, you get a complete incremental backup of everything that has changed since the monthly full backup. During the rest of the week, every changed file is backed up twice—except for Wednesday’s files. This protects you from media failure better than any of the schedules mentioned previously. You will need more than one volume to do a full restore, of course, but this is not a problem if you have a sophisticated backup utility with volume management.

“In the Middle of the Night...”

This phrase from a Billy Joel song indicates the usual best time to do backups. Backups should be scheduled in such a way that they do not run during normal business hours. Sometimes you cannot avoid it, but it should not be a regular occurrence. There are two main reasons for this:

Integrity

Unless you work in a 24/7 shop, nighttime is the time when the files are the most stable. (Of course, there could be batch jobs running that are manipulating data and customers accessing your web site, so not all files will be stable.) If you are backing up during the day, files are changing and probably also are open. Open files are more difficult to back up. Some backup packages handle open files better than others, but some cannot back them up at all. Also, if the file is changing throughout the day, you will not be sure what version you actually get on your backup.

Speed

Another reason for not doing backups during the day is that the network is much busier, hence slower, during the day. The throughput of your backups slows significantly when your network is being used for normal traffic. If this is a problem at night as well, you might consider using a special network just for your backups. Doing backups during the day can significantly affect the speed of your other applications, and it is not good practice to regularly slow down your systems while people are using them.

Tip

Of course, in today’s global and Internet economy, “night” is relative. If you are in a shop in which the systems are accessed 24/7, you have to do things quite differently. You may want to look at Chapter 8 to see what vendors are doing to help meet this type of challenge.

Deciding How to Back Up

Once you’ve decided when you’re going to back up, you have to decide how you are going to back up the data. But first, look at what types of problems you are protecting yourself from.

Be Ready for Anything: 10 Types of Disasters

As stated earlier, how you want to do your restores determines how you want to do your backups. One of the questions that you must ask yourself is, “What are you going to protect yourself from?” Are the users in your environment all “power users” who use their computers intelligently and never make dumb mistakes? Would your company lose a lot of essential data if the files on your users’ PCs were accidentally deleted? If a hurricane took out your whole company, would it be able to continue doing business? Make sure that you are aware of all the potential causes for data loss, and then make sure your backup methods are prepared for all of them. The most exhaustive list of potential causes of data loss that I have seen is in another O’Reilly book called Practical Unix and Internet Security by Simson Garfinkel and Gene Spafford. Their list, with my comments attached, follows:

User error

This has been, by far, the cause of the biggest percentage of restores in every environment that I have seen. “Hey, I was sklocking my flambality file, and I accidentally pressed the jankle button. Can you restore it, please?” This one is pretty easy, right? What about the common question: “Can you restore it as of about an hour ago?” You can do this with continuous data protection systems and snapshots, but not if you’re running backups once a night.

System-staff error

This is less common than user error (unless your users have root or administrator privileges), but when it happens, oh boy, does it happen! What happens when you newfs your database’s raw device or delete a user’s document folder? These restores need to go really fast, because they’re your fault. As far as protecting yourself from this type of error, the same is true here as for user errors: either typical nightly backups or snapshots can protect you from it.

Hardware failure

Most books talk about protecting yourself from hardware failure, but they usually don’t mention that hardware failure can come in two forms: disk drive failure and systemwide failure. It is important to mention this because it takes two entirely different methods to protect yourself from these failures. Many people do not take this into consideration when planning their data protection plan. For example, I have often heard the phrase, “I thought that disk was mirrored!” when a drive or filesystem is corrupted by a system panic. Mirroring does not protect you from a systemwide failure. As a friend used to say, if the loose electrons floating around your system decide to corrupt a drive or filesystem when your system goes down, “mirroring only makes the corruption more efficient.” Neither do snapshots protect you from hardware failure—unless you have the snapshot on a backup volume.

Disk drive failure

Protecting your systems from disk drive failure is relatively simple now. Your only decision is how safe you want to be. Mirroring, often referred to as RAID 1, offers the best protection, but it doubles the cost of your initial drive and controller hardware investment. That is why most people choose one of the other levels of Redundant Arrays of Independent Disks (RAID), the most popular being RAID 5, with RAID 6 gaining ground. RAID 5 volumes protect against the loss of a single drive by calculating and storing parity information on each drive. RAID 6 adds more protection by storing parity twice, thus allowing for the failure of more than one drive.

Systemwide failure

Most of the protection against systemwide failure comes from good system administration procedures. Document your systems properly. Use your system logs and any other monitoring methods you have at your disposal to watch your systems closely. Respond to messages about bad disks, controllers, CPUs, and memory. Warnings about hardware failures are your chance to correct problems before they cause major disasters. Another method of protecting yourself is to use a journaling filesystem. Journaling treats the filesystem much like a database, keeping track of committed and partially committed writes to the filesystem. When a system is coming up, a journaling filesystem can roll back partially committed writes, thus “uncorrupting” the filesystem.

Tip

The Windows change journal does not make NTFS a journaling filesystem in this sense. It contains only a list of files that have been changed; it does not actually contain the changes. Therefore, it cannot roll back any changes.

Software failure

Protecting yourself from software failure can be difficult. Operating system bugs, database bugs, and system management software bugs can all cause data loss. Once again, the degree to which you protect yourself from these types of failures depends on which type of backups you use. Frequent snapshots or continuous data protection systems are the only way to truly protect against losing data, possibly a lot of data, from software failure.

Electronic break-ins, vandalism, and theft

There have been numerous incidents of this in the past few years, and many have made national news. If you do lose data due to any one of these, it’s very different from other types of data loss. While you may recover the data, you can never be sure of what happened to the data while it wasn’t in your possession. Therefore, you need to do everything you can to ensure that this never happens. If you want to protect yourself from losing data in this manner, I highly recommend reading the book from which I borrowed this list, Practical Unix and Internet Security, by Simson Garfinkel and Gene Spafford (O’Reilly).

Natural disasters

Are you prepared for a hurricane, tornado, earthquake, or flood? If not, you’re not alone. Imagine that your entire state was wiped out. If you are using off-site storage, is that facility close to you? Is it prepared to handle whatever type of natural disasters occur in your area? For example, if your office is in a flood zone, does your data storage company store your backups on the first floor? If they’re in the flood zone as well, your data can be lost in one good rain. If you really want to ensure yourself against a major natural disaster, you should explore real-time, off-site storage at a remote location, discussed later in this chapter in the section “Off-Site Storage.”

Other disasters

I remember how we used to test our disaster recovery plan at one company where I worked: we would pretend that some sort of truck blew up on the street that ran by our data center. The plan was to recover to an alternate building. This would mean that we would have to have off-site storage of media and an alternate site that was prepared to accommodate all our systems. A good way to do this is to separate your production and development systems and place them in different buildings. The development systems can then take the production systems’ place if the production systems are damaged, or if power to the production building is interrupted.

Archival information

It is a terrible thing to realize that a rarely used but very important file is missing. It is even more terrible indeed to find out that it has been gone longer than your retention cycle. For example, you keep your backups for only three months, after which you reuse the oldest volume, overwriting any backups that are on that volume. If that is the case, any files that have been missing for more than three months are impossible to recover. No matter how insistent the user is about how important the files are, no matter how many calls he makes to your supervisors, you will never be able to restore the files. That is why you should keep some of your backups a little bit longer. A normal practice is to set aside one full backup each month for a few years. If you’re going to keep these backups for a long time, make sure you read the following sidebar “Are You Keeping Your Archives Too Long?” and the “Backup and Archive” section in Chapter 24.

Automate Your Backup

If you work in a shop with a modest budget, you probably looked at this heading and said, “Sure, if I could afford it.” Although automation that involves expensive jukeboxes and autochangers is nice, that is not the type of automation I am talking about. There are two types of automation. One type allows your backups to complete an entire cycle without requiring any manual intervention from you, such as ejecting and loading new volumes. This type of automation can make things much easier but can also make them much more expensive. If you can’t afford it, a less expensive alternative is to have your backup system notify you when you need to do something manually. At the very least, it should notify you if you need (or forgot) to change a volume. If things aren’t going right, you need to know. Too many times people look at their backup logs only when they need to do a restore. That’s when they find out that their backups have failed for days or weeks. A slightly intelligent backup system could email you or page you if things don’t go the way you expect them to go.

The second type of automation is actually much more important. This type of automation refers to how your backups “think.” Your backup process should know what to back up without you telling it. If a DBA installs a new database, your backups should know about it. If a system administrator installs a new drive or filesystem, your backups should automatically include it. This is the type of automation that is essential to safe backups. A good backup system should not depend on a human brain to remember to do something.

Plan for Expansion

Another common problem happens as a backup system grows over time. What works for one or two boxes doesn’t necessarily work for 200. As the volume of data grows, the need for a standardized backup system becomes greater and greater. This is a problem because most administrators, as they are writing their shell script to back up five or six boxes, do not think ahead to the time when there may be many more. I can remember my early days as the backup guy. I had 10 or 11 systems, and the “monster” was an Ultrix box. It was “huge,” we said in those days. (It was almost 8 gigabytes!) The smallest tape drive we had was a 10 GB (with compression) Exabyte. We used the big 10 GB tape drive for the 8 GB system. We had what I considered to be a pretty good in-house backup script that worked without modification for two years.

Then came the HPs. The smallest system was 20 GB, and the biggest was much bigger than that. But these big systems came with a little 2 GB (4 with compression) DDS drive. Our backup script author never dreamed of a system that was bigger than a tape. One day I woke up, and our system was broken. I then spent months and months hacking up that shell script to support splitting the drive or filesystem into two tapes. Eventually, I gave up and bought a commercial product. My point is that if I had thought of that ahead of time, I might have been able to overcome the limitation without losing so much sleep.

When you are designing your backup system—or your data center, for that matter—plan on your systems getting bigger and more numerous. Plan for what you will do when that happens—trust me, it will happen. It will be much better for your mental health (not to mention your job security) if you can foresee the inevitable and plan for it when you design the system the first time. Your backup system is something that should be done right the first time. And if you spend a little time dreaming about how to break it before you design it, you can save yourself a lot of money in antacids and sleeping pills.

Don’t Forget Unix mtime, atime, and ctime

Unix, Linux, and Mac OS systems record three different times for each file. The first is mtime, or modification time. The mtime value is changed whenever the contents of the file have changed, such as when you add lines to a logfile. The second is atime, or access time. The atime value is changed whenever the file is accessed, such as when a script is run or a document is read. The last is ctime, or change time. The ctime value is updated whenever the attributes of the file, such as its permissions or ownership, are changed.

Administrators use ctime to look for hackers because they may change permissions of a file to try to exploit your system. Administrators also monitor atime to look for large files that have not been accessed for a long time. (Such files can be archived and deleted.)
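
You can see all three values for a file with standard tools. For example:

    $ ls -l  /var/adm/messages     # mtime: when the contents last changed
    $ ls -lu /var/adm/messages     # atime: when the file was last read
    $ ls -lc /var/adm/messages     # ctime: when its attributes last changed
    $ stat   /var/adm/messages     # on Linux, shows all three at once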

Backups change atime

You may be wondering what this has to do with backups. You need to understand that any backup utility that backs up via the filesystem modifies atime as it reads the file to back it up. Almost all commercial utilities, as well as tar, cpio, and dd,[4] behave this way. dump reads the filesystem via the raw device, so it does not change atime.

The atime can be reset—with a penalty

A backup program can look at a file’s atime before it backs it up. After it backs up the file, the atime obviously has changed. It can then use the utime system call to reset atime to its original value. However, changing atime is considered an attribute change, which means that it changes ctime. This means that when you use a utility such as cpio or gtar that can reset atime, you change ctime on every file that it backs up. If you have a system that is watching for ctime changes, it will think that it’s found a hacker for sure!
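
A quick experiment with GNU tar (its --atime-preserve option does exactly this reset) demonstrates the trade-off:

    $ ls -lc myfile                               # note the current ctime
    $ tar --atime-preserve -cf /tmp/x.tar myfile
    $ ls -lu myfile                               # atime looks untouched...
    $ ls -lc myfile                               # ...but ctime just changed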

Make sure that you understand how your utility handles this issue.

Don’t Forget ACLs

Windows files stored on an NTFS filesystem and some files stored on modern Linux filesystems use access control lists (ACLs) to grant or restrict permissions to users. ACLs say who can read, write, execute, modify, or have full control over a file. Figure 2-1 shows an example of such ACLs.

Figure 2-1. Access control list example

You need to investigate how your backup product is handling ACLs. The proper answer is that they are backed up and restored. This is a feature common with commercial products, but unfortunately, not all open-source products do this. Make sure you look into this when evaluating open-source tools.

Don’t Forget Mac OS Resource Forks

Mac OS files stored in MFS, HFS, or HFS Plus filesystems have two forks: the data fork and the resource fork. The data fork contains the actual data for the file, such as its text. The resource fork contains related structured data, such as offsets, menus, dialog boxes, and icons for the file. The two forks are tightly bound into a single file. While resource forks are typically used by executables, every file can have one, and other applications can use it as well. For example, a word processing program may store a file’s text in the data fork and the file’s images in the file’s resource fork.

These resource forks, like Windows ACLs, need to be backed up, and not all backup products back them up properly. Make sure you investigate what your backup system does with the data fork and resource fork.

Keep It Simple, SA

K.I.S.S. Have you seen this acronym before? It applies double or triple to backups. The more complicated your backup scheme is, the more likely it is to fail. If you do not understand it, you cannot implement it. Remember this every time you consider adding a new bell or whistle to your backup system. Every change puts your data at risk. Also, every change might make your backup system that much more complex—and more difficult to explain to the new backup person. One of the heads of support for a commercial backup product said that he sees the same thing over and over again. One person gets to know the software really well and writes various scripts to automate this and that. Backups become a well-oiled machine—until they are turned over to the trainee. The trainee doesn’t understand all the bells and whistles, and things start breaking. All of a sudden, your data is in danger. Keep that in mind the next time you think about adding some cool new feature to your backup script.

This next comment also relates to the previous section about “thinking big.” One of the common judgment errors is to not automate in the beginning. It’s so much easier to just put a hardcoded include list in a file somewhere or put it in the cron or scheduled task entry itself. However, that creates many different backup methods. If each box has its own special customized backup system, it is very hard to monitor your backups and explain them to the new person.

Tip

Remember, special is bad. Just keep saying it over and over again until you believe it.

It’s not such a big deal when you have two or three systems, but it is when you grow to 200 systems. If you have to remember every system’s idiosyncrasies every time you look at your logs, things inevitably get out of control. Exceptions for each system also can mean that things get overlooked. Do you remember that nine months ago you excluded /home* on apollo? I hope so, if apollo just became your primary NFS server, and it now has seven home directories.

If you cannot explain your backups to a stranger in less than a few hours, things are probably too complex. You should look at implementing things like centralized logging, standardized backup scripts, and some level of automation.

Storing Your Backups

It doesn’t do any good to make really good backups only to have your backup volumes destroyed, lost, or misplaced. You need to have a well-defined process for storing your media.

Storage in General

If you’ve read this far, you know that I consider your backups very important. If your backups are important, isn’t the media on which they reside just as important? That goes without saying, right? Well, you’d never know it from most volume “libraries.” Volume “piles” is probably a more accurate term. How many computer rooms have you seen that have volumes spread out all over the place? They get stacked, piled, fall behind the systems, and a tape cartridge works really well as a coaster for a coffee mug. (We wouldn’t want to get any coffee rings on the new server, right?)

Have you ever really needed a volume and couldn’t find it? I’ve been there. It’s a horrible feeling to know that you’ve got the file on a volume, but can’t find the darn volume! Why, then, do we treat our backup volumes like so much dirty laundry? Organize your backup volumes! Label them, catalog them, give them unique names or numbers, and put them in some sort of logical order in some kind of storage container. Do it, or the backup demon will come to haunt you!

Tip

Your ability to perform a large recovery quickly is directly related to how well you organize your media.

On-Site Storage

What about that media cabinet that you’re using for your on-site volume storage? You don’t have one, you say? You’re using a file cabinet, you say? Well, use something, and if you can afford it, buy one of the storage containers that a number of companies make specifically for media. They also make cabinets that can withstand fire. Spend the money; you’ll be glad you did. Doing a restore is so much less stressful when you can find the volume with no problem. Remember, though, that fireproof does not mean heat-proof. These media safes are meant to withstand brief fires that are quickly extinguished by a sprinkler system. If a fire burns for a long time right next to the container or raises the temperature in the room significantly, the volumes may be no good anyway. (This is another good reason why you also must store volumes off-site.)

Have the most well-organized person in your office design your media storage system. Here’s an idea. Ask your best administrative person to take a look at your storage system and compare it to his filing cabinet. Explain that you want an honest evaluation.

12,000 gold pieces

A financial institution where I once worked had an inventory of more than 12,000 pieces of media, and we never lost one. How did we do it, you ask? We treated every volume as if it were a piece of gold. Our inventory system was built on a number of things:

  • Each volume had a unique numeric identifier.

  • This number was in the form of a bar code placed on every volume. (Labeling more than 500 5.25-inch original installation floppies that came with our AT&T 3b2/1000s was no joy, I assure you, but we did it with the help of a team of temps!)

  • Each volume’s number, name, purpose, media type, date used, and location were stored in an Informix database.

  • Every volume movement was tracked by that database. When a volume was taken to another building for a backup or restore, that movement was recorded in the database. If a volume was sent to our off-site storage vendor, that was stored in the database. If an administrator borrowed a backup volume or installation CD, it was recorded in a field called “Loaned to:”.

  • There was a manual log for when we moved media out of the media library momentarily for restores. For daily, high-volume moves, we used a bar code scanner with a shell script that automatically updated the database (a sketch of such a script appears at the end of this section).

  • We did a complete inventory every other quarter and a spot-check inventory once a month. If the spot-check inventory turned up too many errors, it was time for another full inventory.

  • During the inventory we checked every volume against a printout of the database and every entry in the printout against an actual volume. (The latter half of the inventory consisted of hunting down errant administrators who had squirreled away backups or installation media in their drawers.)

  • The volumes were stored in Wrightline media cabinets and were behind locked doors. Only the backup operators had access to the volumes. (These were the same operators who were held responsible if something came up missing.)

  • The inventories were called “self-audits,” and there was also an annual internal audit by the audit department, as well as the external audit by the Office of the Comptroller of the Currency. They would comb through our logs, looking for inconsistencies. They had a knack for finding entries that looked a little weird and saying, “Let me see this one....”

  • This entire process was thoroughly documented, and the institution is still following these procedures to this day, although it has probably improved them a bit.

The OCC takes this whole issue of data protection very seriously. (The OCC, by the way, is the group that has the power to say you can no longer be a bank. You want to make sure that they are happy with your procedures.)
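The bar code scanner mentioned earlier worked because most scanners behave like keyboards: each scan “types” a volume ID followed by a newline, which a shell script can read in a loop. The sketch below logs movements to a flat file just to show the idea; the real system updated an Informix database, and the log path here is hypothetical:

    #!/bin/sh
    # Record media movements as volumes are scanned. Each bar code arrives
    # on stdin as one line; stamp it with the date and destination.
    echo "Enter destination (e.g., vault, building-2, loaned):"
    read DEST
    while read BARCODE
    do
        echo "`date '+%Y-%m-%d %H:%M'` $BARCODE $DEST" >> /var/log/media-moves
    done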

Off-Site Storage

Once you have organized the media that you are storing on-site, it’s time to consider off-site storage. There are two ways to store your data off-site:

  • Media vaulting (they hold your tapes)

  • Electronic vaulting (no tapes)

The latter can be expensive, but not as expensive as some people think. It is also much easier to use during a disaster, and you can’t lose tapes if there aren’t any tapes to lose. That is, of course, what off-site storage is meant to prepare you for—the destruction of your media and/or the building that holds it. If you have a complete set of backups in another location, you will be able to recover from even the worst local disaster.

Choosing a media vaulting vendor

Choosing a media vaulting vendor is as important a task as choosing your backup software. Choosing the wrong vendor can be disastrous. You depend on that vendor as your last line of defense, which is why you are paying them. Therefore, their storage and filing procedures need to be above reproach. They need to be at least as good as the procedures I described in the “12,000 gold pieces” section earlier in this chapter. Their movement-tracking procedure must be free of holes. Here is a list of things to consider when choosing an off-site storage vendor:

Individual media accountability

The first media vaulting vendor I ever used stored all of my volumes inside cases. They never inventoried the individual pieces of media. It was up to me to know which volume was in which case. When I needed a volume from one of the cases, they had to go in and get it. Once that was done, there was no log of where that volume actually existed. This is referred to as container vaulting. Most media vaulting companies also offer individual media vaulting. This method ensures that every volume is being tracked.

Bar-coded, location-based inventory

Again, each volume should have a bar code that allows your storage vendor to scan every volume in and out. They should scan volumes into their vault when they arrive and scan them out when they give them back to you.

Electronic double check

If you are keeping track of every volume’s location, and your vendor is too, you should double-check each other. One or both of you can print out an export of your database that shows volume locations. You can write a program that cross-checks the location of every volume against the other inventory. I can’t tell you how many times such a program has saved me. It’s great to find an error when it happens, instead of weeks later when you need a volume that got misplaced.
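A minimal sketch of such a cross-check, assuming you and your vendor can each export one “volume-ID location” pair per line (the filenames are hypothetical):

    # ours.txt and vendor.txt each contain one "volume-id location" per line.
    sort ours.txt   > /tmp/ours.sorted
    sort vendor.txt > /tmp/vendor.sorted
    # comm -3 prints only the lines that are NOT in both files, so every
    # line of output is a discrepancy to chase down today.
    comm -3 /tmp/ours.sorted /tmp/vendor.sorted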

Testing your chosen vendor

See if your vendor is on their toes. One tricky thing you can do is to see if they leave you alone in the vault. You are a customer of this company, so ask them if you can do an inventory of your media alone. See if they allow you unrestricted access to the inside of the vault. If they leave you alone inside the vault with no supervision, you have access to other companies’ media. That means that at certain times, other companies may have access to your media. Run, don’t walk, away from this company.

Make surprise inspections. Make spot checks. Ask for random volumes back, and see how quickly they can find them. Ask for volumes you just sent them. Volumes in the process of being inventoried are the hardest to find, but they should be able to do it. If you regularly send them five volumes a day with an inventory, put four volumes in one day, but list five on the inventory. See if they notice. If they don’t, raise a ruckus! Their procedures should protect you from these types of human errors. If they don’t, those procedures need to be improved. Be unpredictable. If you become predictable, you may be overlooked. Keeping them on their toes will make them remember you—and how important you think your volumes are. (By the way, your ability to make surprise inspections and spot checks should be spelled out in your contract. Make sure that it is OK for you to do this. If it is not...well, you know what to do.)

Vendors store two types of volumes: those that rotate in and out and those that stay there indefinitely. As you rotate the cyclical volumes in and out, they are inventoried. Your archive volumes are another story. If a volume has been there for two years and has never been touched, how do you know that it’s OK? You should make a full inventory of those volumes at least once, preferably twice, every year.

Tip

Send the original, keep the copies. One of the things that you should regularly test is your copy procedure. If you are sending volumes off-site, some backup products give you the option of sending the originals or copies. If you can, send the originals. When it comes time for a restore, use your copy. If things go wrong, you can always go get the original. This process validates your copy procedure every time you do a restore. You can correct flaws in the process before disaster strikes. I remember several instances when a volume was eaten in a drive, or had soda spilled on it, and we needed that off-site copy really badly. That is the wrong time to find out your copy procedure is no good!

Electronic vaulting

Electronic vaulting is becoming quite popular. It can be expensive, but it’s a beautiful thing. If you can afford it, I highly recommend it. The premise is that your backups are sent directly to a storage system at the electronic vaulting vendor. One question you need to ask yourself is, “What happens if they burn to the ground?” All your data could be lost. Don’t let this happen. Make sure that this storage company is not the only location for your backed-up data. In addition, make sure that you know how you’re going to do a large restore. While a small network link may be large enough to do a continuous incremental backup, it’s probably not large enough to do a 100 GB restore; at T1 speeds (roughly 1.5 Mbit/s), for example, moving 100 GB takes about six days of continuous transfer. If this is a concern, ask your electronic vaulting vendor about a local recovery appliance.

Testing Your Backups

I wish there were enough to say about this to make it a separate chapter, because it’s that important. I can’t tell you how many stories I have heard about people who waited until they needed a major restore before they tested their backups. That’s when they found out that they’d been using the wrong device or the wrong blocking factor, or that the device had I/O errors. This point cannot be stated strongly enough. If you don’t test your backups, you are guaranteed to get a surprise sooner or later.

Test Everything!

It is important to test every type of restore. If you are testing filesystem backups, make sure you:

  • Restore many single files. Can you find the needle in the haystack?

  • Restore an older version of a file.

  • Restore an entire drive or filesystem, and compare your results with the original. Are they the same size, and so on? (A sketch of this test follows this list.)

  • Pretend that an entire system is down, and try to recreate it.

  • Pretend that a particular volume is bad, and force yourself to use an alternate backup.

  • Retrieve a few volumes from your off-site storage vendor.

  • Pretend that your backup server is destroyed, and try to recover from that. (This one’s tough!) This test is extremely important if you are using an open-source or commercial backup utility. Some products do not plan for this well, and you can find yourself in a real Catch-22 situation.
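As one example of the restore-and-compare item above, here is a minimal sketch for a tar-based backup; the archive name and paths are hypothetical, and remember that live data may have legitimately changed since the backup was taken:

    #!/bin/sh
    # Restore into a scratch area, then compare against the live filesystem.
    mkdir -p /scratch/restore-test
    cd /scratch/restore-test || exit 1
    tar xf /backups/home.tar    # hypothetical archive of /home
    # Any output means the restored copy differs from the original;
    # decide whether it's legitimate change or a bad backup.
    diff -r home /home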

If you are testing database restores, make sure you:

  • Restore part of your database, pretending that you lost only one data file or disk drive, if this option is available.

  • Restore the entire database onto another server; this is where you learn about files that you are not including.

  • Restore the database to a point in time earlier than the present (this is helpful practice for recovering from a DBA or user error).

  • Pretend that last night’s backup failed, and force yourself to use an older backup. Theoretically, if you have saved all your transaction logs to a backup volume, you should be able to use a backup that is weeks old and roll it forward to the present time using those logs. This is another strong argument for using transaction logs.

Test Often

As I said earlier, sit around one day with some really pessimistic people, and ask them to dream up scenarios for you to test. Test your ability to recover from each of these scenarios on a regular basis. What works this month might not work next month. The only thing that is guaranteed to remain constant is change. One suggestion is to create a list of recovery procedures and randomly test a subset of them every month. Management changes, hardware changes, networks change, and OS and database versions change. Every change you know about should make you want to perform a test of the affected area.
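If you keep that list of recovery procedures in a file, picking this month’s random subset can be a one-liner (shuf is part of GNU coreutils):

    # Pick three procedures at random from the master list to test this month.
    shuf -n 3 recovery-procedures.txt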

Monitoring Your Backups

If you are not monitoring your backups, they are not doing what you think they are doing—guaranteed. This is one pot that will not boil if you don’t watch it. Every backup should have a log that is examined daily. This can be automated as well. Here are some examples:

Give me a summary.

dump gives a whole bunch of messages that I couldn’t care less about: Pass I, Pass II, % done, and so on. When I’m monitoring the dump backups of hundreds of drives or filesystems, most of that is so much noise. What I really want to see is what got dumped, where it went, when it went, what level it was, and the ever-popular DUMP IS DONE message. To get a summary of just those lines, I use grep -v to exclude the phrases I don’t want, leaving only a few lines that are much easier to review. This technique can also be applied to other Unix, Linux, and Mac OS backup commands.
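A minimal sketch of that filter; the patterns are examples, so match them to what your version of dump actually prints:

    # Throw away the chatter; what's left is the summary worth reading.
    grep -v -e 'DUMP: Pass I' \
            -e '% done' \
            /var/log/dump.log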

Show me anything weird.

You can do this in either of two ways. If you know the phrases that show up when things go wrong, grep for those. Another way is to use grep -v to remove all lines you’re expecting and see what’s left. If there’s nothing, great! If there are lines left over, they are probably errors. You may see lines such as I/O error, Write error, or something else you don’t like to see in your backups.
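Both approaches fit on a line or two. A sketch, with placeholder patterns you’d replace with your own log’s phrases:

    # Approach 1: grep FOR the phrases you know are bad.
    grep -i -e 'i/o error' -e 'write error' backup.log

    # Approach 2: strip every line you EXPECT to see; anything that
    # survives is probably an error worth investigating.
    grep -v -e 'DUMP IS DONE' -e 'backup complete' backup.log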

If you want to apply this to a Windows task, you need a Unix emulation package, such as Cygwin, UWIN, or GnuWin32, which lets you run grep and other shell commands on a Windows system.

You Can Always Make It Better

I don’t care how good your backups are; they can always be better. You could spend every waking hour tweaking and improving every piece of your backup program and know everything there is to know about backups, and they could still be better. My backups will never be good enough. There’s always a new bell or whistle on some other backup package, a bigger or smarter jukebox, a faster backup drive, or some scenario I thought of that I’m not covering. You must realize, however, that every change you make has a potential for data loss. A common thread that you will find in this book is that every time the human being enters into the equation, things can go wrong. You may be the best shell or Perl hacker in the world, and you will still make mistakes.

If It’s Not Baroque, Don’t Fix It

Baroque needed fixing. Come on! One constant tempo? And the harpsichord? It had to go. Thankfully, Bartolomeo Cristofori invented the piano in 1709 and gave us an instrument that could play loud (forte) and quiet (piano)—hence its name. Shortly after that, the Classical period began, and music started changing in tempo and feeling.

But if it’s not broken, don’t fix it! You’ve heard it before, but given the risk that each change carries, it goes double in the backup world. As you read this book or some magazine, or talk to other administrators, you will undoubtedly come up with a list of things that you wish you were doing. Concentrate on the holes: the scenarios that your backup and recovery plan just does not cover. Worry about the fact that none of your volumes are stored off-site before you think about working on that cool menu program you’ve been wanting to write for your restores. Make sure you’re covering all the bases before you start redecorating the stands. Before you consider making a new change, ask yourself whether something else is more important and whether the change is really necessary and worth the risk.

Following Proper Development Procedures

Don’t make a new change on your backup system and then roll it out to all your machines at once. Test it on a development system, or better yet, on a system that you don’t normally back up. That way you aren’t putting any backups in jeopardy. Another good practice is to test the change in parallel with what you’re already doing. The bigger the change, the more important it is to do a parallel conversion. This is especially true if you’re using a new method rather than just enhancing your current one. Don’t stop using your old method until you’re sure that the new one works! Follow a plan similar to this:

  • Test backup changes somewhere where they really won’t hurt anybody if they do something like, oh, crash the system!

  • Test the operation on a small scale on one system, using it in the same manner as you would in production. For example, if you are going to do both remote and local backups with this program, test both on a small scale.

  • Try to simulate every potential error the system might encounter:

    • Eject a volume in the middle of the backup.

    • Write-protect a volume.

    • Reboot the system you are backing up while it is backing up.

    • Drop the network connection, and power down a disk drive.

    • Know the system and the errors it tests for, and simulate each one to exercise that section of your system.

  • Test on a small number of systems, preferably in parallel with your current method.

  • When you roll it out to all systems, definitely do so in parallel. One of the ways you can do this is to squeeze all your backups onto as few volumes as you can, then use the leftover drives to do the new backup in parallel. Your network guys might hate you, but it’s really the only way to do a true parallel conversion. When I converted to my first commercial backup utility, I ran in this mode for almost a year.

  • Only after you’ve tested and thoroughly documented your new system should you turn off the old method. Remember to keep documentation and programs around to restore data from the old system until all the old volumes have been recycled into the new system.

  • Also consider your older backup volumes. If you have volumes that are five years old, are you going to be able to read them on a new vendor’s backup solution? Will you even be able to read them in version 14 of your current software if your company began writing the archive volumes in version 2? Will the media itself even be readable?

  • This gets into a whole other subject: media life. Even if you could still theoretically read the volumes 12 versions later, the volumes are probably bad. If you have long-term archives, you need to make fresh copies of them at some set interval by copying older tapes to newer tapes.

Unrelated Miscellanea

We were going to call this section “Oh, and by the way,” but that seemed like a really weird heading.

Protect Your Career

One of the reasons that backups are unpopular is that people are worried that they might get fired if they do them wrong. People do get in trouble when restores don’t go right, but following the suggestions in this section will help you protect yourself from “recovery failure fallout.”

Self-preservation: Document, document, document

Have you ever tried to go on vacation? If you’re the only one who understands the restore process or the organization of your media, you can bet that you will be called if a big restore is required. Backups are one area of system administration in which inadequate documentation can really get you in trouble. It’s hard to go on vacation, get promoted, or do anything that would pull you away from the magical area that only you know. Your backups and restores should be documented to the point that any system administrator can follow them step by step in your absence. That is actually a good way to test your documentation: have someone else try to use it.

The opposite of good documentation is, of course, bad or nonexistent documentation, and bad documentation is the surest way to help you find a new job. If you ever do manage to take a real vacation, one in which you don’t carry a beeper, check your voice mail, or check your email, watch out: Murphy’s Law governs vacations as well. You can guarantee that you, or more accurately, your coworkers, will have a major outage that week. If they crash and burn because you left them no guidelines for how to perform a restore, they will be looking for you when you return. You will not be a popular person, and you just might find yourself combing through the want ads.

Documentation is also an important method of letting your internal customers know what you are doing. For example, if you skip certain types of files, drives, or filesystems, it is good if you let people know that. I remember at least one very long conversation with a user who really didn’t want to hear that I didn’t back up /tmp: “I never knew that tmp was short for temporary!”

Strategy: Make backups an integral part of the installation process

When a new system comes in the door, someone makes sure that it has power. Someone is responsible for the network connection, assigning an IP address, adding it to the NIS configuration, and installing the appropriate patches. All those things happen because things don’t work if they don’t happen. Unfortunately, no one notices if you don’t add the machine to your backup list. That is, of course, until it crashes, and they need something restored. You have the difficult task of making something as “unimportant” as backups become just as natural as adding the network connection.

Tip

A new system coming in the door is usually the best test machine for a complete server recovery/duplication test. Not many people miss a machine they don’t have yet.

The only way that this is going to happen is if you become very involved in the whole process. Perhaps you are a junior person, and you never sit in on the planning meetings because you don’t understand what’s going on. Perhaps you do understand what’s going on but just hate to go to meetings. So do I. If you don’t want to attend every meeting, just make sure that someone is looking out for your interests in those meetings. An ex-backup operator who is sympathetic to your cause is ideal. Have briefings with her, and remind her to make sure that backup needs are being addressed or to let you know about any new systems that are coming down the pike. Occasionally, go to the meetings yourself, and make sure that people know that you and your backups exist; hopefully, they’ll remember that the next time they think about installing a new system without telling you. Never count on this happening, though. You’ve got to be ever-diligent, looking for new systems in need of backup.

New installations are not the only thing that can affect your backups. New versions of the operating system, new patches, and new database versions all can break your backups. Most system administrators bring in a new version of their operating system or database and run it on a new box or development box before they commit it to production. Make sure that your backup programs run on the new platform as well. I can think of a number of times that new versions broke my backups. Here are a few examples:

  • When HP-UX 10 came out, it supported file sizes greater than 2 GB, but the dump manpage said that dump does not back up a filesystem with large files.

  • Oracle has changed a number of times. Sometimes it maintains backward compatibility; sometimes it doesn’t.

  • The Windows Encrypting File System couldn’t be backed up by some backup systems when it first came out.

  • Mac OS versions (post Mac OS X) always seem to be a little ahead of the backup curve. The methods you used to back up or image the last version no longer work in the new version, giving rise to many unofficial versions of tools that actually work.

The longer I think about it, the more stories I come up with. If you’ve been doing this for a while, I’m sure you have a few of your own. Suffice it to say that OS and application upgrades can and do cause problems for the backup person. Test, test, test!

Get the Money Your Backups Need

This final section has absolutely nothing to do with backups. It has to do with politics, budgeting, money, and cost justifications. I know that sometimes it sounds as if I think that backups get no respect. Maybe you work at Utopia, Inc., where the first thing they think about is backups. The rest of us, on the other hand, have to fight for every volume, drive, and piece of software that we buy in order to accomplish this increasingly difficult task of getting it all on a backup volume.

Getting the money you need to accomplish your task can sometimes be very difficult. Once a million-dollar computer is rolled in the door and uncrated, how do you tell the appropriate department that the small, standalone backup drive that came with it just isn’t going to cut it? Do you know how many hoops they went through to spend a million dollars on one machine? You want them to spend how much more?

Be ready

The first thing is to be prepared. Be ready to justify what you need. Be ready with information such as:

  • Statistics on recoveries that you have performed.

  • Any numbers that you have on what downtime and lost data would cost the company, including any recovery-time-objective and recovery-point-objective requirements from business units (RTO and RPO are covered in more detail in Chapter 24).

  • Numbers that demonstrate how a purchase would help reduce staff costs.

  • Numbers that demonstrate how the current backup system is being negatively impacted by growth or new applications.

  • Cost comparisons between the one-time cost of a newer storage system and the continuing cost of the manual labor required to swap volumes every night (be prepared to explain how a larger jukebox or virtual tape library reduces the chance for human error and how that helps the company as well).

  • A documented policy that every new gigabyte results in a surcharge of a certain amount of money.

  • A well-designed presentation of what service your backups will provide and the speed at which you can recover data. (Don’t commit yourself to unrealistic restore times, but if the new system can significantly improve restore times, show that!)

  • A letter all ready to go, with which your boss is comfortable, that explains very matter-of-factly what the company can expect if it doesn’t provide the funding you need.

Make a formal presentation

The more expensive your solution, the more important it is that you make a formal presentation, especially if you are in a corporate environment. A formal technical presentation has three parts: an executive summary, an overview section that goes into more detail, and a technical specifications section for those who are really interested:

Executive summary

This should be one page and should explain on a very high level what is proposed in the rest of the presentation. Global figures and broad descriptions are good; do not go into too much detail. This is made for the VP, who needs to do the final sign-off but has 20 other presentations just like yours to review at the same time. Basically, state the current problem, and describe your solution.

Overview section

Go into detail in this section. Use plenty of section headings, allowing your readers to read ahead or skim over it if they like. Headings also allow the people who read only the executive summary to look up any specific area they are not clear on. The outline of the general section should match that of the executive summary. You can include references to other publications, such as magazine reviews of a particular product, but do not quote them in detail. If they are relevant, you can attach copies in the technical section.

Make sure you demonstrate that you have thought this through and that it is not just a stopgap measure. Put a high-level comparison between the option you chose and the other options available, and explain why you chose the one you did. Describe how it allows for future growth and how much growth it will allow before you must reconsider. Also explain what plans you have for the old methodology and what conversion method you are going to use, such as running both in parallel for a while. Tables are also good. If you can use real numbers, it is much more effective; just make sure you can back them up. If anyone believes that the numbers are made up, it will totally invalidate the report. Try to compare the up-front cost of your solution with the surprise cost of lost data.

Technical specifications

Go wild. If anyone has made it this far, they’re either really interested or a true computer techie just like you! If this report is the cost justification for a new backup drive, find a table that compares the relative cost per megabyte of all the various options. Include hard numbers and any white papers that are included with the proposed product. If you think it is relevant, but possibly too long and boring, this is the place to put it.

Good Luck

The chapters that follow explore in depth the various methods that you may employ to back up your systems, especially open-source tools. Most of these topics are also covered in documentation from the appropriate vendor or open-source team; this book is not meant to be a replacement for that documentation. Here, I try to explain things that are not covered in the documentation and possibly address some subjects more frankly than can a manual provided by the vendor.

Welcome to the world of backups.

Tip

BackupCentral.com has a wiki page for every chapter in this book. Read or contribute updated information about this chapter at http://www.backupcentral.com.



