Deciding What to Back Up

Experience shows that one of the most common causes of data loss is that the lost data was never configured to be backed up. The decision of what to back up is an important one.

Plan for the Worst

When trying to decide what files to include in your backups, take the most pessimistic technical person in your company out to lunch. In fact, get a few of them together. Ask them to come up with scenarios that you should protect against. Use these scenarios in deciding what should be included, and they will help you plan the “how” section as well. Ask your guests, “What are the absolute worst scenarios that could cause data loss?” Here are some possible answers:

  • An entire system catches fire and melts to the ground, leaving an unrecognizable mass of molten metal and blackened, smoking plastic.

  • Since this machine was so important, you, of course, had it replicated to another node right next to it. Of course, that machine catches fire right along with this one.

  • You have a centralized server that controls all backups and keeps a record of backup volume locations and what files are on what volumes, and so on. The server that blew up sits right next to this “backup server,” and the intense heat took this system with it.

  • The disastrous chain reaction continues, taking out your DHCP server, NIS master server, NFS home directory server, NFS application server, and the database server where you house the inventory of all your backup volumes with their respective locations. This computer also holds the telephone database listing all service agreements, vendor telephone numbers, and escalation procedures.

  • You haven’t memorized the number to your new off-site storage vendor yet, so it’s taped to the wall next to your backup server. You realize, of course, that the flames just burnt that paper beyond recognition.

  • All the flames set off the sprinkler system and water pours all over your backup volumes. Man, are you having a bad day . . .

What do you do if one of these scenarios actually happens? Do you even know where to start? Do you know:

  • What volume was last night’s backup on?

  • Where you stored it?

  • How to get in touch with the off-site storage vendor to retrieve the copies of your backup volumes?

  • Once you find that out, will your server and network equipment be available to recover?

  • Who you call to get replacement equipment at 2:00 A.M. on a Saturday?

  • What the network looked like before all the wires melted?

First, you need to recover your backup server, since it has all the information you need. OK, so now you found the backup company’s card in your wallet, and you’ve pulled back every volume they had. Since your media database is lost, how will you know which one has last night’s backup on it? Time is wasting . . .

All right, you’ve combed through all the volumes, and you’ve found the one you need to restore the backup server. Through your skill, cunning, and plenty of help from tech support, you restore the thing. It’s up and running. Now, how many disks were on the systems that blew up? What models were they? How were they partitioned? Weren’t some of them striped together into bigger volumes, and weren’t some of them mirroring one another? Where’s that information stored? Do you even have a df output of what the filesystems looked like? Man, this is getting complicated . . .

Didn’t you just install that big jumbo kernel patch last week on three of these systems? (You know, the one that stopped all those network broadcast storms that kept bringing your network down in the middle of the day.) You did make a backup of the kernel after you did that, didn’t you? Of course, the patch also updated files all over the OS drive. You made a full backup, didn’t you? How will you restore the root drive, anyway? Are you really going to go through the process of reinstalling the operating system, just so you can run the restore command and overwrite it again?

Filesystems aren’t picky about size, as long as you make them big enough to hold the data that you restore to them, so it’s not too hard to get those filesystems up and running. But what about the database? It was using raw partitions. You know it’s going to be much pickier. It’s going to want /dev/rdsk/c7t3d0s7, /dev/dsk/c8t3d0s7, and /dev/dsk/c8t4d0s7 right where they were and partitioned just as they were before the disaster. They also need to be owned by the database user. Do you know which drives were owned by that user before the crash? Which disks were those again? If restoring the root drive included reinstalling the operating system, how will you know what UID the database user was?

It could happen.

Tip

The catch-22 situations above are covered in Part IV.

Take an Inventory

Make sure you can access essential information in the event of a disaster.

Backups for your backups

Many companies have begun to centralize control of their backups, which I think is a good thing. However, once you centralize storage of all your backup information, you have a single point of failure for your entire backup plan. Restoring this server would have to be the first step in recovering from any multisystem outage. For things like the media inventory, don’t underestimate the value of an inventory printed on paper and stored off-site. That paper may just get you out of a catch-22. Given the single-point-of-failure factor, the recovery of your backup server should be the easiest and best-documented recovery that you have. You may even want to investigate creating a special dump or tar backup of that data to make it even easier to recover during a disaster.
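
For example, here is a minimal sketch of such a special backup. The catalog path, tape device, and inventory report name are assumptions for illustration only; substitute wherever your backup product actually keeps its media database.

    # Hypothetical paths: adjust for where your backup product keeps its catalog.
    CATALOG=/backup/catalog          # media database and indexes (assumed location)
    TAPE=/dev/rmt/0n                 # no-rewind tape device (assumed)

    # Write a small, standalone copy of the catalog to its own tape...
    tar cvf $TAPE $CATALOG
    mt -f $TAPE rewind

    # ...and print a paper copy of the volume inventory for off-site storage.
    lp $CATALOG/volume_list.txt      # hypothetical inventory report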

What peripheral devices did you have?

Assuming you back up /dev on a regular basis, you might have a list of all the device names, but do you know what models they are? If you have all Brand-X 2.9-gigabyte drives, then you have no problem, but many servers have a mixture of drives that were installed over time. You may have a collection of 1-GB, 2-GB, 2.01-GB, 2.1-GB, 2.9-GB, 4-GB, and 9-GB drives, all on the same system. Make sure that you are recording this in some way. Most Unix systems record this already, by the way, usually in the /var/adm/messages file, so hopefully you’re backing that up.
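
On a Solaris system, for example, something as simple as the following rough sketch will capture the disk list and each drive’s vendor, model, and size in files that your regular backups will pick up; the output filenames are arbitrary, and other platforms have equivalent commands.

    # List the disks the system can see (format exits after printing the list
    # when its input is /dev/null).
    format < /dev/null > /etc/disk_list.out

    # Record vendor, product, and size information for each drive.
    iostat -En > /etc/disk_models.out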

How were they partitioned?

This one can really get you, especially if you have to restore the root drive or a database drive. Both of these drives are typically partitioned with custom partitions that must be repartitioned exactly the same as before for a proper restore to occur. Typically, this partition information is not saved anywhere on the system, so you must do something special to record it. On a Solaris system, for example, you could run prtvtoc on each drive and save the output to a file. There are scripts that capture much of this information; two of them—SysAudit and SysInfo—are covered in Chapter 4.
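
Here is a minimal sketch of that idea for a Solaris system. The output filenames are arbitrary, and you would want to run something like this from cron so the saved copies stay current.

    # Save the partition table of every disk to a file that gets backed up.
    for disk in /dev/rdsk/c*t*d*s2
    do
        name=`basename $disk`
        prtvtoc $disk > /etc/vtoc.$name 2>/dev/null
    done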

How were your volume managers configured?

There are a number of operating-system-specific volume managers out there, such as Veritas Volume Manager, Solstice (Online) Disk Suite, and HP’s Logical Volume Manager. How is yours configured? What devices are mirrored to what? How are your multidisk devices set up? Unbelievably, this information is not always captured by normal backup utilities. In fact, I used Logical Volume Manager for months before hearing about the lvmcfgbackup command. (lvmcfgbackup backs up the LVM’s configuration information.) If you have this properly documented, you sometimes may not need to restore at all. For example, if the operating system disk crashes, you simply put the disks back the way they were, rebuild the stripe in the same order, and the data should be intact. I’ve done this several times.
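
As a rough sketch, something like the following could be run nightly from cron; which commands apply depends on which volume manager you actually use, and the output filenames are arbitrary.

    # Veritas Volume Manager: record the full disk group and volume layout.
    vxprint -ht > /etc/vxprint.out 2>/dev/null

    # Solstice DiskSuite: record metadevice (mirror/stripe) configuration.
    metastat > /etc/metastat.out 2>/dev/null

    # HP's LVM: the lvmcfgbackup command mentioned above serves the same purpose.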

How are your databases set up?

I have seen many database outages. When I asked a database administrator (DBA) how her database was set up, the answer was almost always, “I’m not sure . . .” Find out this information, and record it up front.

Did you document how you set up NFS, NIS, DHCP, etc.?

Document, document, document! There are a hundred reasons to properly document things like this, and recovery from a disaster is one of them. Good documentation is definitely part of the backup plan. It should be regularly updated and available. No one should be standing around saying, “I haven’t set up NIS from scratch in years. How do you do that again? Has anyone seen my copy of O’Reilly’s NFS and NIS book?” Actually, the best way to do this is to automate the creation of new servers. Take the time to write shell scripts that will install NIS, NFS, and the automounter, and configure them for your environment. Put these together in a toolkit that gets run every time you create a new server. Better yet, see if your OS vendor has any products that automate new server installations, such as Sun’s JumpStart or HP’s Ignite-UX.
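
The toolkit itself can be as simple as a wrapper script. The script names below are hypothetical placeholders, not real tools; the point is that every new server gets the same, repeatable setup.

    #!/bin/sh
    # new_server.sh -- hypothetical wrapper run once on every new server.
    ./setup_nis.sh          # bind the host to the NIS domain (placeholder script)
    ./setup_nfs.sh          # set up the standard NFS mounts and exports (placeholder)
    ./setup_automounter.sh  # install the standard automounter maps (placeholder)
    ./setup_backups.sh      # register the host with the backup system (placeholder)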

Do you have a plan for this?

The reason for describing the earlier horrible scenarios is so that you can start planning for them now. Don’t wait until there’s 20 feet of snow in your front yard before you start shopping for a snow shovel! It’s going to snow; it’s only a question of when. Take those pessimists out to lunch, and let them dream of the worst things that could happen, and then plan for them. Have a fully documented, step-by-step plan for the end of the computer world as you know it. Even if the plan needs a little modification when you actually have to use it, you will be glad you have a starting point. That will be a whole lot better than standing around saying, “What do we do now? Has anyone seen my resume?” (You did keep a hard copy of it, right?)

Know what’s on your boxes!

The best insurance against almost any kind of loss is for the backup/recovery person to be familiar with the systems he is protecting. If a particular server goes down, you should know immediately that it contains an Oracle database, and you should already be going after the appropriate backup volumes. That way, the moment the server is ready for a restore, so are you. Become very involved in the installation of any new system or database. You should know what database platforms you are using and how they are set up. You should know about any new filesystems, databases, or systems. You need to be very familiar with every box, what it does, and what’s on it. This information is vital, so that you can include any special backups for that type of system.

Are You Backing Up What You Think You’re Backing Up?

I remember an administrator at a previous employer who used to say, “Are we getting this on tape?” He always said it with his trademark smirk, and it was his way of saying “Hi” to the backup guy. His question makes a point. There are some global ways that you can approach backups that may drastically improve their effectiveness. Before we decide whether to back up part or all of the system, let’s look at the common practice of using include lists, why it is dangerous, and some of the ways you can avoid it. What are include and exclude lists? Generically speaking, there are two ways to back up a system:

  • You can tell your backup system to back up everything, except what is in an exclude list, for example:

    Include: *
    Exclude: /tmp /junk1 /junk2

  • You can tell your backup system to back up what is in an include list, for example:

    Include: /data1 /data2 /data3

Looking at these examples, ask yourself what happens when you create /data4? Someone has to remember to add it to the include list, or it will not be backed up. This is a recipe for disaster. Unless you’re the only one who adds filesystems and you have perfect memory, there will always be a forgotten filesystem. As long as there are other administrators and there is gray matter in your head, something will get left out.

However, unless you’re using a commercial backup utility, it takes a little effort to say, “Back up everything.” How do you make the list of what systems, filesystems, and databases to back up? What you need to do is look at files like /etc/vfstab (or its equivalent on your operating system) and parse out a list of filesystems to back up. You can then exclude any filesystems that are in any exclude lists that you have.
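
Here is a minimal sketch of that approach for a Solaris-style /etc/vfstab. The exclude file /etc/backup.exclude is a hypothetical one-mount-point-per-line list, not a standard file.

    EXCLUDES=/etc/backup.exclude     # hypothetical exclude list, one mount point per line

    # Field 3 of vfstab is the mount point, field 4 is the filesystem type.
    awk '$4 == "ufs" {print $3}' /etc/vfstab |
    while read fs
    do
        grep -x "$fs" $EXCLUDES >/dev/null 2>&1 && continue   # skip excluded filesystems
        echo "$fs"               # feed this list to dump, tar, or your backup utility
    done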

Oracle has a similar file, called oratab, which lists all Oracle instances on your server.[4] You can use this file to list all instances that need backing up. Unfortunately, Informix and Sybase databases have no such file unless you manually make one. I do recommend making such a file for many reasons. It is much easier to standardize system startup and backups when you have such a file. If you design your startup scripts so that a database does not get started unless it is in this file, then you can be reasonably sure that any databases that anyone cares about will be in this file. This means, of course, that any important databases will be backed up without any manual intervention from you. It also means that you can use the same Informix and Sybase startup scripts on every system, instead of having to hardcode each database’s name into the startup scripts.
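
A sketch of reading oratab follows; the file lives in /etc/oratab or /var/opt/oracle/oratab depending on the platform, each line has the form SID:ORACLE_HOME:Y|N, and comments start with #. The backup step itself is just a placeholder.

    ORATAB=/etc/oratab
    [ -f /var/opt/oracle/oratab ] && ORATAB=/var/opt/oracle/oratab

    # Pull the instance name (first field) from every non-comment line.
    grep -v '^#' $ORATAB | awk -F: 'NF >= 2 {print $1}' |
    while read sid
    do
        echo "Backing up Oracle instance $sid"
        # run your hot-backup or export procedure for $sid here (placeholder)
    done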

How do you know what systems to back up? Although I never got around to it, one of the scripts I always wanted to write was a script that monitored the various host databases, looking for new systems. I wanted to get a complete list of all hosts from /etc/hosts, the Domain Name System (DNS), and the Network Information Service (NIS), and compare it against a master list. Once I found a new IP address, I would try to determine whether it was alive. If it was, that would mean there was a new host that possibly needed backing up. This would be an invaluable script; it would make sure that there are no new systems on the network that the backups don’t know about. Once you find a new IP address, you can use queso to determine what kind of system it is. queso gets its name from an abbreviated Spanish phrase meaning “What operating system are you?” It sends a malformed TCP packet to the IP address, and the way the host responds to that packet reveals which operating system it is running. (queso is covered in Chapter 4.)
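
A rough sketch of the idea is shown below. The master list is a hypothetical file you maintain yourself, the ypcat line assumes NIS is in use, the DNS side (a zone transfer, for instance) is left out, and the ping syntax is the Solaris form with a timeout in seconds.

    MASTER=/usr/local/etc/known_hosts.list   # hypothetical list of hosts already backed up

    # Gather addresses from /etc/hosts and the NIS hosts map.
    ( cat /etc/hosts ; ypcat hosts 2>/dev/null ) |
    awk '/^[0-9]/ {print $1}' | sort -u |
    while read ip
    do
        grep -x "$ip" $MASTER >/dev/null 2>&1 && continue   # already known
        if ping $ip 2 >/dev/null 2>&1                       # Solaris: two-second timeout
        then
            echo "New live host: $ip"    # candidate for backups; identify it with queso
        fi
    done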

Back Up All or Part of the System?

Assuming you’ve taken care of the things that normal system backups don’t cover, you are now in a position to decide whether you are going to back up your entire systems or just selected filesystems from each system. These are definitely two different schools of thought. As far as I’m concerned, there are too many gotchas in the selected-filesystem option. Backing up everything is easier and safer than backing up from a list. You will find that most books stop right there and say, “It’s best to back up everything, but most people do something else.” You will not see those words here. I think that not backing up everything is very dangerous. Consider the following comparison between the two methods.

Backing up only selected filesystems

Save media space and network traffic

The first argument that is typically stated as a plus to the selected-filesystem method is that you have to back up less data. People of this school recommend having two groups of backups, operating system data and regular data. The idea is that the operating system backups would be performed less often. Some would even recommend that they be performed only when you have a significant change, like an operating system upgrade, patch installation, or kernel rebuild. You would then back up your “regular” data daily.

The first problem with this argument is that it is outdated; just look at the size of the typical modern system. The operating system/data ratio is now significantly heavy on the data side. You won’t be saving much space or network traffic by not backing up the OS even on your full backups. When you consider incremental backups, the ratio gets even smaller. Operating system partitions will have almost nothing of size that would be included in an incremental backup, unless it’s something important that should be backed up! This includes things like /etc/passwd, /etc/hosts, syslog, /var/adm/messages, and any other files that would be helpful if you lost the operating system. Filesystem swap is arguably the only completely worthless information that could be on the OS disk, and it can be excluded with proper use of an exclude list.

Harder to administer

Proponents of piecemeal backup would say that you can include important files like the preceding ones in a special backup. The problem with that is that it is so much more difficult than backing up everything. Assuming you use exclude lists for the regular backups (as discussed before), you have to remember to do manual backups every time you do something major. Also, there is the matter of files like /etc/passwd that change every day. You need to back them up, and they are included in the filesystems that you’ve excluded. That means you have to do something special for them. Special is bad.

Easier to split up between volumes

One of the very few things that could be considered a plus is that if you split up your filesystems into multiple backups, it is easier to split them between multiple volumes. If a dump of your system will not fit on one volume, then it is easier to automate it by splitting it into two different include lists. However, in order to take advantage of this, you will have to use include lists rather than exclude lists; then you are subject to their limitations discussed earlier. I also suggest that if you have systems that are larger than your backup drives, it is a good time to consider a commercial backup utility.

Easier to write a script to do it than to parse out the /etc/vfstab or /etc/oratab

This one is hard to argue against. However, if you do take the time to do it right the first time, you never need to mess with the include lists again. This reminds me of another favorite phrase of mine, “Never time to do it right, always time to do it over.” Take the time to do it right the first time.

The worst that happens? You overlook something!

In this scenario, the biggest benefits are that you save some time spent scripting up front, as well as a few bytes of network traffic. The worst possible side effect is that you overlook the filesystem that contained your boss’s budget sheet that just got deleted.

Backing up the entire system

Complete automation

Once you go through the trouble of creating a script or program that works, you just need to monitor its logs. You can rest easy at night knowing that all your data is being backed up.

The worst thing that happens? You lose a friend in the network department

You may increase your network traffic by a few percentage points, and the people looking out after the wires might not like that. (That is, of course, until you restore the server where they keep their DNS source database.)

Backing up selected filesystems is one of the most common mistakes that I find when evaluating a backup configuration. It is a very easy trap to fall into, because of the time it saves you up front. Until you’ve been bitten, though, you may not know how much danger you are in. If your backup setup uses include lists, I hope that this discussion convinces you to rethink that decision.



[4] You can install an Oracle instance without putting it in this file. However, that instance will not get started when the system reboots. This usually means that the DBA will take the time to put it in this file. More on that in Chapter 15.
