Chapter 4. Archive management: Backing up or copying entire file systems

This chapter covers

  • Why, what, and where to archive
  • Archiving files and file systems using tar
  • Searching for system files
  • Securing files with object permissions and ownership
  • Archiving entire partitions with dd
  • Synchronizing remote archives with rsync

Through the book’s first chapters, you learned a lot about getting around both safely and efficiently in a Linux environment. You also learned to generate base working environments using the wonders of virtualization. From here on in, I’ll focus on building and maintaining the infrastructure elements you’ll need to get real stuff done.

Building IT infrastructures without a good backup protocol is like mortgaging your home to invest in your brother-in-law’s can’t-go-wrong cold fusion invention. You know the odds are that it won’t end well. But before you can properly back up file systems and partitions, you’ll need to understand exactly how file systems and partitions work. After that? What tools are available? When should each be used, and how will you put everything back together again if disaster strikes? Stay tuned.

4.1. Why archive?

Before we get to the why, just what is an archive? It’s nothing more than a single file containing a collection of objects: files, directories, or a combination of both. Bundling objects within a single file (as illustrated in figure 4.1) sometimes makes it easier to move, share, or store multiple objects that might otherwise be unwieldy and disorganized.

Figure 4.1. Files and directories can be bundled into an archive file and saved to the file system.

Imagine trying to copy a few thousand files spread across a dozen directories and subdirectories so your colleague across the network can see them too. Of course, with the proper command-line syntax, anything can be done. (Remember cp from chapter 1? And -r?) But making sure you copy only the files you’re after, without accidentally leaving anything out, can be a challenge. Granted, you’ll still need to account for all those files at least one time as you build the archive. But once you’ve got everything wrapped up in a single archive file, it’s a whole lot easier to track. Archives it is, then.

But there are archives, and then there are archives. Which to choose? That depends on the kinds of files you’re looking to organize and on what you plan to do with them. You might need to create copies of directories and their contents so you can easily share or back them up. For that, tar is probably going to be your champion of choice. If, however, you need an exact copy of a partition or even an entire hard disk, then you’ll want to know about dd. And if you’re looking for an ongoing solution for regular system backups, then try rsync.

Learning how to use those three tools and, more importantly, learning what real problems those three tools can solve for you will be the focus of the rest of this chapter. Along the way, we’ll take a bit of a detour to see how to protect the permissions and ownership attributes for the files in the archive as they move through the archive life cycle. Finally, we’ll take a peek at why Linux uses file permissions and file ownership in the first place.

4.1.1. Compression

One more note before we begin. Although the two are often used together, don’t confuse archiving with compression. Compression, as figure 4.2 shows, is a process that applies a clever algorithm to a file or archive to reduce the amount of disk space it takes up. Of course, compressed files are unreadable in that state, which is why the algorithm can also be applied in reverse to decompress them.

Figure 4.2. Object compression by eliminating statistical redundancy and/or removing less important parts of a file

As you’ll see soon, applying compression to a tar archive is simple and doing so is a particularly good idea if you’re planning to transfer large archives over a network. Compression can reduce transmission times significantly.
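
If you’d like a feel for the effect before we get to tar, try compressing a text-heavy file on its own and comparing sizes. (The filename here is just a placeholder, and the -k flag, which keeps the original alongside the compressed copy, assumes a reasonably recent version of gzip.)

$ gzip -k mylogfile.log        # compress, keeping the original file in place
$ ls -lh mylogfile.log*        # compare the sizes of the two versions

Text files will often shrink to a small fraction of their original size; already-compressed formats, as you’ll see later with .mp4 video, won’t.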

4.1.2. Archives: Some important considerations

Two primary reasons you’ll want to create archives are to build reliable file system images and to create efficient data backups. This section describes those objectives.

Images

What’s an image? Remember those .ISO files you used to install Linux on a virtual machine back in chapter 2? Those files were images of complete operating systems, specially organized to make it easy to copy the included files to a target computer.

Images can also be created from all or parts of a live, working operating system (OS) so you can copy and paste the contents to a second computer. This effectively makes the second (copy) an exact clone of the first system in its current state. I’ve often done this to rescue a complex installation from a failing hard drive when I didn’t feel like building up the whole thing again from scratch on its new drive. It’s also great when you want to quickly provide identical system setups to multiple users, like student workstations in a classroom.

Note

Don’t even think about trying any of this with Windows. For all intents and purposes, the Windows registry architecture makes it impossible to separate an installed OS from its original hardware.

Although we’re going to spend the rest of this chapter talking about backups rather than images, don’t worry. The tools we’d use for creating and restoring images are pretty much the same, so you’ll be fine either way.

Data backups

Backups should be a big part of your life. In fact, if you never worry about the health of your data, then either you’re a Zen master or you’re just not doing your job right. There’s so much scary stuff that can happen:

  • Hardware can—and will—fail. And it’ll usually happen right before you were planning to get to that big backup. Really.
  • Fat fingers (by which I mean clumsy people) and keyboards can conspire to mangle configuration files, leaving you completely locked out of your encrypted system. Having a fallback copy available can save your job and, quite possibly, your life.
  • Data insecurely stored on cloud infrastructure providers like Amazon Web Services (AWS) can be suddenly and unpredictably lost. Back in 2014, this happened to a company called Code Spaces. The company’s improperly configured AWS account console was breached, and the attackers deleted most of its data. How did Code Spaces recover? Well, when was the last time you heard anything about Code Spaces?
  • Perhaps most terrifying of all, you could become the victim of a ransomware attack that encrypts or disables all your files unless you pay a large ransom. Got a reliable and recent backup? Feel free to tell the attackers just what you think.

Before moving on, I should mention that untested data backups may not actually work. In fact, there’s evidence to suggest that nearly half of the time they don’t. What’s the problem? There’s a lot that can go wrong: there could be flaws on your backup device, the archive file might become corrupted, or the initial backup itself might have been unable to properly process all of your files.

Generating and monitoring log messages can help you spot problems, but the only way to be reasonably confident about a backup is to run a trial restore onto matching hardware. That will take energy, time, and money. But it sure beats the alternative. The best system administrators I’ve known all seem to share the same sentiment: “Paranoid is only the beginning.”

4.2. What to archive

If there aren’t too many files you want to back up and they’re not too large, you might as well transfer them to their storage destination as is. Use something like the SCP program you saw in chapter 3. This example uses SCP to copy the contents of my public encryption key into a file called authorized_keys on a remote machine:

ubuntu@base:~$ scp .ssh/id_rsa.pub \
  [email protected]:/home/ubuntu/.ssh/authorized_keys        1

  • 1 Overwrites the current contents of the remote authorized_keys file

But if you want to back up many files spread across multiple directories (a complicated project with source code, for instance) or even entire partitions (like the OS you’re running right now), you’re going to need something with more bite.

Although we discussed disk partitions and pseudo files in chapter 1, if you want to develop some kind of intelligent backup policy, you’ll want to get a feel for what they look like. Suppose you’re planning a backup of a partition containing your company’s large accounting database; you probably won’t get too far without knowing how much space that partition takes up and how to find it.

Let’s begin with the df command, which displays each partition that’s currently mounted on a Linux system, along with its disk usage and location on the file system. Adding the -h flag converts partition sizes to human readable formats like GB or MB, rather than bytes:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       910G  178G  686G  21% /                  1
none            492K     0  492K   0% /dev
tmpfs           3.6G     0  3.6G   0% /dev/shm           2
tmpfs           3.6G  8.4M  3.6G   1% /run               3
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.6G     0  3.6G   0% /sys/fs/cgroup

  • 1 The root partition: the only normal partition on this system
  • 2 Note the 0 bytes for disk usage. That (usually) indicates a pseudo file system.
  • 3 The /run directory contains files with runtime data generated during boot.

The first partition listed is designated as /dev/sda2, which means that it’s the second partition on Storage Device A and that it’s represented as a system resource through the pseudo file system directory, /dev/. This happens to be the primary OS partition in this case. All devices associated with a system will be represented by a file in the /dev/ directory. (The partition used by your accounting software would appear somewhere on this list, perhaps designated using something like /dev/sdb1.)

Note

Running df on an LXC container displays the partitions associated with the LXC host.

It’s important to distinguish between real and pseudo file systems (file systems whose files aren’t actually saved to disk but live in volatile memory and disappear when the machine shuts down). After all, there’s no point backing up files that represent an ephemeral hardware profile and, in any case, will be automatically replaced by the OS whenever, and wherever, the real file system is booted next.

It’s pretty simple to tell which partitions are used for pseudo files: if the file designation is tmpfs and the number of bytes reported in the Used column is 0, then the odds are you’re looking at a temporary rather than a normal file system.
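
If you’d rather not eyeball the output, GNU df can filter out whole file system types for you using its -x (exclude-type) flag. Something like this should hide the temporary entries, though the exact pseudo file system types can vary between distributions:

$ df -h -x tmpfs -x devtmpfs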

By the way, that df was run on an LXC container, which is why there’s only one real partition, /. Let’s see what it shows us when run on a physical computer:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.5G     0  3.5G   0% /dev
tmpfs           724M  1.5M  722M   1% /run
/dev/sda2       910G  178G  686G  21% /
tmpfs           3.6G  549M  3.0G  16% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs           3.6G     0  3.6G   0% /sys/fs/cgroup
/dev/sda1       511M  3.4M  508M   1% /boot/efi            1
tmpfs           724M   92K  724M   1% /run/user/1000
/dev/sdb1       1.5G  1.5G     0 100% /mnt/UB-16           2

  • 1 This partition was created during installation to enable UEFI booting.
  • 2 sdb1 is a USB thumb drive containing an Ubuntu live boot image.

Notice the /dev/sda1 partition mounted at /boot/efi. This partition was created during the original Linux installation to permit system boots controlled by the UEFI firmware. UEFI has now largely replaced the old BIOS interface that was used for hardware initialization during system boot. Software installed on this partition allows UEFI integration with a Linux system. And /dev/sdb1 is a USB thumb drive that happened to be plugged into the back of my machine.

When you’re dealing with production servers, you’ll often see separate partitions for directories like /var/ and /usr/. This is often done to make it easier to maintain the integrity and security of sensitive data, or to protect the rest of the system from being overrun by file bloat from, say, the log files on /var/log/. Whatever the reason, for any particular disk design, you’ll want to make informed decisions about what needs backing up and what doesn’t.

You’ll sometimes see the /boot/ directory given its own partition. I personally think this is a bad idea, and I’ve got scars to prove it. The problem is that new kernel images are written to /boot/ and, as your system is upgraded to new Linux kernel releases, the disk space required to store all those images increases. If, as is a standard practice, you assign only 500 MB to the boot partition, you’ll have six months or so before it fills up—at which point updates will fail. You may be unable to fully boot into Linux before manually removing some of the older files and then updating the GRUB menu. If that doesn’t sound like a lot of fun, then keep your /boot/ directory in the largest partition.
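
If you’re stuck administering a system with a separate boot partition, at least keep an eye on it. On Debian- or Ubuntu-flavored systems, commands like these (a sketch; package management will differ on other distributions) show how full the partition is and clear out old kernels you no longer need:

$ df -h /boot                  # how much space is left?
$ ls /boot/vmlinuz-*           # how many kernel images have piled up?
$ sudo apt autoremove --purge  # remove automatically installed packages (including old kernels) that are no longer needed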

4.3. Where to back up

From an OS perspective, it doesn’t make a difference where you send your archives. Feel free to choose between legacy tape drives, USB-mounted SATA storage drives, network-attached storage (NAS), storage area networks (SAN), or a cloud storage solution. For more on that, see my book Learn Amazon Web Services in a Month of Lunches (Manning, 2017).

Whichever way you go, be sure to carefully follow best practices. In no particular order, your backups should all be:

  • Reliable—Use only storage media that are reasonably likely to retain their integrity for the length of time you intend to use them.
  • Tested—Test restoring as many archive runs as possible in simulated production environments.
  • Rotated—Maintain at least a few historical archives older than the current backup in case the latest one should somehow fail.
  • Distributed—Make sure that at least some of your archives are stored in a physically remote location. In case of fire or other disaster, you don’t want your data to disappear along with the office.
  • Secure—Never expose your data to insecure networks or storage sites at any time during the process.
  • Compliant—Honor all relevant regulatory and industry standards at all times.
  • Up to date—What’s the point of keeping archives that are weeks or months behind the current live version?
  • Scripted—Never rely on a human being to remember to perform an ongoing task. Automate it (read chapter 5); there’s a bare-bones scheduling sketch right after this list.
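
To make that last point a bit more concrete, here’s what the simplest possible automation might look like: a single crontab entry that runs a backup script every night at 2:15. (The script name is, of course, hypothetical; chapter 5 covers the scheduling tools properly.)

$ crontab -e                   # opens your user’s job schedule in an editor

Then add a line like this one (minute, hour, day of month, month, day of week, command):

15 2 * * * /usr/local/bin/nightly-backup.sh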

4.4. Archiving files and file systems using tar

To successfully create your archive, there are three things that will have to happen:

  1. Find and identify the files you want to include.
  2. Identify the location on a storage drive that you want your archive to use.
  3. Add your files to an archive, and save it to its storage location.

Want to knock off all three steps in one go? Use tar. Call me a hopeless romantic, but I see poetry in a well-crafted tar command: a single, carefully balanced line of code accomplishing so much can be a thing of beauty.

4.4.1. Simple archive and compression examples

This example copies all the files and directories within and below the current working directory and builds an archive file that I’ve cleverly named archivename.tar. Here I use three arguments after the tar command: the c tells tar to create a new archive, v sets the screen output to verbose so I’ll get updates, and f points to the filename I’d like the archive to get:

$ tar cvf archivename.tar *
file1                          1
file2
file3

  • 1 The verbose argument (v) lists the names of all the files added to the archive.

Note

The tar command will never move or delete any of the original directories and files you feed it; it only makes archived copies. You should also note that using a dot (.) instead of an asterisk (*) in the previous command will include even hidden files (whose filenames begin with a dot) in the archive.

If you’re following along on your own computer (as you definitely should), then you’ll see a new file named archivename.tar. The .tar filename extension isn’t necessary, but it’s always a good idea to clearly communicate the purpose of a file in as many ways as possible.
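
By the way, you don’t have to extract an archive just to see what’s inside it. Running tar with the t argument instead of c lists an archive’s contents without touching anything:

$ tar tvf archivename.tar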

You won’t always want to include all the files within a directory tree in your archive. Suppose you’ve produced some videos, but the originals are currently kept in directories along with all kinds of graphic, audio, and text files (containing your notes). The only files you need to back up are the final video clips using the .mp4 filename extension. Here’s how to do that:

$ tar cvf archivename.tar *.mp4

That’s excellent. But those video files are enormous. Wouldn’t it be nice to make that archive a bit smaller using compression? Say no more! Just run the previous command with the added z argument, which tells tar to compress the archive using the gzip program. If you want to follow convention, you can also add a .gz extension in addition to the .tar that’s already there. Remember: clarity. Here’s how that would play out:

$ tar czvf archivename.tar.gz *.mp4

If you try this out on your own .mp4 files and then run ls -l on the directory containing the new archives, you may notice that the .tar.gz file isn’t all that much smaller than the .tar file, perhaps 10% or so. What’s with that? Well, the .mp4 file format is itself compressed, so there’s a lot less room for gzip to do its stuff.

As tar is fully aware of its Linux environment, you can use it to select files and directories that live outside your current working directory. This example adds all the .mp4 files in the /home/myuser/Videos/ directory:

$ tar czvf archivename.tar.gz /home/myuser/Videos/*.mp4

Because archive files can get big, it might sometimes make sense to break them down into multiple smaller files, transfer them to their new home, and then re-create the original file at the other end. The split tool is made for this purpose.

In this example, -b tells split to break the archivename.tar.gz file into 1 GB-sized parts; the final argument is the prefix you’d like each part to be given. The operation then names each of the parts—archivename.tar.gz.partaa, archivename.tar.gz.partab, archivename.tar.gz.partac, and so on:

$ split -b 1G archivename.tar.gz "archivename.tar.gz.part"

On the other side, you re-create the archive by reading each of the parts in sequence (cat archivename.tar.gz.part*) and then redirecting the output to a new file called archivename.tar.gz:

$ cat archivename.tar.gz.part* > archivename.tar.gz
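
Because a single corrupted byte can render a compressed archive useless, it’s worth confirming that the reassembled file is identical to the original. One simple way is to generate a checksum on each side of the transfer and compare the two values:

$ sha256sum archivename.tar.gz      # run this before splitting and again after reassembly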

4.4.2. Streaming file system archives

Here’s where the poetry starts. I’m going to show you how to create an archive image of a working Linux installation and stream it to a remote storage location—all within a single command (figure 4.3).

Figure 4.3. An archive is a file that can be copied or moved using normal Bash tools.

Here’s the command:

# tar czvf - --one-file-system / /usr /var \
  --exclude=/home/andy/ | ssh [email protected] \
  "cat > /home/username/workstation-backup-Apr-10.tar.gz"

Rather than trying to explain all that right away, I’ll use smaller examples to explore it one piece at a time. Let’s create an archive of the contents of a directory called importantstuff that’s filled with, well, really important stuff:

$ tar czvf - importantstuff/ | ssh [email protected] \
   "cat > /home/username/myfiles.tar.gz"
importantstuff/filename1
importantstuff/filename2
[...]
[email protected]'s password:            1

  • 1 You’ll need to enter the password for your account on the remote host.

Let me explain that example. Rather than entering the archive name right after the command arguments (the way you’ve done until now), I used a dash (czvf -). The dash outputs data to standard output. It lets you push the archive filename details back to the end of the command and tells tar to expect the source content for the archive instead. I then piped (|) the unnamed, compressed archive to an ssh login on a remote server where I was asked for my password. The command enclosed in quotation marks then executed cat against the archive data stream, which wrote the stream contents to a file called myfiles.tar.gz in my home directory on the remote host.

As you can see in figure 4.4, one advantage of generating archives this way is that you avoid the overhead of a middle step. There’s no need to even temporarily save a copy of the archive on the local machine. Imagine backing up an installation that fills 110 GB of its 128 GB of available space. Where would the archive go?

Figure 4.4. Streaming an archive as it’s created avoids the need to first save it to a local drive.
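
A variation on the same theme: rather than saving the stream as an archive file on the remote machine, you can have the remote end unpack it the moment it arrives. This sketch (the /backups target directory is my own invention) pipes the stream into a second tar process for immediate extraction; the -C argument tells that second tar where to put the files:

$ tar czvf - importantstuff/ | ssh [email protected] \
  "tar xzvf - -C /backups"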

That was just a directory of files. Suppose you need to back up an active Linux OS to a USB drive so you can move it over to a separate machine and drop it into that machine’s main drive. Assuming there’s already a fresh installation of the same Linux version on the second machine, the next copy/paste operation will generate an exact replica of the first.

Note

This won’t work on a target drive that doesn’t already have a Linux file system installed. To handle that situation, as you’ll see shortly, you’ll need to use dd.

The next example creates a compressed archive on the USB drive known as /dev/sdc1. The --one-file-system argument excludes all data from any file system besides the current one. This means that pseudo partitions like /sys/ and /dev/ won’t be added to the archive. If there are other partitions that you want to include (as you’ll do for /usr/ and /var/ in this example), then they should be explicitly added. Finally, you can exclude data from the current file system using the --exclude argument:

# tar czvf /dev/sdc1/workstation-backup-Apr-10.tar.gz \
 --one-file-system \                                    1
 / /usr /var \                                          2
 --exclude=/home/andy/                                  3

  • 1 Excludes data from other partitions when building the archive
  • 2 References /usr and /var partitions explicitly
  • 3 Excludes directories or files within a selected file system when necessary (poor old Andy)

Now let’s go back to that full-service command example. Using what you’ve already learned, the command archives all the important directories of a file system and streams the compressed result over SSH to a remote machine. It should make sense to you now:

# tar czvf - --one-file-system / /usr /var \
  --exclude=/home/andy/ | ssh [email protected] \
  "cat > /home/username/workstation-backup-Apr-10.tar.gz"

All that’s fine if the files you need to archive (and only those files) are agreeably hanging out together in a single directory hierarchy. But what if there are other files mixed in that you don’t want to include? Is there a way to aggregate only certain files without having to mess with the source files themselves? It’s time you learn about find.

4.4.3. Aggregating files with find

The find command searches through a file system looking for objects that match rules you provide. The search outputs the names and locations of the files it discovers to what’s called standard output (stdout), which normally prints to the screen. But that output can just as easily be redirected to another command like tar, which would then copy those files to an archive.

Here’s the story. Your server is hosting a website that provides lots of .mp4 video files. The files are spread across many directories within the /var/www/html/ tree, so identifying them individually would be a pain. Here’s a single command that will search the /var/www/html/ hierarchy for files with names that include the file extension .mp4. When a file is found, tar will be executed with the argument -r to append (as opposed to overwrite) the video file to a file called videos.tar:

# find /var/www/html/ -iname "*.mp4" -exec tar \          1
 -rvf videos.tar {} \;                                    2

  • 1 The -iname flag returns both upper- and lowercase results; -name, on the other hand, searches for case-sensitive matches.
  • 2 The {} characters stand in for each file that find discovers, and the escaped semicolon (\;) marks the end of the command being executed.

In this case, it’s a good idea to run find as sudo. Because you’re looking for files in system directories, it’s possible that some of them have restrictive permissions that could prevent find from reading and, thus, reporting them.

And, because we’re talking about find, I should also tell you about a similar tool called locate that will often be your first choice when you’re in a big hurry. By default, locate searches the entire system for files matching the string that you specify. In this case, locate will look for files whose names end with the string video.mp4 (even if they have any kind of prefix):

$ locate *video.mp4

If you run locate head-to-head against find, locate will almost always return results far faster. What’s the secret? locate isn’t actually searching the file system itself, but simply running your search string against entries in a preexisting index. The catch is that if the index is allowed to fall out of date, the searches become less and less accurate. Normally the index is updated every time the system boots, but you can also manually do the job by running updatedb:

# updatedb

4.4.4. Preserving permissions and ownership...and extracting archives

Did I miss anything? Actually, yes: how to extract the files and directories from a tar archive so you can use them again. But before I get to that, there’s another bit of business I promised I’d take care of—making sure that your archive operations don’t corrupt file permissions and file-ownership attributes.

Permissions

As you’ve seen, running ls -l lists the contents of a directory in long form, showing you (from right to left) the file’s name, age, and size. But it also repeats a name (root, in this example) and provides some rather cryptic strings made up of the letters r, w, and x:

$ ls -l /bin | grep zcat
-rwxr-xr-x 1 root root 1937 Oct 27 2014 zcat

Here’s where I decipher those two leftmost sections (as annotated in figure 4.5). The 10 characters to the left are made up of four separate sections. The first dash (1 in the figure) means that the object being listed is a file. It would be replaced with a d if it were a directory. The next three characters (2) are a representation of the file’s permissions as they apply to its owner, the next three (3) are the permissions as they apply to its group, and the final three (4) represent the permissions all other users have over this file.

Figure 4.5. A breakdown of the data displayed by the ls -l command

In this example, the file owner has full authority—including read (r), write (w), and execute (x) rights. Members of the group and those in others can read and execute, but not write.

But what does all that really mean? Here, the file zcat is the script of a command-line program that reads compressed files. The permissions tell you that everyone has the right to read the script itself and to execute it (through something like zcat myfile.zip), but only the owner can edit (w) the file. If someone who’s logged in to a different user account were to try to edit the file, they’d get a No Write Permission warning.

If you want to change a file’s permissions, use the change mode (chmod) tool:

# chmod o-r /bin/zcat
# chmod g+w /bin/zcat

This example removes the ability of others (o) to read the file and adds write permissions for the group (g). A file’s owner would be represented by the letter u (for user).

What’s a group?

You can think of a group much the same way you might think of a regular user account: the things that both can and cannot do or access are defined by file permissions. The difference is that no one can log in to a Linux system as a group. Then why create groups, and what purpose do they serve? Here’s the scoop.

Groups are a powerful and super-efficient way to organize resources. Here’s a simple example. Consider a company with a few dozen employees who need some kind of server access, but not necessarily to the same resources. You can create a couple of groups called dev and IT, for example. When users are initially given their accounts, all the developers would be added to the dev group, and all the sysadmins to the IT group. Now, suppose a system configuration file comes into use: rather than tediously adding file permissions for each of the 10 or 15 admins, you can give access to the IT group alone. All the IT group members will automatically be included, and all the developers will remain excluded.

Every system user along with many applications will automatically be given their own groups. That explains why files you create will normally be owned by yourname and be part of the yourname group. If you decide to stick around, you’ll see more implementations of groups in chapter 9.

You’ll find two other systems for describing permissions in Linux: numeric and mask. Talking about mask would be a bit distracting at this point, and, in any case, mask isn’t used all that often. But you do need to understand the numeric system where each possible combination of permissions can be represented by a number between 0 and 7.

How-to guides and command documentation will often tell you to give a file 644 permissions (or something similar) in order for an operation to run successfully. For instance, invoking the private part of an encryption key pair will often not work unless it has permissions of either 400 or 600. You’ll want to know how that works.

The read permission is always given the number 4; the write permission, number 2; and execute, number 1. A user with all three permissions is described by the number 7 (4+2+1=7). Read and write permissions, but not execute, is 6; read and execute but not write is 5, and no permissions at all is 0.

To change an object’s permissions, you’d enter the final total scores for each category of user (that is, owner, group, and others). The original status of the zcat file, for example, would be represented as 755: 7 for the owner and 5 for both group and others. Removing read permission from others the way you did earlier (o-r) would change that to 751, and adding write permissions for the group (g+w) would change it again to 771. Here’s how you’d use chmod to apply that value:

# chmod 771 /bin/zcat
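
If you’d like to confirm the result, the GNU stat program can display an object’s permissions in their numeric form:

$ stat -c '%a %n' /bin/zcat
771 /bin/zcat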

Here’s a quick chart to help you remember all those details:

Permission    Character    Number
Read          r            4
Write         w            2
Execute       x            1

Ownership

What about those file ownership values? This one is straightforward: these are the values that define a file’s owner (u) and group (g). Check it out yourself. From your home directory, create a new file and then list the directory contents in long form. You’ll see that the values of both the owner and group match your user name. In this example, that’s username:

$ cd
$ touch newfile
$ ls -l
-rw-rw-r-- 1 username username 0 Jun 20 20:14 newfile

I rarely go more than a couple of days without having to worry about file ownership. Suppose one of my users asks for a file that’s too large to email or that contains sensitive data that shouldn’t be emailed. If we’re on the same server, the obvious solution is to copy it straight over; if I’m on a different server, I can always use scp to transfer the file first and then copy it into the user’s home directory. Either way, I’ll need to use sudo to copy a file to the user’s directory, which means its owner will be root.

Don’t believe me? Try creating a file using sudo:

$ sudo touch newerfile
[sudo] password for username:
$ ls -l
-rw-r--r-- 1 root root 0 Jun 20 20:37 newerfile        1

  • 1 Note that the owner and group for this file is root.

Well now, that’s going to be a real problem if my user ever needs to edit the file I was kind enough to send. Turns out that I wasn’t being so helpful after all—unless I do the job properly and change the file’s ownership using chown, which works much like the chmod command that you saw earlier. This example assumes that the account name of that other user is otheruser. Go ahead and create such an account using sudo useradd otheruser:

$ sudo chown otheruser:otheruser newerfile
$ ls -l
-rw-r--r-- 1 otheruser otheruser 0 Jun 20 20:37 newerfile      1

  • 1 Note the new file owner and group.

That’s permissions and ownership. But what does it have to do with extracting your archives? Well, would you be upset if I told you that there was a good chance that all the files and directories restored after a catastrophic system crash would have the wrong permissions? I thought so. Think about it: you rebuild your system and invite all your users to log in once again, but they immediately start complaining that they can’t edit their own files!

I think it will be helpful for you to see all this for yourself. So you can work through these examples on your own, create a new directory and populate it with a few empty files and then, if there aren’t already any other user accounts on your system, create one:

$ mkdir tempdir && cd tempdir          1
$ touch file1
$ touch file2
$ touch file3
# useradd newuser

  • 1 The && characters will execute a second command only if the first command was successful.

Right now, all three files will be owned by you. Use chown to change the ownership of one of those files to your new user, and then use ls -l to confirm that one of the files now belongs to the new user:

# chown newuser:newuser file3
$ ls -l
-rw-rw-r-- 1 username username 0 Jun 20 11:31 file1
-rw-rw-r-- 1 username username 0 Jun 20 11:31 file2
-rw-rw-r-- 1 newuser  newuser  0 Jun 20 11:31 file3

Now create a tar archive including all the files in the current directory the way you did before:

$ tar cvf stuff.tar *
file1
file2
file3

To extract the archive, run the tar command against the name of the archive, but this time with the argument x (for extract) rather than c:

$ tar xvf stuff.tar

Warning

Extracting an archive overwrites any files with the same names in the current directory without warning. Here, that’s fine, but that won’t normally be the case.

Running ls -l once again will show something you don’t want to see. All three files are now owned by you...even file3:

$ ls -l
-rw-rw-r-- 1 username username 0 Jun 20 11:31 file1
-rw-rw-r-- 1 username username 0 Jun 20 11:31 file2
-rw-rw-r-- 1 username username 0 Jun 20 11:31 file3

That’s not good, and I’m sure our friend newuser won’t be happy about it either. What’s the solution? Well, first of all, let’s try to figure out exactly what the problem is.

Generally, only users with administrator powers can work with resources in other users’ accounts. If I wanted to, say, transfer the ownership of one of my files to a colleague, I couldn’t do it on my own: because that requires a change to someone else’s account, generosity has its limits. Therefore, when I tried to restore the files from the archive as a regular user, saving them to the ownership of other users was impossible. Restoring files with their original permissions presents a similar (although not identical) problem. The solution is to perform these operations as an administrator, using sudo. Now you know.
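
Here’s a minimal sketch of the fix in action. When tar runs with superuser powers, restoring the ownership recorded in the archive is its default behavior (GNU tar calls this --same-owner), so extracting with sudo should hand file3 back to newuser:

$ sudo tar xvf stuff.tar       # as root, tar restores the archived ownership
$ ls -l file3                  # file3 should belong to newuser once again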

4.5. Archiving partitions with dd

There’s all kinds of stuff you can do with dd if you research hard enough, but where it shines is in the ways it lets you play with partitions. Earlier, you used tar to replicate entire file systems by copying the files from one computer and then pasted them as is on top of a fresh Linux install of another computer. But because those file system archives weren’t complete images, they required a running host OS to serve as a base.

Using dd, on the other hand, can make perfect byte-for-byte images of, well, just about anything digital. But before you start flinging partitions from one end of the earth to the other, I should mention that there’s some truth to that old UNIX admin joke: dd stands for Disk Destroyer. If you type even one wrong character in a dd command, you can instantly and permanently wipe out an entire drive worth of valuable data. And yes, spelling counts.

Note

As always with dd, pause and think very carefully before pressing that Enter key!

4.5.1. dd operations

Now that you’ve been suitably warned, we’ll start with something straightforward. Suppose you want to create an exact image of an entire disk of data that’s been designated as /dev/sda. You’ve plugged in an empty drive (ideally having the same capacity as your /dev/sda drive). The syntax is simple: if= defines the source drive, and of= defines the file or location where you want your data saved:

# dd if=/dev/sda of=/dev/sdb

The next example will create a .img archive of the /dev/sda drive and save it to the home directory of your user account:

# dd if=/dev/sda of=/home/username/sdadisk.img

Those commands created images of entire drives. You could also focus on a single partition from a drive. The next example does that and also uses bs to set the number of bytes to copy at a single time (4,096, in this case). Playing with the bs value can have an impact on the overall speed of a dd operation, although the ideal setting will depend on hardware and other considerations:

# dd if=/dev/sda2 of=/home/username/partition2.img bs=4096
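
One more tip: dd is famously silent while it works. Recent GNU versions include a status=progress option that prints a running count of the bytes copied, which takes some of the suspense out of long jobs like these:

# dd if=/dev/sda2 of=/home/username/partition2.img bs=4096 status=progress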

Restoring is simple: effectively, you reverse the values of if and of. In this case, if= takes the image that you want to restore, and of= takes the target drive to which you want to write the image:

# dd if=sdadisk.img of=/dev/sdb

You should always test your archives to confirm they’re working. If it’s a boot drive you’ve created, stick it into a computer and see if it launches as expected. If it’s a normal data partition, mount it to make sure the files both exist and are appropriately accessible.
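
For a data partition image, one way to run that test is to mount the image read-only through a loop device and poke around (the /mnt/test mount point is just an example; create whatever works for you):

$ sudo mkdir -p /mnt/test
$ sudo mount -o loop,ro /home/username/partition2.img /mnt/test
$ ls /mnt/test                 # are all your files where you expect them?
$ sudo umount /mnt/test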

4.5.2. Wiping disks with dd

Years ago I had a friend who was responsible for security at his government’s overseas embassies. He once told me that each embassy under his watch was provided with an official government-issue hammer. Why? In case the facility was ever at risk of being overrun by unfriendlies, the hammer was to be used to destroy all their hard drives.

What’s that? Why not just delete the data? You’re kidding, right? Everyone knows that deleting files containing sensitive data from storage devices doesn’t actually remove them. Given enough time and motivation, nearly anything can be retrieved from virtually any digital media, with the possible exception of the ones that have been well and properly hammered.

You can, however, use dd to make it a whole lot more difficult for the bad guys to get at your old data. This command will spend some time writing millions and millions of zeros over every nook and cranny of the /dev/sda1 partition:

# dd if=/dev/zero of=/dev/sda1

But it gets better. Using the /dev/urandom file as your source, you can write over a disk with random characters:

# dd if=/dev/urandom of=/dev/sda1
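
And if one pass doesn’t feel thorough enough, nothing stops you from scripting a few. Here’s a rough sketch (be warned: this will take a long time on a large partition) that runs two random passes followed by a final pass of zeros:

# for n in 1 2; do dd if=/dev/urandom of=/dev/sda1 bs=1M; done
# dd if=/dev/zero of=/dev/sda1 bs=1M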

4.6. Synchronizing archives with rsync

One thing you already know about proper backups is that, to be effective, they absolutely have to happen regularly. One problem with that is that daily transfers of huge archives can place a lot of strain on your network resources. Wouldn’t it be nice if you only had to transfer the small handful of files that had been created or updated since the last time, rather than the whole file system? Done. Say hello to rsync.

I’m going to show you how to create a remote copy of a directory full of files and maintain the accuracy of the copy even after the local files change. (You’ll first need to make sure that the rsync package is installed on both the client and host machines you’ll be using.) To illustrate this happening between your own local machine and a remote server (perhaps an LXC container you’ve got running), create a directory and populate it with a handful of empty files:

$ mkdir mynewdir && cd mynewdir
$ touch file{1..10}                1

  • 1 Creates 10 files named file1 to file10

Now use ssh to create a new directory on your remote server where the copied files will go, and then run rsync with the -av arguments. The v tells rsync to display a verbose list of everything it does. a is a bit more complicated, but also a whole lot more important. Specifying the -a super-argument will make rsync synchronize recursively (meaning that subdirectories and their contents will also be included) and preserve special files, modification times, and (critically) ownership and permissions attributes. I’ll bet you’re all-in for -a. Here’s the example:

$ ssh [email protected] "mkdir syncdirectory"
$ rsync -av * [email protected]:syncdirectory       1
[email protected]'s password:
sending incremental file list
file1                                                 2
file10
file2
file3
file4
file5
file6
file7
file8
file9

sent 567 bytes  received 206 bytes  1,546.00 bytes/sec
total size is 0  speedup is 0.00

  • 1 Specify a remote target directory following the colon (:).
  • 2 The verbose argument displays the files that were copied.

If everything went as it should, head over to your remote server and list the contents of syncdirectory. There should be 10 empty files.

To give rsync a proper test run, you could add a new file to the local mynewdir directory and use nano to, say, add a few words to one of the existing files. Then run the exact same rsync command as before. When it’s done, see if the new file and updated version of the old one have made it to the remote server:

$ touch newfile
$ nano file3
$ rsync -av * [email protected]:syncdirectory
[email protected]'s password:
sending incremental file list
file3                             1
newfile

  • 1 Only the new/updated files are listed in the output.
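
Before you go, two more rsync arguments are worth a test drive: -n (also written --dry-run) shows what a sync would do without actually transferring anything, and --delete removes files from the target that no longer exist on the source. The second one is powerful and slightly dangerous, which is exactly why you’ll want the first one. Note that I’m syncing the whole directory (./) here so that --delete can compare both sides properly:

$ rsync -avn --delete ./ [email protected]:syncdirectory   # preview only
$ rsync -av --delete ./ [email protected]:syncdirectory    # the real thing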

There’s a whole lot more rsync backup goodness waiting for you to discover. But, as with all the other tools I discuss in this book, you now have the basics. Where you go from here is up to you. In the next chapter, however, you’ll learn about automating backups using system schedulers. For now, there’s one final thought I’d like to share about backups.

4.7. Planning considerations

Careful consideration will go a long way toward determining how much money and effort you invest in your backups. The more valuable your data is to you, the more reliable your backups should be. The goal is to measure the value of your data against these questions:

  • How often should you create new archives, and how long will you retain old copies?
  • How many layers of validation will you build into your backup process?
  • How many concurrent copies of your data will you maintain?
  • How important is maintaining geographically remote archives?

Another equally important question: should you consider incremental or differential backups? Although you’re probably going to want to use rsync either way, the way you sequence your backups can have an impact on both the resources they consume and the availability of the archives they produce.

Using a differential system, you might run a full backup once a week (Monday), and smaller and quicker differential backups on each of the next six days. The Tuesday backup will include only files changed since Monday’s backup. The Wednesday, Thursday, and Friday backups will each include all files changed since Monday. Friday’s backup will, obviously, take up more time and space than Tuesday’s. On the plus side, restoring a differential archive requires only the last full backup and the most recent differential backup.

An incremental system might also perform full backups only on Mondays and can also run a backup covering only changed files on Tuesday. Wednesday’s backup, unlike the differential approach, will include only files added or changed since Tuesday, and Thursday’s will have only those changed since Wednesday. Incremental backups will be fast and efficient; but, as the updated data is spread across more files, restoring incremental archives can be time-consuming and complicated. This is illustrated in figure 4.6.

Figure 4.6. The differences between incremental and differential backup systems
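
If you’re wondering what an incremental setup can look like in practice, GNU tar has native support through its --listed-incremental flag, which records state in a snapshot file between runs. Here’s a minimal sketch (all the file names are my own): the first run with a fresh snapshot file behaves as a full backup, while later runs using the same snapshot file archive only what’s changed:

$ tar --listed-incremental=backup.snar -czvf backup-full.tar.gz importantstuff/
$ tar --listed-incremental=backup.snar -czvf backup-incr1.tar.gz importantstuff/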

Summary

  • Not having good backups can ruin your morning.
  • The tar command is generally used for archiving full or partial file systems, whereas dd is more suited for imaging partitions.
  • Adding compression to an archive not only saves space on storage drives, but also bandwidth during a network transfer.
  • Directories containing pseudo file systems usually don’t need backing up.
  • You can incorporate the transfer of an archive into the command that generates it, optionally avoiding any need to save the archive locally.
  • It’s possible—and preferred—to preserve the ownership and permissions attributes of objects restored from an archive.
  • You can use dd to (fairly) securely wipe old disks.
  • You can incrementally synchronize archives using rsync, greatly reducing the time and network resources needed for ongoing backups.

Key terms

  • An archive is a specially formatted file in which file system objects are bundled.
  • Compression is a process for reducing the disk space used by a file through the application of a compression algorithm.
  • An image is an archive containing the files and directory structure necessary to re-create a source file system in a new location.
  • Permissions are the attributes assigned to an object that determine who may use it and how.
  • Ownership is the owner and group that have authority over an object.
  • A group is an account used to manage permissions for multiple users.

Security best practices

  • Create an automated, reliable, tested, and secure recurring process for backing up all of your important data.
  • Where appropriate, separate file systems with sensitive data by placing them on their own partitions and mounting them to the file system at boot time.
  • Always ensure that file permissions are accurate, and allow only the least access necessary.
  • Never assume the data on an old storage drive is truly deleted.

Command-line review

  • df -h displays all currently active partitions with sizes shown in a human readable format.
  • tar czvf archivename.tar.gz /home/myuser/Videos/*.mp4 creates a compressed archive from video files in a specified directory tree.
  • split -b 1G archivename.tar.gz archivename.tar.gz.part splits a large file into smaller files of a set maximum size.
  • find /var/www/ -iname "*.mp4" -exec tar -rvf videos.tar {} \; finds files meeting a set criteria and streams their names to tar to include in an archive.
  • chmod o-r /bin/zcat removes read permissions for others.
  • dd if=/dev/sda2 of=/home/username/partition2.img creates an image of the sda2 partition and saves it to your home directory.
  • dd if=/dev/urandom of=/dev/sda1 overwrites a partition with random characters to obscure the old data.

Test yourself

1

Which of these arguments tells tar to compress an archive?

  1. -a
  2. -v
  3. -z
  4. -c

2

Which of these partitions are you least likely to want to include in a backup archive?

  1. /var
  2. /run
  3. /
  4. /home

3

The second partition on the first storage drive on a system will usually be designated by which of the following?

  1. /dev/sdb2
  2. /dev/srb0
  3. /dev/sda2
  4. /dev/sdb1

4

Which of the following will create a compressed archive of all the .mp4 files in a directory?

  1. tar cvf archivename.tar.gz *.mp4
  2. tar cvf *.mp4 archivename.tar.gz
  3. tar czvf archivename.tar.gz *.mp4
  4. tar *.mp4 czvf archivename.tar

5

Which of the following tools will help you put multiple file parts back together?

  1. cat
  2. split
  3. |
  4. part

6

Which of the following will find all .mp4 files within the specified directories and add them to a tar archive?

  1. find /var/www/ -iname "*" -exec tar -rvf videos.tar {} \;
  2. find /var/www/ -iname "*.mp4" -exec tar -vf videos.tar {} \;
  3. find /var/www/ -iname "*.mp4" | tar -rvf videos.tar {} \;
  4. find /var/www/ -iname "*.mp4" -exec tar -rvf videos.tar {} \;

7

Which of the following will give a file’s owner full rights, its group read and execute rights, and others only execute rights?

  1. chmod 752
  2. chmod 751
  3. chmod 651
  4. chmod 744

8

What will the command dd if=sdadisk.img of=/dev/sdb do?

  1. Copy the contents of the /dev/sdb drive to a file called sdadisk.img
  2. Destroy all data on the network
  3. Copy an image called sdadisk.img to the /dev/sdb drive
  4. Format the /dev/sdb drive and then move sdadisk.img to it

Answer key

1. c
2. b
3. c
4. c
5. a
6. d
7. b
8. c
