Chapter 23

Preventing Disasters

In an enterprise network, a disaster can strike at any time. While, as administrators, we always do our best to design the most stable and fault-tolerant server implementations we possibly can, what matters most is how we are able to deal with disasters when they do happen. As stable as server hardware is, any component of a server can fail at any time. In the face of a disaster, we need a plan. How can you attempt to recover data from a failed disk? What do you do when your server all of a sudden decides it doesn't want to boot? These are just some of the questions we'll answer as we take a look at several ways we can prevent and recover from disasters. In our final chapter, we'll cover the following topics:

  • Preventing disasters
  • Utilizing Git for configuration management
  • Implementing a backup plan
  • Replacing failed RAID disks
  • Utilizing bootable recovery media

We'll start off our final chapter by looking at a few tips to help prevent disaster.

Preventing disasters

As we proceed through this chapter, we'll look at ways we can recover from disasters. However, if we can prevent a disaster from occurring in the first place, then that's even better. We certainly can't prevent every type of disaster that could possibly happen but having a good plan in place and following that plan will lessen the likelihood. A good disaster recovery plan will include a list of guidelines to be followed with regard to implementing new servers and managing current ones.

This plan may include information such as an approved list of hardware (such as hardware configurations known to work efficiently in an environment), as well as rules and regulations for users, a list of guidelines to ensure physical and software security, proper training for end users, and a change control process. We've touched on some of these concepts earlier in the book, but they're worth repeating from the standpoint of disaster prevention.

First, we talked about the principle of least privilege back in Chapter 21, Securing Your Server. The idea is to give your users as few permissions as possible. This is very important for security, as you want to ensure only those trained in their specific jobs are able to access and modify only the resources that they are required to. Accidental data deletion happens all the time. To take full advantage of this principle, create a set of groups as part of your overall security design. List departments and positions in your company and the types of activities each is required to perform. Create system groups that correspond to those activities. For example, create an accounting-ro and accounting-rw group to categorize users within your Accounting department that should have the ability to only read or read and write data. If you're simply managing a home file server, be careful of open network shares where users have read and write access by default. By allowing users to do as little as possible, you'll prevent a great many disasters right away.
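
To give you an idea of how that might look on the command line, here's a minimal sketch. The group names follow the example above, but the share path /srv/accounting and the user jdoe are hypothetical, and the read-only permissions assume the acl package is installed:

# Create the groups and add a (hypothetical) user to the read/write group
sudo groupadd accounting-ro
sudo groupadd accounting-rw
sudo usermod -aG accounting-rw jdoe
# Example share directory: accounting-rw owns it, accounting-ro gets read-only access via an ACL
sudo mkdir -p /srv/accounting
sudo chgrp accounting-rw /srv/accounting
sudo chmod 2770 /srv/accounting
sudo setfacl -R -m g:accounting-ro:rX /srv/accounting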

In Chapter 2, Managing Users and Permissions (as well as Chapter 21, Securing Your Server), we talked about best practices for the sudo command. While the sudo command is useful, it's often misused. By default, anyone that's a member of the sudo group can use sudo to do whatever they want. We talked about how to restrict sudo access to particular commands, which is always recommended. Only trusted administrators should have full access to sudo. Everyone else should have sudo permissions only if they really need them, and even then, only when it comes to commands that are required for their job. A user with full access to sudo can delete an entire filesystem, so it should never be taken lightly.
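
As a quick illustration of what a restricted entry can look like (the username is hypothetical, and the exact commands you allow will depend on the person's job), you could run sudo visudo and add a line along these lines:

jdoe    ALL=(ALL) /usr/bin/systemctl restart apache2, /usr/bin/systemctl reload apache2

With that in place, jdoe can restart or reload Apache with sudo, but nothing else.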

In regard to network shares, it's always best to default to read-only whenever possible. This isn't just because of the possibility of a user accidentally deleting data; it's always possible for applications to malfunction and delete data as well. With a read-only share, the modification or deletion of files isn't possible. Additional read-write shares can be created for those who need it, but if possible, always default to read-only.
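
If you happen to be sharing files with Samba, for instance, defaulting to read-only is a single parameter in /etc/samba/smb.conf. The share name and path below are just placeholders:

[documents]
    path = /srv/share/documents
    read only = yes
    guest ok = no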

Although I've spent a lot of time discussing security in a software sense, physical security is important too. For the purposes of this book, physical security doesn't really enter the discussion much because our topic is specifically Ubuntu Server, and nothing you install on Ubuntu is going to increase the physical security of your servers. It's worth noting, however, that physical security is every bit as important as securing your operating systems, applications, and data files.

All it would take is someone tripping over a network cable in a server room to disrupt an entire subnet or cause a production application to go offline. Server rooms should be locked, and only trusted administrators should be allowed to access your equipment. I'm sure this goes without saying and may sound obvious, but I've worked at several companies that did not secure their server room. Nothing good ever comes from placing important equipment within arm's reach of unauthorized individuals.

In this section, I've mentioned Chapter 21, Securing Your Server, a couple of times. A good majority of a disaster prevention plan includes a focus on security. This includes, but is not limited to, ensuring security updates are installed in a timely fashion, utilizing security applications such as failure monitors and firewalls, and ensuring secure settings for OpenSSH. I won't go over these concepts again here since we've already covered them, but essentially security is a very important part of a disaster prevention plan. After all, users cannot break what they cannot access, and hackers will have a harder time penetrating your network if you designed it in a security-conscious way.

Effective disaster prevention consists of a list of guidelines for things such as user management, server management, application installations, security, and procedure documents. A full walkthrough of proper disaster prevention would be an entire book in and of itself. My goal with this section is to provide you with some ideas you can use to begin developing your own plan. A disaster prevention plan is not something you'll create all at once but is rather something you'll create and refine indefinitely as you learn more about security and what types of things to watch out for.

The configuration files on your server determine the behavior of your services and applications, and backing them up enables you to recover that state. When it comes to managing the state of files, Git is a very powerful tool to do just that, and we'll talk about it next.

Utilizing Git for configuration management

One of the most valuable assets on a server is its configuration. This is second only to the data the server stores. Often, when we implement a new technology on a server, we'll spend a great deal of time editing configuration files all over the server to make it work as best as we can. This can include any number of things, from Apache virtual host files to DHCP server configuration, DNS zone files, and more. If a server were to encounter a disaster from which the only recourse was to completely rebuild it, the last thing we'd want to do is re-engineer all of this configuration from scratch. This is where Git comes in.

In a typical development environment, an application being developed by a team of engineers can be managed by Git, each contributing to a repository that hosts the source code for their software. One of the things that makes Git so useful is how you're able to go back to previous versions of a file in an instant, as it keeps a history of all changes made to the files within the repository.

Git isn't just useful for software engineers, though. It's also a really useful tool we can leverage for keeping track of configuration files on our servers. For our use case, we can use it to record changes to configuration files and push them to a central server for backup. When we make configuration changes, we push the changes back to our Git server. If for some reason we need to restore the configuration after a server fails, we can simply download our configuration files from Git back onto our new server. Another useful aspect of this approach is that if an administrator implements a change to a configuration file that breaks a service, we can simply revert to a known working commit and we'll be immediately back up and running. You can even correlate changes in log files to changes made at around the same time in a Git repository, which makes it easier to narrow down the root cause of an issue.

Configuration management on servers is so important, in fact, that I highly recommend every Linux administrator take advantage of version control for this purpose. Although it may seem a bit tricky at first, it's actually really easy to get going once you practice with it. Once you've implemented Git for keeping track of all your server's configuration files, you'll wonder how you ever lived without it. We covered Git briefly in Chapter 15, Automating Server Configuration with Ansible, where I walked you through creating a repository on GitHub to host Ansible configuration. However, GitHub may or may not be a good place for your company's configuration, because not only do most companies have policies against sharing internal configuration, but configuration files may also contain sensitive information that you wouldn't want others to see. Thankfully, you don't really need GitHub in order to use Git; you can use a local server for Git on your network.

I'll walk you through what you'll need to do in order to implement this approach. To get started, you'll want to install the git package:

sudo apt install git

In regard to your Git server, you don't necessarily have to dedicate a server just for this purpose; you can use an existing server. The only important aspect here is that you have a central server onto which you can store your Git repositories. All your other servers will need to be able to reach it via your network. On whatever machine you've designated as your Git server, install the git package as well. Believe it or not, that's all there is to it. Since Git uses OpenSSH by default, we only need to make sure the git package is installed on the server as well as our clients.

We'll need a directory on that server to house our Git repositories, and the users on your servers that utilize Git will need to be able to modify that directory.

Now, think of a configuration directory that's important to you, that you want to place into version control. A good example is the /etc/apache2 directory on a web server. That's what I'll use in my examples in this section. But you're certainly not limited to that. Any configuration directory you would rather not lose is a good candidate. If you choose to use a different configuration path, change the paths I give you in my examples to that path.

On the server, create a directory to host your repositories. I'll use /git in my examples:

sudo mkdir /git

Next, you'll want to modify this directory to be owned by the administrative user you use on your Ubuntu servers. Typically, this is the user that was created during the installation of the distribution. You can use any user you want actually, just make sure this user is allowed to use OpenSSH to access your Git server. Change the ownership of the /git directory so it is owned by this user. My user on my Git server is jay, so in my case, I would change the ownership with the following command:

sudo chown jay:jay /git

Next, we'll create our Git repository within the /git directory. For Apache, I'll create a bare repository for it within the /git directory. A bare repository is basically a skeleton of a Git repository that doesn't contain any useful data, just some default configuration to allow it to act as a Git folder. To create the bare repository, cd into the /git directory and execute:

git init --bare apache2

You should see the following output:

Initialized empty Git repository in /git/apache2/

That's all we need to do on the server for now for the purposes of our Apache repository. On your client (the server that houses the configuration you want to place under version control), we'll copy this bare repository by cloning it. To set that up, create a /git directory on your Apache server (or whatever kind of server you're backing up) just as we did before. Then, cd into that directory and clone your repository with the following command:

git clone 192.168.1.101:/git/apache2

For that command, replace the IP address with either the IP address of your Git server or its hostname if you've created a DNS entry for it. You should see the following output, warning us that we've cloned an empty repository:

warning: You appear to have cloned an empty repository

This is fine; we haven't actually added anything to our repository yet. If you were to cd into the directory we just cloned and list its contents, you'd see it as an empty directory. If you use ls -a to view hidden directories as well, you'll see a .git directory inside. Inside the .git directory, we'll have configuration items for Git that allow this repository to function properly. For example, the config file in the .git directory contains information on where the remote server is located. We won't be manipulating this directory; I just wanted to give you a quick overview of what its purpose is.

Note that if you delete the .git directory in your cloned repository, that basically removes version control from the directory and makes it a normal directory.
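
For example, rather than reading .git/config by hand, you can ask Git where the clone's remote points:

git remote -v

The output should show your Git server's address for both fetch and push, similar to the following:

origin  192.168.1.101:/git/apache2 (fetch)
origin  192.168.1.101:/git/apache2 (push)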

Anyway, let's continue. We should first make a backup of our current /etc/apache2 directory on our web server, in case we make a mistake while converting it to being version controlled:

sudo cp -rp /etc/apache2 /etc/apache2.bak

Then, we can move all the contents of /etc/apache2 into our repository:

sudo mv /etc/apache2/* /git/apache2

The /etc/apache2 directory is now empty. Be careful not to restart Apache at this point; it won't see its configuration files and will fail. Remove the (now empty) /etc/apache2 directory:

sudo rmdir /etc/apache2

Now, let's make sure that Apache's files are owned by root. The problem, though, is that if we use the chown command as we normally would to change ownership, we'll also change the .git directory to be owned by root. We don't want that, because the user responsible for pushing changes should be the owner of the .git folder. The following command will change the ownership of the files to root, but won't touch hidden directories such as .git:

sudo find /git/apache2 -name '.?*' -prune -o -exec chown root:root {} +

When you list the contents of your repository directory now, you should see that all files are owned by root, except for the .git directory, which should be owned by your administrative user account.

Next, create a symbolic link to your Git repository so the apache2 daemon can find it:

sudo ln -s /git/apache2 /etc/apache2

At this point, you should see a symbolic link for Apache, located at /etc/apache2. If you list the contents of /etc while grepping for apache2, you should see it as a symbolic link:

ls -l /etc | grep apache2

The directory listing will look similar to the following:

lrwxrwxrwx 1 root root 37 2020-06-25 20:59 apache2 -> /git/apache2

If you reload Apache, nothing should change and it should find the same configuration files as it did before, since its directory in /etc now maps to /git/apache2, which contains the same files:

sudo systemctl reload apache2

If you see no errors, you should be all set. Otherwise, make sure you created the symbolic link properly.

Next, we get to the main attraction. We've copied Apache's files into our repository, but we haven't actually pushed those changes back to our Git server yet. To set that up, we'll need to add the files within our /git/apache2 directory to version control. The reason for this is that simply being in the Git repository folder isn't enough for Git to care about a file. We have to tell Git to pay attention to individual files. We can add every file within our Git repository for Apache by entering the following command from within that directory:

git add .

This basically tells Git to add everything in the directory to version control. You can actually do the following to add an individual file:

git add <filename>

In this case, we want to add everything, so we used a period in place of a directory name to add the entire current directory.

If you run the git status command from within your Git repository, you should see output indicating that Git has new files that haven't been committed yet. A Git commit simply finalizes the changes locally. Basically, it packages up your current changes to prepare them for being copied to the server. To create a commit of all the files we've added so far, cd into your /git/apache2 directory and run the following to stage a new commit:

git commit -a -m "My first commit."

With this command, the -a option tells Git that you want to include anything that's changed in your repository. The -m option allows you to attach a message to the commit, which is actually required. If you don't use the -m option, it will open your default text editor and allow you to add a comment from there.

Finally, we can push our changes back to the Git server:

git push origin master

By default, the git suite of commands utilizes OpenSSH, so our git push command should create an SSH connection back to our server and push the files there. You won't be able to inspect the contents of the Git directory on your Git server, because it won't contain the same file structure as your original directory. Whenever you pull a Git repository though, the resulting directory structure will be just as you left it.
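
If you're ever curious about what a bare repository on the Git server holds, you don't need to browse its directory structure; you can ask Git to list the files in its latest commit instead. Here's a quick sketch, run on the Git server itself and using our /git/apache2 example:

git --git-dir=/git/apache2 ls-tree --name-only HEAD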

From this point forward, if you need to restore a repository onto another server, all you should need to do is perform a Git clone. To clone the repository into your current working directory, execute the following:

git clone 192.168.1.101:/git/apache2

Now, each time you make changes to your configuration files, you can perform a git commit and then push the changes up to the server to keep the content safe:

git commit -a -m "Updated config files."git push origin master

Now we know how to create a repository, push changes to a server, and pull the changes back down. Finally, we'll need to know how to revert changes should our configuration get changed with non-working files. First, we'll need to locate a known working commit. My favorite method is using the tig command. The tig package must be installed for this to work, but it's a great utility to have:

sudo apt install tig

The tig command (which is just git backward) gives us a semi-graphical interface to browse through our Git commits. To use it, simply execute the tig command from within a Git repository. In the following example screenshot, I've executed tig from within a Git repository on one of my servers:

Figure 23.1: An example of the tig command, looking at a repository for the bind9 daemon

While using tig, you'll see a list of Git commits, along with their dates and comments that were entered with each. To inspect one, press the up and down arrows to change your selection, then press Enter on the one you want to view. You'll see a new window, which will show you the commit hash (which is a long string of alphanumeric characters), as well as an overview of which lines were added or removed from the files within the commit. To revert one, you'll first need to find the commit you want to revert to and get its commit hash. The tig command is great for finding this information. In most cases, the commit you'll want to revert to is the one before the change took place. In my example screenshot, I fixed the syntax issue on 9/17/2020. If I want to restore that file, I should revert to the commit below that. I can get the commit hash by highlighting that entry and pressing Enter. It's at the top of the window. Then, I can exit tig by pressing q, and then revert to that commit:

git checkout 356dd6153f187c1918f6e2398aa6d8c20fd26032

And just like that, the entire directory tree for the repository instantly changes to exactly what it was before the bad commit took place. I can then restart or reload the daemon for this repository, and it will be back to normal. At this point, you'd want to test the application to make sure that the issue is completely fixed. After some time has passed and you're finished testing, you can make the change permanent. First, we switch back to the most recent commit:

git checkout master

Then, we permanently revert everything that was committed after the known working commit (the ..HEAD range tells Git to revert each commit made since that point):

git revert --no-commit 356dd6153f187c1918f6e2398aa6d8c20fd26032..HEAD

Then, we can commit our reverted Git repository and push it back to the server:

git commit -a -m "The previous commit broke the application. Reverting."git push origin master

As you can see, Git is a very useful ally to utilize when managing configuration files on your servers. This benefits disaster recovery, because if a bad change is made that breaks a daemon, you can easily revert the change. If the server were to fail, you can recreate your configuration almost instantly by just cloning the repository again. There's certainly a lot more to Git than what we've gone over in this section, so feel free to pick up a book about it if you wish to take your knowledge to the next level. But in regard to managing your configuration with Git, all you'll need to know is how to place files into version control, update them, and clone them to new servers. Some services you run on a server may not be good candidates for Git, however. For example, managing an entire MariaDB database via Git would be a nightmare, since there is too much overhead with such a use case, and database entries would likely change too rapidly for Git to keep up. Use your best judgment. If you have some configuration files that are only manipulated every once in a while, they'll be perfect candidates for Git.

Backups are one of those things that some people don't seem to take seriously until it's too late. Data loss can be a catastrophic event for an organization, so it's imperative that you implement a solid backup plan. In the next section, we'll look at what that entails.

Implementing a backup plan

Creating a solid backup plan is one of the most important things you'll ever do as a server administrator. Even if you're only using Ubuntu Server at home as a personal file server, backups are critical. During my career, I've seen disks fail many times. I often hear arguments about which hard disk manufacturer beats others in terms of longevity, but I've seen disk failures so often that I don't trust any of them. All disks will fail eventually; it's just a matter of when. And when they do fail, they'll usually fail hard with no easy way to recover data from them. A sound approach to managing data is to assume that any disk or server can fail, and to make sure it won't matter when one does, since you'll be able to regenerate your data from other sources, such as a backup or secondary server.

There's no one best backup solution, since it all depends on what kind of data you need to secure, and what software and hardware resources are available to you. For example, if you manage a database that's critical to your company, you should back it up regularly. If you have another server available, set up a replication secondary server so that your primary database isn't a single point of failure. Not everyone has an extra server lying around, so sometimes you have to work with what you have available. This may mean that you'll need to make some compromises, such as creating regular snapshots of your database server's storage volume or regularly dumping a backup of your important databases to an external storage device.
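
As a rough sketch of that last idea, here's what a nightly database dump might look like. Everything here is an assumption rather than something we configured earlier in the book: a MariaDB database named appdb, the default unix_socket authentication for the root account, and a /backup/db directory that already exists and is writable by your user:

sudo mysqldump appdb | gzip > /backup/db/appdb-$(date +%F).sql.gz

You could then schedule something like this with cron so it runs automatically, and pair it with rsync to get the dump off the server.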

The rsync utility is one of the most valuable pieces of software around to server administrators. It allows us to do some really wonderful things. In some cases, it can save us quite a bit of money. For example, online backup solutions are wonderful in the sense that we can use them to store off-site copies of our important files. However, depending on the volume of data, they can be quite expensive. With rsync, we can back up our data in much the same way, with not only our current files copied over to a backup target but also differentials. If we have another server to send the backup to, even better.

At one company I managed servers for, they didn't want to subscribe to an online backup solution. To work around that, a server was set up as a backup point for rsync. We set up rsync to back up to the secondary server, which housed quite a lot of files. Once the initial backup was complete, the secondary server was sent to one of our other offices in another state. From that point forward, we only needed to run rsync weekly, to back up everything that had been changed since the last backup. Sending files via rsync to the other site over the internet was rather slow, but since the initial backup was already complete before we sent the server there, all we needed to back up each week was differentials. Not only is this an example of how awesome rsync is and how we can configure it to do pretty much what paid solutions do but the experience was also a good example of utilizing what you have available to you.

Since we've already gone over rsync in Chapter 12, Sharing and Transferring Files, I won't repeat too much of that information here. But since we're on the subject of backing up, the --backup-dir option is worth mentioning again. This option allows you to copy files that would normally be replaced to another location. As an example, here's the rsync command I mentioned in Chapter 12, Sharing and Transferring Files:

CURDATE=$(date +%m-%d-%Y)
export CURDATE
sudo rsync -avb --delete --backup-dir=/backup/incremental/$CURDATE /src /target

This command was part of the topic of creating an rsync backup script. The first command simply captures today's date and stores it in a variable named $CURDATE. In the actual rsync command, we refer to this variable. The -b option (part of the -avb option string) tells rsync to make a copy of any file that would normally be replaced. If rsync is going to replace a file on the target with a new version, it will move the original file to a new name before overwriting it. The --backup-dir option tells rsync that when it's about to overwrite a file, to put it somewhere else instead of copying it to a new name. We give the --backup-dir option a path, where we want the files that would normally be replaced to be copied to. In this case, the backup directory includes the $CURDATE variable, which will be different every day. For example, a backup run on 8/16/2020 would have a backup directory of the following path, if we used the command I gave as an example:

/backup/incremental/08-16-2020

This essentially allows you to keep differentials. Files on /src will still be copied to /target, but the directory you identify as --backup-dir will contain the original files before they were replaced that day.

On my servers, I use the --backup-dir option with rsync quite often. I'll typically set up an external backup drive, with the following three folders:

  • Current
  • Archive
  • logs

The Current directory always contains a current snapshot of the files on my server. The Archive directory on my backup disks is where I point the --backup-dir option. Within that directory will be folders named with the dates that the backups were taken. The logs directory contains log files from the backup. Basically, I redirect the output of my rsync command to a log file within that directory, each log file being named with the same $CURDATE variable so I'll also have a backup log for each day the backup runs. I can easily look at any of the logs for which files were modified during that backup, and then traverse the archive folder to find an original copy of a file. I've found this approach to work very well. Of course, this backup is performed with multiple backup disks that are rotated every week, with one always off-site. It's always crucial to keep a backup off-site in case of a situation that could compromise your entire local site.
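
Putting those pieces together, a minimal version of such a backup script might look like the following. The mount point /backup and the source directory /srv are assumptions here; substitute your own paths, and run it as root (from root's crontab, for example):

#!/bin/bash
# Daily rsync backup with dated differentials and a log file
CURDATE=$(date +%m-%d-%Y)
rsync -avb --delete \
    --backup-dir=/backup/Archive/$CURDATE \
    /srv/ /backup/Current/ > /backup/logs/$CURDATE.log 2>&1

Each run leaves the latest copy in Current, the files it replaced (or deleted) in Archive under that day's date, and a log you can review later.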

The rsync utility is just one of many you can utilize to create your own backup scheme. The plan you come up with will largely depend on what kind of data you're wanting to protect and what kind of downtime you're willing to endure.

Ideally, we would have an entire warm site with servers that are carbon copies of our production servers, ready to be put into production should any issues arise, but that's also very expensive, and whether you can implement such a routine will depend on your budget. However, Ubuntu has many great utilities available you can use to come up with your own system that works. If nothing else, utilize the power of rsync to back up to external disks and/or external sites.

With the concept of cloud computing becoming more and more popular, replacing failed RAID disks is something that you may not ever have to do. But if you do have servers in use that utilize RAID, then it's a good idea to at least understand the procedure in general. In the next section, we'll take a look at that process.

Replacing failed RAID disks

RAID is a very useful technology, as it can help your server survive the crash of a single disk. RAID is not a backup solution, but more of a safety net that will hopefully prevent you from having to reload a server. The idea behind RAID is having redundancy, so that data is mirrored or striped among several disks. With most RAID configurations, you can survive the loss of a single disk, so if a disk fails, you can usually replace it, re-sync, and be back to normal. The server itself will continue to work, even if there is a failed disk. However, losing additional disks will likely result in the failure of the entire array. When a RAID disk fails, you will need to replace that disk as quickly as you can, hopefully before another disk goes out too.

The default live installer for Ubuntu Server doesn't offer a RAID setup option, but the alternate installer does. If you wish to set up Ubuntu Server with RAID, check out the Appendix at the end of this book.

To check the status of a RAID configuration, you would use the following command:

cat /proc/mdstat

This will produce output like the following:

Figure 23.2: A healthy RAID array

In Figure 23.2, we have a RAID 1 array with two disks. We can tell this from the active raid1 portion of the output. On the next line down, we see this:

[UU]

Believe it or not, this references a healthy RAID array, which means both disks are online and are working properly. If any one of the U signifiers changes to an underscore, then that means a disk has gone offline and we will need to replace it. Here's a screenshot showing output from that same command on a server with a failed RAID disk:

Figure 23.3: RAID status output with a faulty drive

As you can see from Figure 23.3, we have a problem. The /dev/sda disk is online, but /dev/sdb has gone offline. So, what should we do? First, we would need to make sure we understand which disk is working, and which disk is the one that's faulty. We already know that the disk that's faulty is /dev/sdb, but when we open the server's case, we're not going to know which disk /dev/sdb actually is. If we pull the wrong disk, we could make this problem much worse than it already is. We can use the hdparm command to get a little more info from our drive:

sudo hdparm -i /dev/sda

This will give us info regarding /dev/sda, the disk that's currently still functioning properly:

Figure 23.4: Output of the hdparm command

The reason why we're executing this command against a working drive is we want to make sure we understand which disk we should NOT remove from the server. Also, the faulty drive may not respond to our attempts to interrogate information from it. Currently, /dev/sda is working fine, so we will not want to disconnect the cables attached to that drive at any point. If you have a RAID array with more than two disks, you'll want to execute the hdparm command against each. From the output of the hdparm command, we can see that /dev/sda has a serial number of 45M4B24AS. When we look inside the case, we can compare the serial number on the drives' labels and make sure we do not remove the drive with this serial number.

Next, assuming we already have the replacement disk on hand, we will want to power down the server. Depending on what the server is used for, we may need to do this after hours, but we typically cannot remove a disk while a server is running. Once it's shut down, we can narrow down which disk /dev/sdb is (or whatever drive designation the failed drive has) and replace it. Then, we can power on the server (it will probably take much longer to boot this time; that's to be expected given our current situation).

However, simply adding a replacement disk will not automatically resolve this issue. We need to add the new disk to our RAID array for it to accept it and rebuild the RAID. This is a manual process. The first step in rebuilding the RAID array is finding out which designation our new drive received, so we know which disk we are going to add to the array. After the server boots, execute the following command:

sudo fdisk -l

You'll see output similar to the following:

Figure 23.5: Checking current disks with fdisk

From the output, it should be obvious which disk the new one is. /dev/sda is our original disk, and /dev/sdb is the one that was just added. To make it more obvious, we can see from the output that /dev/sda has a partition, of type Linux raid autodetect. /dev/sdb doesn't have this.

So now that we know which disk is the new one, we can add it to our RAID array. First, we need to copy over the partition tables from the first disk to the new one. The following command will do that:

sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb

Essentially, we are cloning the partition table from /dev/sda (the working drive) to /dev/sdb (the one we just replaced). If you run the same fdisk command we ran earlier, you should see that they both have partitions of type Linux raid autodetect now:

sudo fdisk -l

Now that the partition table has been taken care of, we can add the replaced disk to our array with the following command:

sudo mdadm --manage /dev/md0 --add /dev/sdb1

You should see output similar to the following:

mdadm: added /dev/sdb1

With this command, we are essentially adding the /dev/sdb1 partition to the RAID array designated as /dev/md0. With the last part, you want to make sure you're executing this command against the correct array designation. If you don't know what that is, you will see it in the output of the fdisk command we executed earlier.

Now, we should verify that the RAID array is rebuilding properly. We can check this with the same command we always use to check the RAID status:

cat /proc/mdstat

This will produce output like the following:

Figure 23.6: Checking the RAID status after replacing a disk

In Figure 23.6, you can see that the RAID array is in recovery mode. Recovery mode itself can take quite a while to complete, sometimes even overnight depending on how much data it needs to re-sync. This is why it's very important to replace a RAID disk as soon as possible. Once the recovery is complete, the RAID array is marked healthy and you can rest easy.
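
If you'd like to keep an eye on the rebuild without re-running that command by hand, the watch utility works nicely; the following refreshes the status every five seconds until you press Ctrl + C:

watch -n 5 cat /proc/mdstat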

A tool that is very valuable when working to recover physical servers is USB recovery media, such as flash drives with a bootable ISO image written to them. In the next section, we'll take a look.

Utilizing bootable recovery media

The concept of live media is a wonderful thing, as we can boot into a completely different working environment from the operating system installed on our device and perform tasks without disrupting installed software on the host system. The desktop version of Ubuntu, for example, offers a complete computing environment we can use in order to not only test hardware and troubleshoot our systems, but also to browse the web just as we would on an installed system. In terms of recovering from disasters, live media becomes a saving grace.

As administrators, we run into one problem after another. This gives us our job security. Computers often seem to have a mind of their own, failing when least expected (and seemingly on every holiday). Our servers and desktops can encounter a fault at any time, and live media allows us to separate hardware issues from software issues by troubleshooting from a known good working environment.

One of my favorites when it comes to live media is the desktop version of Ubuntu. Although geared primarily toward end users who wish to install Ubuntu on a laptop or desktop, as administrators we can use it to boot a machine that normally wouldn't boot, or even recover data from failed disks. For example, I've used the Ubuntu live media to recover data from both failed Windows and Linux systems, by booting the machine with the live media and utilizing a network connection to move data from the bad machine to a network share. Often, when a computer or server fails to boot, the data on its disk is still accessible. Assuming the disk wasn't encrypted during installation, you should have no problem accessing data on a server or workstation using live media such as Ubuntu's.
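
If you don't already have live media prepared, writing the Ubuntu ISO to a flash drive only takes a few minutes. A word of caution with the following sketch: the ISO filename is just an example, and /dev/sdX is a placeholder for your flash drive (confirm the device name with lsblk first), because dd will overwrite whatever device you point it at:

# Replace the ISO name and /dev/sdX with your actual file and flash drive
sudo dd if=ubuntu-20.04-desktop-amd64.iso of=/dev/sdX bs=4M status=progress
sync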

Sometimes, certain levels of failure require us to use different tools. While Ubuntu's live media is great, it doesn't work for absolutely everything. One situation is a failing disk. Often, you'll be able to recover data using Ubuntu's live media from a failing disk, but if it's too far gone, then the Ubuntu media will have difficulty accessing data from it as well. Tools such as Clonezilla specialize in working with hard disks and may be a better choice.

Live media can totally save the day. The Ubuntu live image in particular is a great boot disk to have available to you, as it gives you a very extensive environment you can use to troubleshoot systems and recover data.

One of the best aspects of using the Ubuntu live image is that you won't have to deal with the underlying operating system and software set at all; you can bypass both by booting into a known working desktop and then copy any important files from the drive right onto a network share. Another important feature of Ubuntu live media is the memory test option. Quite often, strange failures on a computer can be traced to defective memory. Other than simply letting you install Ubuntu, the live media is a Swiss Army knife of many tools you can use to recover a system from disaster. If nothing else, you can use live media to pinpoint whether a problem is software- or hardware-related. If a problem can only be reproduced in the installed environment but not in a live session, chances are a configuration problem is to blame. If a system also misbehaves in a live environment, it may help you identify a hardware issue. Either way, every good administrator should have live media available to troubleshoot systems and recover data when the need arises.

Summary

In this chapter, we looked at several ways in which we can prevent and recover from disasters. Having a sound prevention and recovery plan in place is an important key to managing servers efficiently. We need to ensure we have backups of our most important data ready for whenever servers fail, and we should also keep backups of our most important configurations. Ideally, we'll always have a warm site set up with preconfigured servers ready to go in a situation where our primary servers fail, but one of the benefits of open source software is that we have a plethora of tools available to us we can use to create a sound recovery plan. In this chapter, we looked at leveraging rsync as a useful utility for creating differential backups, and we also looked into setting up a Git server we can use for configuration management, which is also a crucial aspect of any sound prevention plan. We also talked about the importance of live media in diagnosing issues.

And with this chapter, this book comes to a close. Writing this book has been an extremely joyful experience. I was thrilled to write the first edition when Ubuntu 16.04 was in development, it was a fun project to write the second edition and update it to cover Ubuntu 18.04, and I'm even more thrilled to have had the opportunity to update this body of work for 20.04 in this latest edition. I'd like to thank each and every one of you, my readers, for taking the time to read this book. In addition, I would like to thank all of the viewers of my YouTube channel, LearnLinux.tv, because I probably wouldn't have had the opportunity to write this in the first place had it not been for my viewers helping make my channel so popular.

I'd also like to thank Packt Publishing for giving me the opportunity to write a book about one of my favorite technologies. Writing this book was definitely an honor. When I first started with Linux in 2002, I never thought I'd actually be an author, teaching the next generation of Linux administrators the tricks of the trade. I wish each of you the best of luck, and I hope this book is beneficial to you and your career.

For additional content, be sure to check out https://learnlinux.tv. I have quite a few training videos available there, free of charge.

Further reading

Share your experience

Thank you for taking the time to read this book. If you enjoyed this book, help others to find it. Leave a review at https://www.amazon.com/dp/1800564643
