Chapter 9. Configuring Backups

As Hadoop clusters mature, the data residing in them grows, and maintaining a copy of that data becomes an important responsibility of a Hadoop administrator. Backing up data from a distributed environment is a challenge because of its ever-increasing volume. Setting up backup operations is an important step towards being able to restore data after a complete cluster failure. This chapter discusses the various backup and data protection options and covers the following topics:

  • Understanding backups
  • Understanding HDFS backups
  • Using the distributed copy (DistCp)
  • Configuring backups using Cloudera Manager

Understanding backups

Systems that deal with information or data, whether standalone or distributed, have to plan for disasters and complete system failures. Configuring and setting up backup policies for systems is an integral part of any disaster recovery plan. A backup is a copy of the data in use, which is also sometimes referred to as an archive. In the case of failures leading to data loss, the copy is used to restore data.

Backups, depending on the number of days they are retained, demand vast amounts of data storage. However, for the storage media used for backups, read and write speeds are not of much importance. Organizations procure more reliable storage media and do not worry about read/write speeds, as data on backup storage media is read only in the case of a disaster.
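To make the relationship between retention and storage concrete, the following is a minimal sketch of the arithmetic. It assumes the simplest possible policy, in which every retained backup is a complete, uncompressed copy of the data; the function name and the figures used are hypothetical and are not taken from any particular tool.

```python
def retained_backup_size_tb(dataset_tb, copies_per_day, retention_days):
    """Storage needed (in TB) when every retained backup is a complete copy.

    Assumption: no compression, no deduplication, and a dataset size that
    stays constant over the retention period.
    """
    return dataset_tb * copies_per_day * retention_days


# Hypothetical example: a 50 TB dataset, one copy per day, kept for 30 days.
print(retained_backup_size_tb(50, 1, 30))  # 1500
```

Under these assumptions the backup store has to be 30 times the size of the dataset, which is why the number of retention days has such a large effect on storage planning.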

Types of backups

There are several types of backup policies to consider before including one in the disaster recovery plan of your organization. The choice depends on the organization's requirements for how it wants to manage its data. The following are a few of the different types of backups:

  • The full backup: A full backup involves archiving all data from the source location to the target backup location. Almost all backup solutions start with a full backup and subsequently use the other backup methods. Recurring full backups are practical only for smaller amounts of data. Performing a full backup on every schedule for large volumes of data is not advisable, as such backups can be very time-consuming and demand more storage space.
  • The incremental backup: Incremental backups involve archiving only the changes made to the data since the last backup of any type. The first incremental backup is preceded by a full backup; subsequently, when data is changed or added, only the changed or added data is backed up. Restoring from incremental backups can be slower, as it involves first restoring the initial full backup and then applying every incremental backup on top of it, in order.
  • The differential backup: The differential backup involves archiving only the changes made to the data since the last full backup. The important term to note here is full backup; this is what sets differential backups apart from incremental backups. Restoring from differential backups is faster than restoring from incremental backups, because only the last full backup and the most recent differential backup need to be applied.
  • The mirror backup: Mirror backups involve the duplication of every operation in the source location to the target location. So in this case, when data is deleted in the source, it is also deleted from the target backup location, maintaining a mirror image of the original data. Using this type of backup could result in loss of data from the backup location in the case of an accidental deletion from the source location.
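The difference between full, incremental, and differential backups comes down to which reference point is used to decide what has changed. The following is a minimal sketch of that selection logic, together with the sequence of backups a restore would need. All names here (FileInfo, select_for_backup, restore_chain) and the use of simple day numbers as timestamps are assumptions made for illustration; they do not correspond to any Hadoop or backup tool API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    path: str
    modified: int  # hypothetical timestamp, expressed as a day number


def select_for_backup(files, kind, last_full, last_backup):
    """Return the paths that a backup of the given kind would copy.

    last_full   -- day of the most recent full backup
    last_backup -- day of the most recent backup of any type
    """
    if kind == "full":
        return [f.path for f in files]
    if kind == "incremental":
        cutoff = last_backup   # changes since the last backup of any type
    elif kind == "differential":
        cutoff = last_full     # changes since the last full backup
    else:
        raise ValueError(f"unknown backup kind: {kind}")
    return [f.path for f in files if f.modified > cutoff]


def restore_chain(kind, backups_since_full):
    """Return the backups, in order, that a restore has to apply."""
    if backups_since_full == 0:
        return ["full"]
    if kind == "incremental":
        # The full backup plus every incremental taken after it.
        return ["full"] + [f"inc-{i}" for i in range(1, backups_since_full + 1)]
    if kind == "differential":
        # The full backup plus only the latest differential.
        return ["full", f"diff-{backups_since_full}"]
    raise ValueError(f"unknown backup kind: {kind}")


files = [FileInfo("a", 0), FileInfo("b", 2), FileInfo("c", 5)]
# Full backup on day 1, most recent backup of any type on day 4.
print(select_for_backup(files, "incremental", last_full=1, last_backup=4))   # ['c']
print(select_for_backup(files, "differential", last_full=1, last_backup=4))  # ['b', 'c']
print(restore_chain("incremental", 3))   # ['full', 'inc-1', 'inc-2', 'inc-3']
print(restore_chain("differential", 3))  # ['full', 'diff-3']
```

In this sketch the differential backup copies more data on each run than the incremental one, but its restore chain always has at most two entries, which reflects the trade-off between backup size and restore speed described above.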

Types of storage media for backups

There are different types of storage media that can be used for backups. The following is a list of commonly used storage media:

  • Hard disks: Hard disks are the most commonly used storage media for backups, as the cost per byte has come down over the years. They are available in a range of speeds and capacities, and some are designed specifically for backup use. It is important to note that hard disks are not particularly reliable for data storage with retention periods that span several years.
  • Optical storage: Removable and portable media such as recordable compact discs (CD), digital video discs (DVD), and Blu-ray Discs (BD) are also being used for backups. These are particularly used to store small amounts of data and are not used to back up data from environments such as large data clusters.
  • Tape drives: Tape is probably the oldest form of backup storage media still in use today in many organizations, mainly because of the low cost per byte it offers. However, this is slowly changing as hard disks are increasingly used for backup storage.

Using cloud services for backups

In recent years, several cloud services have emerged that provide storage space in the cloud to back up an organization's critical data. This eliminates the need for organizations to set up a separate physical site to back up their data. Services such as Amazon's AWS provide storage as a service that can be accessed over the Internet. They offer several disaster recovery architectures that make it easy to set up a backup site in the cloud. More information on Amazon's offering can be found at http://aws.amazon.com/disaster-recovery/.
