Understanding HDFS backups

Data volumes in Hadoop clusters range from terabytes to petabytes, so deciding which data to back up from such clusters is an important decision. A disaster recovery plan for Hadoop clusters needs to be formulated right at the cluster planning stage. The organization needs to identify the datasets it wants to back up and plan backup storage requirements accordingly.

Backup schedules also need to be considered when designing a backup solution. The larger the amount of data to be backed up, the more time-consuming the activity. It is more efficient to perform backups during a window when there is the least activity on the cluster. This not only helps the backup commands run efficiently, but also helps ensure the consistency of the datasets being backed up. Knowing the schedules of data ingestion into HDFS in advance helps you better plan and schedule backup jobs for Hadoop clusters.
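The following is a minimal sketch of how such an off-peak backup job might look, assuming Python and the standard hadoop distcp tool are available on the cluster; the namenode hostnames and dataset paths are placeholders, not values from this text. The script is intended to be launched by a scheduler (for example, cron) inside the chosen low-activity window.

#!/usr/bin/env python3
"""Sketch of an off-peak HDFS backup job using DistCp (assumed setup)."""
import subprocess
import sys

# Datasets identified for backup during cluster planning (placeholder paths).
DATASETS = ["/data/warehouse", "/data/logs"]
SOURCE_NN = "hdfs://prod-nn:8020"      # production NameNode (assumed hostname)
BACKUP_NN = "hdfs://backup-nn:8020"    # backup cluster NameNode (assumed hostname)

def backup(dataset: str) -> int:
    """Run DistCp for one dataset; -update copies only new or changed files."""
    cmd = [
        "hadoop", "distcp",
        "-update",
        f"{SOURCE_NN}{dataset}",
        f"{BACKUP_NN}{dataset}",
    ]
    return subprocess.call(cmd)

if __name__ == "__main__":
    failures = [d for d in DATASETS if backup(d) != 0]
    sys.exit(1 if failures else 0)

Running DistCp with -update keeps the backup incremental, which shortens the time the job needs inside the low-activity window.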

The following are some of the important data sources that need to be protected against data loss:

  • The namenode metadata: The namenode metadata contains the locations of all the files in HDFS (a metadata backup sketch follows this list).
  • The Hive metastore: The Hive metastore contains the metadata for all Hive tables and partitions.
  • HBase RegionServer data: This contains the data of all the HBase regions served by the RegionServers.
  • Application configuration files: These are the important configuration files required to configure Apache Hadoop, for example, core-site.xml, yarn-site.xml, and hdfs-site.xml.

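As an illustration of the first item, the following sketch backs up the namenode metadata by downloading the latest fsimage checkpoint with hdfs dfsadmin -fetchImage; the local backup directory is a placeholder, and the script assumes it runs on a host where the hdfs command is configured for the cluster.

#!/usr/bin/env python3
"""Sketch of a namenode metadata (fsimage) backup (assumed setup)."""
import subprocess
from datetime import datetime
from pathlib import Path

# Local directory for fsimage copies, one subdirectory per day (assumed path).
BACKUP_DIR = Path("/backups/namenode") / datetime.now().strftime("%Y%m%d")

def fetch_fsimage() -> None:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    # Downloads the most recent fsimage checkpoint from the active NameNode.
    subprocess.check_call(["hdfs", "dfsadmin", "-fetchImage", str(BACKUP_DIR)])

if __name__ == "__main__":
    fetch_fsimage()

Copying the fetched image to storage outside the cluster protects the filesystem namespace even if the namenode host itself is lost.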
Data protection in Hadoop clusters is important because clusters are prone to data corruption, hardware failures, and accidental data deletion. In rare cases, a data center catastrophe could also lead to complete data loss.
