Recovering from a NameNode failure

The NameNode is the single most important Hadoop service. It maintains the locations of all of the data blocks in the cluster as well as the state of the distributed filesystem. When a NameNode fails, it is possible to recover from a previous checkpoint generated by the Secondary NameNode. It is important to note that the Secondary NameNode is not a backup for the NameNode; it merely performs a checkpoint process periodically, so the recovered data is almost certainly stale. However, recovering a NameNode with an old filesystem state is better than not being able to recover at all.

Getting ready

It is assumed that the system hosting the NameNode service has failed and that the Secondary NameNode is running on a separate machine. In addition, the fs.checkpoint.dir property should have been set in the core-site.xml file. This property tells the Secondary NameNode where to save its checkpoints on the local filesystem.
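
For reference, a minimal entry in core-site.xml on the Secondary NameNode might look like the following; the /path/to/checkpoint directory is a placeholder, so use whatever local path your cluster is actually configured with:

<property>
    <name>fs.checkpoint.dir</name>
    <value>/path/to/checkpoint</value>
</property>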

How to do it...

Carry out the following steps to recover from a NameNode failure:

  1. Stop the Secondary NameNode:
    $ cd /path/to/hadoop
    $ bin/hadoop-daemon.sh stop secondarynamenode
  2. Bring up a new machine to act as the new NameNode. This machine should have Hadoop installed and configured in the same way as the failed NameNode, with password-less SSH login set up. In addition, it should have the same IP address and hostname as the previous NameNode.
  3. Copy the contents of fs.checkpoint.dir on the Secondary NameNode to the dfs.name.dir folder on the new NameNode machine (see the sample commands after this list).
  4. Start the new NameNode on the new machine:
    $ bin/hadoop-daemon.sh start namenode
  5. Start the Secondary NameNode on the Secondary NameNode machine:
    $ bin/hadoop-daemon.sh start secondarynamenode
  6. Verify that the NameNode started successfully by looking at the NameNode status page at http://head:50070/. You can also check the filesystem health from the command line, as shown after this list.
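
The copy in step 3 and the check in step 6 can also be done from the shell. The following is only a sketch: checkpoint.example.com stands in for your Secondary NameNode host, /path/to/checkpoint for its fs.checkpoint.dir, and /path/to/hadoop/cache/hadoop/dfs for dfs.name.dir on the new NameNode.

# Run on the new NameNode: pull the latest checkpoint from the Secondary NameNode
$ scp -r checkpoint.example.com:/path/to/checkpoint/* /path/to/hadoop/cache/hadoop/dfs/

# After starting the NameNode, confirm that HDFS is up and healthy
$ cd /path/to/hadoop
$ bin/hadoop dfsadmin -report
$ bin/hadoop fsck /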

How it works...

We first logged into the Secondary NameNode machine and stopped the service. We then set up a new machine in exactly the same manner as the failed NameNode and copied all of the checkpoint and edits files from the Secondary NameNode to it. This allowed us to recover the filesystem metadata and edits as of the last checkpoint. Finally, we started the new NameNode and the Secondary NameNode.

There's more...

Recovering from stale checkpoint data is unacceptable for certain processing environments. Instead, another option is to set up some type of offsite storage where the NameNode can write its image and edits files. This way, if there is a hardware failure of the NameNode, you can recover the latest filesystem state without resorting to restoring old data from the Secondary NameNode checkpoint.

The first step is to designate a machine to hold the NameNode image and edits file backups. Next, mount the backup machine's storage on the NameNode server. Finally, modify the hdfs-site.xml file on the server running the NameNode so that it writes to both the local filesystem and the backup mount:

$ cd /path/to/hadoop
$ vi conf/hdfs-site.xml
<property>
    <name>dfs.name.dir</name>
    <value>/path/to/hadoop/cache/hadoop/dfs,/path/to/backup</value>
</property>

Now the NameNode will write all of the filesystem metadata to both /path/to/hadoop/cache/hadoop/dfs and the mounted /path/to/backup folders.
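
As a rough sketch of the mounting step, assuming the backup machine exports a directory over NFS (backup.example.com and the export path below are placeholders for your own hosts and paths):

# Run on the NameNode server: mount the backup machine's exported directory
$ sudo mkdir -p /path/to/backup
$ sudo mount -t nfs backup.example.com:/exports/namenode-backup /path/to/backup

Keep in mind that the NameNode writes its metadata synchronously to every directory listed in dfs.name.dir, so the backup mount should be reliable and reasonably fast; a storage directory that becomes unavailable is dropped from the active list until the NameNode is restarted.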
