Implementing HDFS High Availability

Setting up a cluster is just one of the responsibilities of a Hadoop administrator. Once the cluster is up and running, the administrator needs to keep the environment stable and handle downtime efficiently. Hadoop, being a distributed system, is not only prone to failures, but is expected to fail. The master nodes, such as the namenode and jobtracker, are single points of failure. A single point of failure (SPOF) is a component whose failure renders the whole cluster nonfunctional. Having a system in place to handle these single point failures is a must. In this section, we will explore techniques for handling namenode failures by configuring HDFS high availability (HA).

The namenode stores the location information for all the files in a cluster and coordinates access to the data. If the namenode goes down, the cluster is unusable until the namenode is brought back up. Maintenance windows to upgrade hardware or software on the namenode can also cause downtime. The secondary namenode, as we have already discussed, is a checkpoint service and does not provide automatic failover for the namenode.

The time taken to bring the namenode back online depends on the type of failure (hardware or software). The downtime could result in Service Level Agreement (SLA) slippage as well as lost productivity for the data team. To address these issues and make the namenode more available, namenode HA was built and integrated into Apache Hadoop 2.0.0.

CDH5 comes with HDFS HA built in. HDFS HA is achieved by running two namenodes for a cluster in an active/passive configuration, where only one namenode is active at a time. When the active namenode becomes unavailable, the passive namenode takes over as the active namenode. In this configuration, the two namenodes run on two physically different machines.

The two namenodes in an active/passive configuration are referred to as the active namenode and the standby namenode respectively. Both namenodes should run on similar hardware.

Using CDH5, high availability can be configured using the following two techniques:

  • Quorum-based storage
  • Shared storage using NFS

The Quorum-based storage

In the Quorum-based storage HA technique, the two namenodes share edit logs through a Quorum Journal Manager (QJM). As we already know, the edit log is the activity log of all file operations on the cluster, and the active and standby namenodes need to keep their edit logs in sync. To achieve this, the two namenodes communicate through JournalNode (JN) daemons. The active namenode reports every change to its edit log to the JournalNode daemons, and the standby namenode reads the edits from the JournalNode daemons and applies the changes to its own namespace. By doing this, the two namenodes stay in sync.

In the following diagram, you see a typical architecture of HDFS HA using QJM. There are two namenodes, active and standby, that communicate with each other via JournalNodes.

[Figure: HDFS HA architecture using the Quorum Journal Manager]

The JournalNodes are daemons that run on a dedicated set of machines. It is advisable to run at least three JournalNode daemons so that the edit logs are written to three different locations, allowing the system to tolerate the failure of a single machine. In general, the system can tolerate the failure of (N-1)/2 JournalNodes, where N is the number of JournalNodes configured. For example, with five JournalNodes, the system can tolerate (5-1)/2 = 2 failures.

At any time, only the active namenode can write to the JournalNodes. The JournalNodes enforce this to prevent the shared edit logs from being updated by two namenodes at once, which could cause data loss as well as incorrect results. The standby namenode only has read access to the JournalNodes, which it uses to update its own namespace.

When HA is configured, the secondary namenode is not used. As you will recall, the secondary namenode is a checkpoint service that periodically checkpoints the edit logs of the primary namenode and updates the fsimage. In an HA environment, the standby namenode performs these checkpoints, so a secondary namenode is not needed.

The datanodes of the cluster send block location and heartbeat information to both the active and the standby namenode, making a quick failover possible.

When the active namenode fails, the standby namenode assumes the responsibility of writing to the JournalNodes and takes over the active role.

Configuring HDFS high availability by the Quorum-based storage

There are two types of failover configurations: manual failover, which involves manually initiating the failover commands, and automatic failover, which requires no manual intervention.

To configure HDFS HA, a new identifier, the NamenodeID, is used to distinguish each namenode in the cluster.

Let's look at the properties defined in hdfs-site.xml configuration file to set up HDFS HA:

  • dfs.nameservices: Just as we did for HDFS Federation, we need to configure the nameservices in use for the cluster. For this configuration, I am not using a federated HDFS. The following is a sample configuration entry for dfs.nameservices:
    <property>
      <name>dfs.nameservices</name>
      <value>hcluster</value>
    </property>
  • dfs.ha.namenodes.[NameserviceID]: This property defines the unique identifiers for the namenodes in the cluster. The IDs mentioned as values for this property are also referred to as NamenodeID. The following is a sample configuration entry for dfs.ha.namenodes:
    <property>
      <name>dfs.ha.namenodes.hcluster</name>
      <value>nn1,nn2</value>
    </property>
  • dfs.namenode.rpc-address.[NameserviceID].[NamenodeID]: This property defines the fully qualified RPC address for each configured namenode in the cluster. The following is a sample configuration entry for dfs.namenode.rpc-address:
    <property>
      <name>dfs.namenode.rpc-address.hcluster.nn1</name>
      <value>node1.hcluster:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.hcluster.nn2</name>
      <value>node2.hcluster:8020</value>
    </property>
  • dfs.namenode.http-address.[NameserviceID].[NamenodeID]: This property defines the fully qualified HTTP address for each namenode in the cluster. The following is a sample configuration entry for dfs.namenode.http-address:
    <property>
      <name>dfs.namenode.http-address.hcluster.nn1</name>
      <value>node1.hcluster:50070</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.hcluster.nn2</name>
      <value>node2.hcluster:50070</value>
    </property>
  • dfs.namenode.shared.edits.dir: This property defines the URI that identifies the group of JournalNodes to which the namenodes write edits and from which they read them. The value is a semicolon-separated list of JournalNode addresses; the active namenode writes the edit logs to these locations, and the standby namenode subsequently reads them from there. The following is a sample configuration entry for dfs.namenode.shared.edits.dir:
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://node3.hcluster:8485;node4.hcluster:8485;node5.hcluster:8485/hcluster</value>
    </property>
  • dfs.client.failover.proxy.provider.[NameserviceID]: This property defines the Java class that HDFS clients use to determine which namenode is currently active, and therefore which namenode to contact. Developers can write custom classes for this property. The default class that comes with Hadoop is ConfiguredFailoverProxyProvider. The following is a sample configuration entry for dfs.client.failover.proxy.provider:
    <property>
      <name>dfs.client.failover.proxy.provider.hcluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
  • fs.defaultFS: This property defines the default path prefix used by the Hadoop FS client when none is given. This property is defined in the core-site.xml file. The following is a sample configuration entry for fs.defaultFS:
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://hcluster</value>
    </property>
  • dfs.journalnode.edits.dir: This is the absolute path of the location where the edits and other local state files are stored on the machines running the JournalNode service. The following is a sample configuration entry for dfs.journalnode.edits.dir:
    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/tmp/jnode</value>
    </property>
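
Once these properties are in place, you can spot-check that they are being picked up using the hdfs getconf command; the following quick check assumes the sample values used above:

$ hdfs getconf -confKey dfs.nameservices
hcluster
$ hdfs getconf -confKey dfs.ha.namenodes.hcluster
nn1,nn2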

Apart from the preceding properties, there are properties meant for fencing. Fencing is a way to ensure that only one namenode can write to the JournalNodes at any time. However, when a failover is initiated, the previously active namenode may continue to serve client requests until it shuts down, which happens when it next tries to write to the JournalNodes. Using fencing methods, this namenode can be shut down as soon as the failover is initiated.

The following are two fencing methods:

  • sshfence
  • shell

The fencing method to be used is defined by the dfs.ha.fencing.methods property, which is set in the hdfs-site.xml file.

The sshfence option provides a way to SSH into the target node and use the fuser command to kill the process listening on the namenode's TCP port; in other words, it kills the previously active namenode. To perform this action, SSH needs to work without a passphrase, so the dfs.ha.fencing.ssh.private-key-files property needs to be configured. The following is a sample configuration entry to set up fencing:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/admin/.ssh/id_rsa</value>
</property>

In the preceding configuration, we are using the sshfence option, and the private key of the user admin is used to SSH without a passphrase.
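
If passphraseless SSH has not been set up for this user yet, a minimal sketch of the key setup follows; the admin account and the node2.hcluster target are taken from our sample configuration, and the same setup should be repeated in the opposite direction so that either namenode can fence the other:

$ ssh-keygen -t rsa -N "" -f /home/admin/.ssh/id_rsa
$ ssh-copy-id admin@node2.hcluster
$ ssh admin@node2.hcluster hostname

The last command should print the hostname without prompting for a passphrase.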

Another way to configure sshfence would be to use a nonstandard username and port to connect via SSH as shown in the following code:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence([[username][:port]])</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>

A timeout property can be configured to time out the SSH session. If the SSH times out, the fencing mechanism is considered to have failed.

The shell fencing option provides a way to run arbitrary shell commands to fence the active namenode as shown in the following code:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh --namenode=$target_host --nameservice=$target_nameserviceid)</value>
</property>

The following is the list of variables available to the fencing command, each referring to the node that needs to be fenced:

Variable                  Description
$target_host              The hostname of the node to be fenced
$target_port              The IPC port of the node to be fenced
$target_address           The combination of the preceding two, in the form <host:port>
$target_nameserviceid     The NameserviceID of the namenode to be fenced
$target_namenodeid        The NamenodeID of the namenode to be fenced
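
As an illustration, the following is a hypothetical fencing script that consumes these variables; the script path matches the sample configuration above, but the SSH-based shutdown strategy and the admin user are assumptions, not a recommended production fence:

#!/bin/bash
# /path/to/my/script.sh -- hypothetical fencing script (sketch only).
# The failover controller expands the dfs.ha.fencing.methods value before
# running it, so $1 arrives as, for example, --namenode=node1.hcluster.
host="${1#--namenode=}"          # strip the --namenode= prefix
echo "Fencing the namenode on ${host}" >&2
# Stop the namenode service on the node being fenced; assumes
# passphraseless SSH and sudo rights for the admin user.
ssh "admin@${host}" "sudo service hadoop-hdfs-namenode stop"
exit $?                          # 0 signals a successful fence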

If a fencing method returns 0, the fencing operation is considered successful. Although fencing is not mandatory for HDFS HA using the Quorum Journal Manager, it is highly recommended. If no fencing is configured, a default value still needs to be set for the property, as shown in the following code:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>

Once the configuration files are updated, copy the configurations to all the nodes in the cluster.
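
No particular distribution mechanism is prescribed; a simple sketch using scp is shown below, assuming passwordless SSH to each node and the sample hostnames used in this chapter (/etc/hadoop/conf is the default CDH configuration path):

$ for node in node2.hcluster node3.hcluster node4.hcluster node5.hcluster; do scp /etc/hadoop/conf/*-site.xml $node:/etc/hadoop/conf/; done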

Shared storage using NFS

In this approach, the active and standby namenodes need access to common shared storage, such as a Network File System (NFS) mount. The active namenode logs a copy of its namespace modifications to the shared storage. The standby namenode reads these modifications, applies them to its own namespace, and thereby keeps its namespace in sync with the active namenode. In the event of a failover, the standby namenode waits until all the operations from the shared edit log have been applied to its namespace before transitioning to the active state. All datanodes are configured to send block information and heartbeats to both namenodes. A fencing method needs to be deployed in this approach to make sure that the previously active namenode is shut down before the standby namenode becomes active.

The hardware for the namenode machines in this architecture should be equivalent, and comparable to the hardware setup of a non-HA namenode. The shared storage should be accessible by both namenodes and should be configured for redundancy in disk, network, and power to handle failures. The access paths and hardware of the shared storage are critical to this architecture and should be of high quality with multiple network access paths; this redundancy prevents the NFS (shared storage) from becoming a single point of failure itself.

Configuring HDFS high availability by shared storage using NFS

Almost all the configuration parameters in the hdfs-site.xml configuration file are the same as for Quorum-based storage. However, we need to update the following property to set up HDFS HA using NFS:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/shared_storage</value>
</property>

Here, the dfs.namenode.shared.edits.dir property points to a shared directory that has been mounted locally.
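
Both namenodes must mount the same NFS export at this path before HDFS is started. A minimal sketch follows; the server name nfs.hcluster and the export path /exports/hdfs-edits are assumptions for illustration:

$ sudo mkdir -p /mnt/shared_storage
$ sudo mount -t nfs nfs.hcluster:/exports/hdfs-edits /mnt/shared_storage

To make the mount persistent across reboots, add a matching entry to /etc/fstab on both namenode machines.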

Once you have decided on and configured the desired method for HDFS HA (Quorum-based storage or shared storage), you need to perform the following steps:

  1. Stop all Hadoop daemons as hduser using the following command on every node:
    $ for x in `cd /etc/init.d; ls hadoop*`; do sudo service $x stop; done
    
  2. Install the namenode package as hduser on the node you want to configure as the standby namenode. As per our configuration, it is node2.hcluster:
    $ sudo yum install hadoop-hdfs-namenode
    
  3. If you are using the Quorum-based storage approach, install the JournalNode package as hduser on the nodes you want to use as JournalNodes. As per our configuration, we need to install the JournalNode package on node3.hcluster, node4.hcluster, and node5.hcluster:
    $ sudo yum install hadoop-hdfs-journalnode
    

    This step can be skipped if you are using the Shared Storage approach.

  4. Start the JournalNode daemon using the following command on all the nodes where it will run:
    $ sudo service hadoop-hdfs-journalnode start
    

    In our configuration, the nodes are node3.hcluster, node4.hcluster, and node5.hcluster.

  5. Next, go to the primary namenode and execute the following command as hduser to initialize the shared edits directory from the local namenode edits directory:
    $ sudo -u hdfs hdfs namenode -initializeSharedEdits
    
  6. Next, start the primary namenode as hduser using the following command:
    $ sudo service hadoop-hdfs-namenode start
    
  7. Bootstrap and start the standby namenode as hduser using the following commands:
    $ sudo -u hdfs hdfs namenode -bootstrapStandby
    $ sudo service hadoop-hdfs-namenode start
    
  8. Restart all Hadoop daemons as hduser on all the nodes using the following command:
    $ for x in `cd /etc/init.d; ls hadoop*`; do sudo service $x start; done
    

The preceding steps should start the namenodes on node1.hcluster and node2.hcluster along with the other Hadoop daemons. When a namenode starts, it is initially in the standby state.
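
Before working with the administrative commands described next, you may want to confirm that the daemons actually came up on each node; the init scripts installed earlier support a status action:

$ sudo service hadoop-hdfs-namenode status
$ sudo service hadoop-hdfs-journalnode status

Run the first command on node1.hcluster and node2.hcluster, and the second on node3.hcluster, node4.hcluster, and node5.hcluster.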

Use the hdfs haadmin command to perform the various administrative operations for HDFS HA. To see all the options available with this command, use hdfs haadmin -help as shown in the following code:

$ hdfs haadmin -help
Usage: DFSHAAdmin [-ns <nameserviceId>]
    [-transitionToActive <serviceId>]
    [-transitionToStandby <serviceId>]
    [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
    [-getServiceState <serviceId>]
    [-checkHealth <serviceId>]
    [-help <command>]

The following is the description of each option available for the hdfs haadmin command:

  • transitionToActive: This option is used when you want to change the state of a standby namenode to active.
  • transitionToStandby: This option is used when you want to change the state of an active namenode to standby.

    The preceding two options are not usually used on production systems as they do not perform any fencing.

  • failover: This option is used to perform a failover between the two namenodes. Using this option, the administrator can make the standby namenode active. For our configuration, we can use the following command to fail over from nn2 to nn1, setting the namenode on node1.hcluster to active:
    $ sudo -u hdfs hdfs haadmin -failover --forceactive nn2 nn1
    

    After the namenode enters the standby state, it starts checkpointing the fsimage of the active namenode. The fencing configuration takes effect in case of any failures during failover.

  • getServiceState: This option is used to print the current status of the namenode. To print the status of the namenodes we have configured, you can use the following commands:
    $ sudo -u hdfs hdfs haadmin -getServiceState nn1
    $ sudo -u hdfs hdfs haadmin -getServiceState nn2
    
  • checkHealth: This option is used to check the health of the specified namenode. A return value of 0 indicates that the namenode is healthy; a nonzero value indicates that it is unhealthy. In the current implementation, this option reports an unhealthy status only if the namenode is down.
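
Since checkHealth communicates its result through the exit code rather than printed output, you can inspect $? right after running it; a 0 is expected here, assuming nn1 is healthy:

$ sudo -u hdfs hdfs haadmin -checkHealth nn1
$ echo $?
0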

The following screenshot shows the summary section of the active namenode:

[Screenshot: Summary section of the active namenode]

The following screenshot shows the summary section of the standby namenode:

[Screenshot: Summary section of the standby namenode]

NameNode Journal Status for Quorum-based storage approach

In the Quorum-based storage approach, only the active namenode will be allowed to perform write operations to the JournalNodes. As shown in the following screenshot, this information is displayed under the NameNode Journal Status section:

[Screenshot: NameNode Journal Status of the active namenode (Quorum-based storage)]

The following screenshot shows the NameNode Journal Status of the standby namenode. In the Quorum-based storage approach, the standby namenode is only allowed to perform read operations on the JournalNodes, and this is also displayed under the NameNode Journal Status section:

[Screenshot: NameNode Journal Status of the standby namenode (Quorum-based storage)]

Using the preceding steps, an administrator can perform a manual transition from one namenode to the other.

NameNode Journal Status for the Shared Storage-based approach

The following screenshot shows the NameNode Journal Status for the standby namenode in the Shared Storage-based approach. As you can see, the standby namenode is only allowed to read from the shared storage:

[Screenshot: NameNode Journal Status of the standby namenode (shared storage)]

As you can see in the following screenshot, the active namenode in a Shared Storage configuration is allowed to write to the common shared location:

[Screenshot: NameNode Journal Status of the active namenode (shared storage)]

Configuring automatic failover for HDFS high availability

To perform an automatic failover, where no manual intervention is required, we need to use Apache ZooKeeper. As you saw in Chapter 3, Cloudera's Distribution Including Apache Hadoop, Apache ZooKeeper is a distributed coordination service.

To configure automatic failover, the following two additional components are installed on an HDFS deployment:

  • ZooKeeper Quorum
  • ZK Failover Controller (ZKFC)

The ZooKeeper service is responsible for the following two operations:

  • Failure detection: The ZooKeeper service maintains a persistent session with each namenode in the cluster. If the active namenode crashes, its session expires, notifying the other namenode that a failover needs to be initiated.
  • Active NameNode election: ZooKeeper implements a leader-election mechanism, which is used to elect a new active namenode whenever the current active namenode crashes.

The ZKFC component is a ZooKeeper client that helps in monitoring and managing the state of the namenode. A ZKFC service runs on each machine that runs a namenode.

The ZKFC service is responsible for the following operations:

  • Health monitoring: The ZKFC periodically pings its local namenode and expects a timely response. If the namenode fails to respond consistently, the ZKFC considers it to be unhealthy or down.
  • ZooKeeper session management: The active namenode maintains a persistent session in ZooKeeper, indicating that it is active and healthy. Along with the session information, the active namenode also holds a special znode known as the "lock" znode. As soon as the active namenode fails, the session expires and the lock znode is removed, signaling that the active namenode has failed.
  • ZooKeeper-based election: The ZKFC constantly checks whether its local namenode is healthy and whether the lock znode is held by any other node. If no other node holds the lock, the ZKFC tries to acquire it and, on success, initiates a failover to make its local namenode the active namenode.

The ZooKeeper daemons typically run on three or five nodes (the number of nodes should be odd) and can be collocated with the active and standby namenodes.

Perform the following steps as user hduser to configure ZooKeeper for automatic failover:

  1. Shut down the entire cluster before configuring ZooKeeper, using the following command on every node:
    $ for x in `cd /etc/init.d; ls hadoop*`; do sudo service $x stop; done
    
  2. Install ZooKeeper on all the nodes that need to be used as ZooKeeper nodes using the following command:
    $ sudo yum install zookeeper-server
    
  3. Initialize and start the ZooKeeper service using the following commands. Note that each node in the ZooKeeper ensemble must be initialized with a unique --myid value:
    $ sudo service zookeeper-server init --myid=1 --force
    $ sudo service zookeeper-server start
    
  4. Install the ZKFC package on all the nodes that host the namenodes using the following command:
    $ sudo yum install hadoop-hdfs-zkfc
    
  5. Update the hdfs-site.xml file to include the following property and copy it to all the nodes:
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>
  6. Update the core-site.xml to include the following property and copy it to all the nodes:
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>node1.hcluster:2181,node2.hcluster:2181,node3.hcluster:2181</value>
    </property>
  7. Initialize the High Availability (HA) state in ZooKeeper using the following command from one of the namenodes:
    $ hdfs zkfc -formatZK
    

    This command creates a znode in ZooKeeper, which is used by the automatic failover system to store its data.

  8. Restart all Hadoop daemons on all the nodes using the following command:
    $ for x in `cd /etc/init.d; ls hadoop*`; do sudo service $x start; done
    
  9. Start the ZooKeeper failover controller on the machines hosting the namenodes using the following command:
    $ sudo service hadoop-hdfs-zkfc start
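
To see the election at work, you can inspect ZooKeeper directly using the zookeeper-client shell that ships with the ZooKeeper package; the znode path below assumes the default ha.zookeeper.parent-znode of /hadoop-ha and our hcluster nameservice:

$ zookeeper-client -server node1.hcluster:2181
ls /hadoop-ha/hcluster
get /hadoop-ha/hcluster/ActiveStandbyElectorLock

The ActiveStandbyElectorLock znode is the lock znode described earlier; its data identifies the namenode that currently holds the active role.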
    

Your cluster is now configured for automatic failover. You can test this configuration by manually killing the active namenode with kill -9 to see whether the failover occurs.
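
A sketch of such a test follows, assuming nn1 on node1.hcluster is currently active; the pgrep pattern targets the NameNode JVM and may need adjusting for your environment:

$ sudo -u hdfs hdfs haadmin -getServiceState nn1
active
$ sudo kill -9 $(pgrep -f 'namenode.NameNode')
$ sudo -u hdfs hdfs haadmin -getServiceState nn2
active

The second getServiceState call may take a few seconds to report active while the ZKFC on node2.hcluster detects the expired session and completes the election.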
