Setting up a cluster is just one of the responsibilities of a Hadoop administrator. Once the cluster is up and running, the administrator needs to keep the environment stable and handle downtime efficiently. Hadoop, being a distributed system, is not only prone to failures, but is expected to fail. The master nodes, such as the namenode and jobtracker, are single points of failure. A single point of failure (SPOF) is a component whose failure renders the whole cluster nonfunctional. Having a mechanism to handle these single point failures is a must. We will explore techniques to handle namenode failures by configuring HDFS HA (high availability).
The namenode stores all the location information of the files in a cluster and coordinates access to the data. If the namenode goes down, the cluster is unusable until the namenode is brought up. Maintenance windows to upgrade hardware or software on the namenode could also cause downtime. The secondary namenode, as we have already discussed, is a checkpoint service and does not support automatic failover for the namenode.
The time taken to bring back the namenode online depends on the type of failure (hardware and software). The downtime could result in Service Level Agreement (SLA) slippage as well as the productivity of the data team. To handle such issues and make the namenode more available, namenode HA was built and integrated into Apache Hadoop 2.0.0.
CDH5 comes with HDFS HA built in. HDFS HA is achieved by running two namenodes for a cluster in an active/passive configuration. In an active/passive configuration, only one namenode is active at a time. When the active namenode becomes unavailable, the passive namenode takes over and promotes itself to be the active namenode. In this configuration, the two namenodes run on two physically different machines.
The two namenodes in an active/passive configuration are called the active namenode and the standby namenode respectively. Both the active and the standby namenode should run on equivalent hardware.
Using CDH5, high availability can be configured using the following two techniques:

Quorum-based storage
Shared storage using NFS
In the Quorum-based storage HA technique, the two namenodes use a Quorum Journal Manager (QJM) to share edit logs. As we already know, the edit log is the activity log of all file operations on the cluster. Both the active namenode and the standby namenode need to have their edit logs in sync. To achieve this, the active namenode communicates with the standby namenode using the JournalNode (JN) daemons. The active namenode reports every change to its edit log to the JournalNode daemons. The standby namenode reads the edit logs from the JournalNode daemons and applies all changes to its own namespace. By doing this, the two namenodes are always in sync.
In the following diagram, you see a typical architecture of HDFS HA using QJM. There are two namenodes, active and standby, that communicate with each other via JournalNodes.
The JournalNodes are daemons that run on JournalNode machines. It is advisable to use at least three JournalNode daemons to make sure the edit logs are written to three different locations allowing JournalNodes to tolerate the failure of a single machine. The system can tolerate the failure of (N-1)/2 JournalNodes, where N is the number of JournalNodes configured.
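The (N-1)/2 arithmetic can be sketched quickly. The following snippet is illustrative only and not part of Hadoop; the tolerated function simply evaluates the formula with integer division:

```shell
# Majority-quorum arithmetic for JournalNodes: with N JournalNodes,
# edits must reach a majority, so the system survives the loss of
# (N - 1) / 2 nodes (integer division).
tolerated() {
  echo $(( ($1 - 1) / 2 ))
}

tolerated 3   # 1 failure tolerated
tolerated 5   # 2 failures tolerated
tolerated 4   # still only 1 -- an even count adds no tolerance
```

This is why JournalNodes are deployed in odd numbers: going from three to four machines adds cost without adding fault tolerance.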
At any time, only the active namenode can write to the JournalNodes. This is handled by the JournalNodes to avoid the updates of the shared edit logs by two namenodes, which could cause data loss as well as incorrect results. The standby namenode will only have read access to the JournalNodes to update its own namespace.
When configuring HA, the secondary namenode is not used. As you will recall, the secondary namenode is a checkpoint service that performs periodic checkpoints of the edit logs in the primary namenode and updates the fsimage. The standby namenode in an HA environment performs the checkpoints and does not need a secondary namenode.
The datanodes of the cluster update both the active as well as the standby namenode with the location and heartbeat information, thus making it possible for a quick failover.
When the active namenode fails, the standby namenode assumes the responsibility of writing to the JournalNodes and takes over the active role.
There are two types of failover configurations: manual failover, which involves manually initiating the commands for failover and automatic failover, where there is no manual intervention.
To configure HDFS HA, a new property, NamenodeID, is used. The NamenodeID is used to distinguish each namenode in the cluster. Let's look at the properties defined in the hdfs-site.xml configuration file to set up HDFS HA:
dfs.nameservices: This property defines the logical name of the nameservice for the cluster. The following is a sample configuration entry for dfs.nameservices:

<property>
  <name>dfs.nameservices</name>
  <value>hcluster</value>
</property>
dfs.ha.namenodes.<nameserviceID>: This property defines the list of namenodes in the nameservice, where each namenode is identified by its NamenodeID. The following is a sample configuration entry for dfs.ha.namenodes:

<property>
  <name>dfs.ha.namenodes.hcluster</name>
  <value>nn1,nn2</value>
</property>
dfs.namenode.rpc-address: This property defines the fully qualified RPC address for each namenode. The following is a sample configuration entry for dfs.namenode.rpc-address:

<property>
  <name>dfs.namenode.rpc-address.hcluster.nn1</name>
  <value>node1.hcluster:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.hcluster.nn2</name>
  <value>node2.hcluster:8020</value>
</property>
dfs.namenode.http-address: This property defines the HTTP address of the web UI for each namenode. The following is a sample configuration entry for dfs.namenode.http-address:

<property>
  <name>dfs.namenode.http-address.hcluster.nn1</name>
  <value>node1.hcluster:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.hcluster.nn2</name>
  <value>node2.hcluster:50070</value>
</property>
dfs.namenode.shared.edits.dir: This property defines the URI of the JournalNode group to which the active namenode writes and from which the standby namenode reads the edits. The following is a sample configuration entry for dfs.namenode.shared.edits.dir:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node3.hcluster:8485;node4.hcluster:8485;node5.hcluster:8485/hcluster</value>
</property>
dfs.client.failover.proxy.provider: This property defines the Java class that HDFS clients use to determine which namenode is currently active. The implementation that ships with Hadoop is ConfiguredFailoverProxyProvider. The following is a sample configuration entry for dfs.client.failover.proxy.provider:

<property>
  <name>dfs.client.failover.proxy.provider.hcluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
fs.defaultFS: This property defines the default path prefix used by the Hadoop FS client and should point to the logical nameservice. It is set in the core-site.xml file. The following is a sample configuration entry for fs.defaultFS:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hcluster</value>
</property>
dfs.journalnode.edits.dir: This property defines the path on the local filesystem of each JournalNode machine where the edits are stored. The following is a sample configuration entry for dfs.journalnode.edits.dir:

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/tmp/jnode</value>
</property>
Apart from the preceding properties, there are properties meant for fencing. Fencing is a way to ensure that only one namenode writes to the JournalNodes. However, when a failover is initiated, the previously active namenode may still serve client requests until it shuts down, which happens only when it next tries to write to the JournalNodes. Using fencing methods, this namenode can be shut down as soon as the failover is initiated.
The following are two fencing methods:
sshfence
shell
The type of fencing to be used is defined by the dfs.ha.fencing.methods property, which is set in the hdfs-site.xml file.
The sshfence option provides a way to SSH into the target node and use the fuser command to kill the process listening on the service's TCP port. In other words, it kills the previously active namenode. For this action to work, the SSH login needs to happen without a passphrase, so the dfs.ha.fencing.ssh.private-key-files property needs to be configured. The following is a sample configuration entry to set up fencing:
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/admin/.ssh/id_rsa</value>
</property>
In the preceding configuration, we are using the sshfence option. The private key of the user admin is used to perform SSH without a passphrase.
Another way to configure sshfence would be to use a nonstandard username and port to connect via SSH, as shown in the following code:
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence([[username][:port]])</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>
A timeout property can be configured to time out the SSH session. If the SSH times out, the fencing mechanism is considered to have failed.
The shell fencing option provides a way to run an arbitrary shell command to fence the active namenode, as shown in the following code:
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh --namenode=$target_host --nameservice=$target_nameserviceid)</value>
</property>
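As an illustration, a minimal custom fence script could look like the following sketch. The script body and its logging are hypothetical; a real script would isolate the target node (for example, through an IPMI or PDU power command) and must exit with status 0 only when fencing has definitely succeeded:

```shell
#!/bin/sh
# Hypothetical fence script (the script.sh from the sample
# configuration). Hadoop passes the arguments configured in
# dfs.ha.fencing.methods after variable substitution.
# Exit status 0 tells Hadoop that fencing succeeded.

fence() {
  namenode_arg="$1"      # e.g. --namenode=node1.hcluster
  nameservice_arg="$2"   # e.g. --nameservice=hcluster
  # A real implementation would power off or network-isolate
  # the target here; this sketch only logs the request.
  echo "fencing requested: $namenode_arg $nameservice_arg"
  return 0
}

fence --namenode=node1.hcluster --nameservice=hcluster
```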
The following is the list of variables with reference to the node that needs to be fenced:

Variable | Description
---|---
$target_host | The hostname of the node to be fenced
$target_port | The IPC port of the node to be fenced
$target_address | The combination of the preceding two is configured as <host:port>
$target_nameserviceid | The nameservice ID of the namenode to be fenced
$target_namenodeid | The namenode ID of the namenode to be fenced
If a fencing method returns 0, the fencing operation is considered successful. Though fencing is not mandatory for HDFS HA using the Quorum Journal Manager, it is highly recommended. If no fencing is configured, a default value needs to be set for the property, as shown in the following code:
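This return-code contract allows several fencing methods to be listed in dfs.ha.fencing.methods; they are attempted in order until one succeeds. The following sketch uses stand-in commands (true and false) instead of real fencing methods to illustrate that semantics:

```shell
# Try each fencing method in order; stop at the first one that
# returns 0. If none succeeds, report failure (in Hadoop, the
# failover would then be aborted).
try_fencing() {
  for method in "$@"; do
    if $method; then
      echo "fenced by: $method"
      return 0
    fi
  done
  echo "all fencing methods failed"
  return 1
}

try_fencing false true    # 'false' fails, so 'true' fences
```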
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>
Once the configuration files are updated, copy the configurations to all the nodes in the cluster.
In this approach, the active and standby namenodes need to have access to a common shared storage such as a Network File System (NFS). The active namenode logs a copy of its namespace modifications on the shared network storage. The standby namenode reads these modifications, applies them to its own namespace, and thereby keeps its namespace in sync with the active namenode. In the event of a failover, the standby namenode has to wait until all the operations from the shared edits log have been applied to its namespace before transitioning into the active state. All datanodes are configured to send block information and heartbeats to both namenodes. A fencing method needs to be deployed for this approach to make sure that the previously active namenode is shut down before the standby namenode becomes active.
The hardware for the namenode machines in this architecture should be equivalent to that of a non-HA namenode setup. The shared storage should be accessible by both namenodes and should be configured with redundancy for the disk, network, and power to handle failures. The shared storage hardware and its access paths are critical to this architecture; they should be of high quality and provide multiple network access paths. This redundancy prevents the NFS (shared storage) from becoming the single point of failure.
Almost all the configuration parameters in the hdfs-site.xml configuration file are similar to the ones we used for Quorum-based storage. However, we need to update the following property to set up HDFS HA using NFS:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/shared_storage</value>
</property>
Here, the property dfs.namenode.shared.edits.dir points to a shared directory, which has been locally mounted on both namenodes.
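Before starting the namenodes in the NFS approach, it is worth sanity checking the mount on both machines. The helper below is not part of Hadoop, just a hypothetical pre-flight check; the real path would be the /mnt/shared_storage directory from the sample configuration:

```shell
# Hypothetical pre-flight check: verify the shared edits
# directory exists and is writable before starting a namenode.
check_shared_dir() {
  dir="$1"
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
  echo "ok: $dir"
}

# On a real cluster this would be /mnt/shared_storage; /tmp is
# used here only so the sketch can run anywhere.
check_shared_dir /tmp
```

Running this on both namenode machines catches a missing or read-only mount before it surfaces as a namenode startup failure.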
Once you are done deciding and configuring the desired method for HDFS HA (Quorum-based or Shared Storage), you need to perform the following steps:
Stop all Hadoop services as user hduser using the following command on every node:

$ for x in `cd /etc/init.d; ls hadoop*`; do sudo service $x stop ; done
Install the namenode package as user hduser on the node you want to configure as standby. As per our configuration, it is node2.hcluster:

$ sudo yum install hadoop-hdfs-namenode
Install the JournalNode package as user hduser on the nodes you want to use as JournalNodes. As per our configuration, we would need to install the JournalNode package on node3.hcluster, node4.hcluster, and node5.hcluster:

$ sudo yum install hadoop-hdfs-journalnode
This step can be skipped if you are using the Shared Storage approach.
Start the JournalNode daemons on the JournalNode machines using the following command:

$ sudo service hadoop-hdfs-journalnode start

In our configuration, the nodes are node3.hcluster, node4.hcluster, and node5.hcluster.
Run the following command as user hduser to initialize the shared edits directory from the local namenode edits directory:

$ sudo -u hdfs hdfs namenode -initializeSharedEdits
Start the primary namenode as user hduser using the following command:

$ sudo service hadoop-hdfs-namenode start
Bootstrap and start the standby namenode as user hduser using the following commands:

$ sudo -u hdfs hdfs namenode -bootstrapStandby
$ sudo service hadoop-hdfs-namenode start
Start all the Hadoop daemons as user hduser on all the nodes using the following command:

$ for x in `cd /etc/init.d; ls hadoop*`; do sudo service $x start ; done
The preceding steps should start the namenodes on node1.hcluster as well as node2.hcluster, along with the other Hadoop daemons. When a namenode starts, it is initially in the standby mode.
Use the hdfs haadmin command to perform the various administrative operations for HDFS HA. To see all the options available with this command, use the hdfs haadmin -help command as shown in the following code:
$ hdfs haadmin -help
Usage: DFSHAAdmin [-ns <nameserviceId>]
    [-transitionToActive <serviceId>]
    [-transitionToStandby <serviceId>]
    [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
    [-getServiceState <serviceId>]
    [-checkHealth <serviceId>]
    [-help <command>]
The following is the description of each option available for the hdfs haadmin command:

-transitionToActive: This option transitions the specified namenode to the active state.
-transitionToStandby: This option transitions the specified namenode to the standby state.

Usage of the preceding two options is not usually done on production systems as they do not support fencing.

-failover: This option initiates a failover between the two specified namenodes.
-getServiceState: This option displays whether the specified namenode is in the active or standby state.
-checkHealth: This option checks the health of the specified namenode.
-help: This option displays the help for the specified command.
For example, the following command performs a failover from nn2 to nn1, transitioning the namenode on node1.hcluster to active:

$ sudo -u hdfs hdfs haadmin -failover --forceactive nn2 nn1
After the namenode enters the standby state, it starts checkpointing the fsimage of the active namenode. The fencing configuration takes effect in case of any failures during failover.
Use the following commands to check the current state of each namenode:

$ sudo -u hdfs hdfs haadmin -getServiceState nn1
$ sudo -u hdfs hdfs haadmin -getServiceState nn2
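On top of the getServiceState option, one could script a small helper that reports which NamenodeID is currently active. The helper below is hypothetical, and the hdfs CLI is stubbed so the sketch runs without a cluster; on a real node, remove the stub so the real hdfs binary is used:

```shell
# Stub of the hdfs CLI for illustration only (remove on a real
# cluster). It pretends nn1 is standby and nn2 is active.
hdfs() {
  case "$3" in
    nn1) echo "standby" ;;
    nn2) echo "active" ;;
  esac
}

# Query each NamenodeID and print the first one that reports
# "active"; return 1 if neither namenode is active.
active_namenode() {
  for id in nn1 nn2; do
    state=$(hdfs haadmin -getServiceState "$id" 2>/dev/null)
    if [ "$state" = "active" ]; then
      echo "$id"
      return 0
    fi
  done
  return 1
}

active_namenode   # prints nn2, per the stub above
```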
A return value of 0 for the -checkHealth option indicates that the namenode is healthy. A nonzero value is returned if the namenode is unhealthy. As per the current implementation, this option will indicate an unhealthy status only if the namenode is down.

The following screenshot shows the summary section of the active namenode:
The following screenshot shows the summary section of the standby namenode:
In the Quorum-based storage approach, only the active namenode will be allowed to perform write operations to the JournalNodes. As shown in the following screenshot, this information is displayed under the NameNode Journal Status section:
The following screenshot shows the NameNode Journal Status of the standby namenode. In the Quorum-based storage approach, the standby namenode is only allowed to perform read operations on the JournalNodes. Also, this information is displayed under the NameNode Journal Status section:
Using the preceding steps, an administrator can perform a manual transition from one namenode to the other.
The following screenshot shows the NameNode Journal Status for the standby namenode in the Shared storage-based approach. As you can see in the following screenshot, the standby namenode in the Shared Storage-based approach is only allowed to read from the shared storage:
As you can see in the following screenshot, the active namenode in a Shared Storage configuration is allowed to write to the common shared location:
To perform an automatic failover where no manual intervention is required, we need to use Apache Zookeeper. As you saw in Chapter 3, Cloudera's Distribution Including Apache Hadoop, Apache Zookeeper is a distributed coordination service.
To configure automatic failover, the following two additional components are installed on an HDFS deployment:

A ZooKeeper quorum
The ZKFailoverController (ZKFC)
The ZooKeeper service is responsible for the following two operations:

Failure detection: ZooKeeper detects when the active namenode's session lapses and notifies that a failover should be triggered
Active namenode election: ZooKeeper provides a mechanism to exclusively elect one namenode as the active namenode
The ZKFC component is a ZooKeeper client that helps in monitoring and managing the state of the namenode. A ZKFC service runs on each machine that runs a namenode.
The ZKFC service is responsible for the following operations:

Health monitoring: The ZKFC periodically pings its local namenode and marks it as unhealthy if it does not respond
ZooKeeper session management: When the local namenode is healthy, the ZKFC keeps a session open in ZooKeeper, along with a special lock znode if the namenode is active
ZooKeeper-based election: If the local namenode is healthy and no other namenode currently holds the lock znode, the ZKFC tries to acquire the lock; on success, it transitions its local namenode to active
The ZooKeeper daemons typically run on three or five nodes (number of nodes should be an odd number) and can be collocated with the active and standby namenodes.
Perform the following steps as user hduser to configure ZooKeeper for automatic failover:
Stop all Hadoop services using the following command on every node:

$ for x in `cd /etc/init.d; ls hadoop*`; do sudo service $x stop; done
Install the ZooKeeper server package on the nodes designated to run the ZooKeeper service:

$ sudo yum install zookeeper-server
Initialize and start the ZooKeeper server on each of these nodes, using a unique --myid value on each node:

$ sudo service zookeeper-server init --myid=1 --force
$ sudo service zookeeper-server start
Install the ZKFC package on both the namenode machines:

$ sudo yum install hadoop-hdfs-zkfc
Update the hdfs-site.xml file to include the following property and copy it to all the nodes:

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
Update the core-site.xml file to include the following property and copy it to all the nodes:

<property>
  <name>ha.zookeeper.quorum</name>
  <value>node1.hcluster:2181,node2.hcluster:2181,node3.hcluster:2181</value>
</property>
Initialize the HA state in ZooKeeper by running the following command from one of the namenode hosts:

$ hdfs zkfc -formatZK
This command creates a znode in ZooKeeper, which is used by the automatic failover system to store its data.
Start all Hadoop services using the following command on every node:

$ for x in `cd /etc/init.d; ls hadoop*`; do sudo service $x start; done
Start the ZKFC daemon on both the namenode machines using the following command:

$ sudo service hadoop-hdfs-zkfc start
Your cluster is now configured for automatic failover. You can test this configuration by manually killing the active namenode using kill -9 to see if the failover occurs.