Starting Hadoop in distributed mode

As mentioned in the previous recipe, Hadoop supports three different operating modes:

  • Standalone mode
  • Pseudo-distributed mode
  • Fully-distributed mode

This recipe will describe how to set up Hadoop to run in fully-distributed mode. In fully-distributed mode, HDFS and the MapReduce services run across multiple machines. A typical architecture is to have one dedicated node run the NameNode and JobTracker services, another dedicated node host the Secondary NameNode service, and the remaining nodes in the cluster run both the DataNode and TaskTracker services.

Getting ready

This recipe will assume that steps 1 through 5 from the recipe Starting Hadoop in pseudo-distributed mode of this chapter have been completed. There should be a user named hadoop on every node in the cluster. In addition, the RSA public key generated in step 2 of the previous recipe must be distributed and installed on every node in the cluster using the ssh-copy-id command. Finally, the Hadoop distribution should be extracted and deployed on every node in the cluster.
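Distributing the public key to every node can be scripted. The following is a minimal sketch, assuming the hostnames used in this recipe (head, secondary, worker1 through worker3) and a key pair at the default location ~/.ssh/id_rsa.pub:

```shell
# Run as the hadoop user from the machine where the key pair was generated.
# Hostnames match the cluster layout used in this recipe; adjust as needed.
for host in head secondary worker1 worker2 worker3; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@"$host"
done
```

Each ssh-copy-id invocation appends the key to ~/.ssh/authorized_keys on the target node, prompting for the hadoop user's password once per host.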

We will now discuss the specific configurations required to get the cluster running in distributed mode. We will assume that your cluster will use the following configuration:

Server name | Purpose                                        | Number of dedicated machines
----------- | ---------------------------------------------- | ----------------------------
head        | Will run the NameNode and JobTracker services  | 1
secondary   | Will run the Secondary NameNode service        | 1
worker(n)   | Will run the TaskTracker and DataNode services | 3 or greater

How to do it...

Perform the following steps to start Hadoop in fully-distributed mode:

  1. Update the following configuration files on all of the nodes in the cluster:
    $ vi conf/core-site.xml
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://head:8020</value>
      </property>
    </configuration>
    
    $ vi conf/hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>
    
    $ vi conf/mapred-site.xml
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>head:8021</value>
      </property>
    </configuration>
  2. Update the masters and slaves configuration files on the head node. The masters configuration file contains the hostname of the node which will run the Secondary NameNode. The slaves configuration file contains a list of the hosts which will run the TaskTracker and DataNode services:
    $ vi conf/masters
    secondary
    $ vi conf/slaves
    worker1
    worker2
    worker3
  3. Format the Hadoop NameNode from the head node:
    $ bin/hadoop namenode -format
  4. From the head node as the hadoop user, start all of the Hadoop services:
    $ bin/start-all.sh
  5. Confirm that all of the correct services are running on the proper nodes:
    • On the head node: Both the NameNode and JobTracker services should be running
    • On the secondary: The Secondary NameNode service should be running
    • On the worker nodes: The DataNode and TaskTracker services should be running
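The checks in step 5 can be automated with the jps tool that ships with the JDK. This is a sketch, assuming passwordless SSH is working and jps is on each node's PATH:

```shell
# List the running Hadoop Java processes on every node in the cluster.
for host in head secondary worker1 worker2 worker3; do
  echo "--- $host ---"
  ssh hadoop@"$host" jps
done
```

On head, the output should include NameNode and JobTracker; on secondary, SecondaryNameNode; and on each worker, DataNode and TaskTracker.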

How it works...

First, we changed the Hadoop configuration files core-site.xml, hdfs-site.xml, and mapred-site.xml on every node in the cluster. These files tell the Hadoop services running on each node where to find the NameNode and JobTracker services. In addition, we changed the HDFS replication factor from 1 to 3. Since we have three or more worker nodes available, a replication factor of 3 keeps each block of data available even if one of the worker nodes experiences a failure.
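Once files have been written to HDFS, one way to verify that the new replication factor is in effect is the HDFS fsck utility. A sketch, run from the head node as the hadoop user (checking "/" inspects the entire filesystem):

```shell
# Report per-file block counts and replication across the whole of HDFS.
bin/hadoop fsck / -files -blocks
```

The report lists each block's actual replication; under-replicated blocks are called out explicitly.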

There's more...

It is not necessary to run the Secondary NameNode on a separate node. You can run the Secondary NameNode on the same node as the NameNode and JobTracker, if you wish. To do this, stop the cluster, modify the masters configuration file on the master node, and restart all of the services:

$ bin/stop-all.sh
$ vi conf/masters
head
$ bin/start-all.sh

Another set of configuration parameters that will come in handy when your cluster grows, or when you wish to perform maintenance, are the exclusion list parameters. The dfs.hosts.exclude property names a file listing hosts barred from connecting to the NameNode, and belongs in hdfs-site.xml; the mapred.hosts.exclude property does the same for the JobTracker, and belongs in mapred-site.xml. These configuration parameters will be used later when we discuss decommissioning of a node in the cluster:

$ vi conf/hdfs-site.xml
<property>
    <name>dfs.hosts.exclude</name>
    <value>/path/to/hadoop/dfs_excludes</value>
    <final>true</final>
</property>

$ vi conf/mapred-site.xml
<property>
    <name>mapred.hosts.exclude</name>
    <value>/path/to/hadoop/mapred_excludes</value>
    <final>true</final>
</property>

Create two empty files named dfs_excludes and mapred_excludes for future use:

$ touch /path/to/hadoop/dfs_excludes
$ touch /path/to/hadoop/mapred_excludes

Start the cluster:

$ bin/start-all.sh
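As a preview of how these exclude files will be used, decommissioning a node amounts to adding its hostname to both files and asking the running services to re-read them. A sketch, using the hypothetical host worker3 and the Hadoop 1.x dfsadmin and mradmin -refreshNodes switches:

```shell
# Add the node to both exclude lists (worker3 is a placeholder hostname).
echo "worker3" >> /path/to/hadoop/dfs_excludes
echo "worker3" >> /path/to/hadoop/mapred_excludes
# Tell the NameNode and the JobTracker to re-read their exclude files.
bin/hadoop dfsadmin -refreshNodes
bin/hadoop mradmin -refreshNodes
```

The full procedure is covered in the Safely decommissioning nodes recipe referenced below.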

See also

  • Adding new nodes to an existing cluster
  • Safely decommissioning nodes