Adding new nodes to an existing cluster

Hadoop supports adding new nodes to an existing cluster without shutting down or restarting any service. This recipe will outline the steps required to add a new node to a pre-existing cluster.

Getting ready

Ensure that you have a Hadoop cluster up and running. In addition, ensure that you have the Hadoop distribution extracted, and the configuration files have been updated with the settings from the recipe titled Starting Hadoop in distributed mode.

We will use the following terms for our imaginary cluster:

Server name | Purpose                                        | Number of dedicated machines
------------|------------------------------------------------|-----------------------------
head        | Will run the NameNode and JobTracker services  | 1
secondary   | Will run the Secondary NameNode service        | 1
worker(n)   | Will run the TaskTracker and DataNode services | 3 or greater

How to do it...

Follow these steps to add new nodes to an existing cluster:

  1. From the head node, update the slaves configuration file with the hostname of the new node:
    $ vi conf/slaves
    worker1
    worker2
    worker3
    worker4
    
  2. Log in to the new node and start the DataNode and TaskTracker services:
    $ ssh hadoop@worker4
    $ cd /path/to/hadoop
    $ bin/hadoop-daemon.sh start datanode
    $ bin/hadoop-daemon.sh start tasktracker
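The two steps above can be wrapped in a small helper on the head node. This is only a sketch: the `add_slave` function name is an invention for illustration, and the hostnames, SSH user, and Hadoop path are the placeholder values used in this recipe.

```shell
#!/bin/sh
# Sketch of a helper for registering a new worker on the head node.

# add_slave: append a hostname to the slaves file only if it is not
# already listed, so re-running the script is harmless.
add_slave() {
    node="$1"
    slaves_file="$2"
    if ! grep -qx "$node" "$slaves_file"; then
        echo "$node" >> "$slaves_file"
    fi
}

# Usage on the head node, following the recipe's example values:
#   add_slave worker4 conf/slaves
#   ssh hadoop@worker4 "cd /path/to/hadoop && \
#       bin/hadoop-daemon.sh start datanode && \
#       bin/hadoop-daemon.sh start tasktracker"
```

Guarding the append with `grep -qx` (exact whole-line match) avoids adding a duplicate entry if the node is already in conf/slaves.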

How it works...

We updated the slaves configuration file on the head node to tell the Hadoop framework that a new node exists in the cluster. However, this file is only read when the Hadoop services are started (for example, by executing the bin/start-all.sh script). In order to add the new node to the cluster without having to restart all of the Hadoop services, we logged into the new node, and started the DataNode and TaskTracker services manually.

Note

The DataNode and TaskTracker services will automatically start the next time the cluster is restarted.

There's more...

When you add a new node to the cluster, the cluster is not properly balanced. HDFS will not automatically redistribute any existing data to the new node in order to balance the cluster. To rebalance the existing data in the cluster, you can run the following command from the head node:

$ bin/start-balancer.sh
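The balancer also accepts a threshold argument, a percentage that controls how close each DataNode's disk utilization must be to the cluster average before the balancer considers the cluster balanced. A tighter threshold moves more data; the value 5 below is just an example:

```shell
# Balance until every DataNode's utilization is within 5 percentage
# points of the cluster-wide average (the default threshold is 10).
$ bin/start-balancer.sh -threshold 5
```

The balancer runs as a background process and can be stopped at any time with `bin/stop-balancer.sh`, so it is safe to schedule it during low-traffic windows.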

Note

Rebalancing a Hadoop cluster is a network-intensive task. Depending on how much data the cluster holds and how many nodes were added, terabytes of data may need to be moved between nodes. Job performance can degrade while a cluster is rebalancing, so regular rebalancing maintenance should be properly planned.

See also

  • Safely decommissioning nodes