Starting Hadoop in pseudo-distributed mode

Hadoop supports three different operating modes:

  • Standalone mode: In this mode, Hadoop will run as a single process on a single node.
  • Pseudo-distributed mode: In this mode, Hadoop will run all services in separate processes on a single node.
  • Fully-distributed mode: In this mode, Hadoop will run all services in separate processes across multiple nodes.

This recipe will describe how to install and set up Hadoop to run in pseudo-distributed mode. In pseudo-distributed mode, all of the HDFS and MapReduce processes run on a single node. Pseudo-distributed mode is an excellent environment for testing your HDFS operations and your MapReduce applications on a subset of the data.

Getting ready

Ensure that you have Java 1.6, ssh, and sshd installed. In addition, the ssh daemon (sshd) should be running on the node. You can validate the installation of these applications by using the following commands:

$ java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

$ ssh
usage: ssh [-1246AaCfgkMNnqsTtVvXxY] [-b bind_address] [-c cipher_spec]
           [-D [bind_address:]port] [-e escape_char] [-F configfile]
           [-i identity_file] [-L [bind_address:]port:host:hostport]
           [-l login_name] [-m mac_spec] [-O ctl_cmd] [-o option] [-p port]
           [-R [bind_address:]port:host:hostport] [-S ctl_path]
           [-w tunnel:tunnel] [user@]hostname [command]

$ service sshd status
openssh-daemon (pid  2004) is running...
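
If your system does not provide the service command shown previously, a rough equivalent (assuming the pgrep utility is installed) is to look for the sshd process directly; a printed process ID means the daemon is running:

$ pgrep -x sshd
2004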

How to do it...

Carry out the following steps to start Hadoop in pseudo-distributed mode:

  1. Create a Hadoop user account. This is not strictly required to get Hadoop running in pseudo-distributed mode, but it is a common and good security practice. Ensure that the JAVA_HOME environment variable is set to the folder of the system's Java installation:
    # useradd hadoop
    # passwd hadoop
    # su - hadoop
    $ echo $JAVA_HOME
    /usr/java/jdk1.6.0_31
  2. Generate an ssh public and private key pair to allow password-less login to the node using the Hadoop user account. When asked for a passphrase, hit the Enter key, ensuring no passphrase will be used:
    $ su - hadoop
    $ ssh-keygen -t rsa
  3. Add the public key to the authorized keys list (if the ssh-copy-id utility is not available on your system, see the manual alternative sketched after these steps):

    Note

    If you have more than one node, you will need to copy this key to every node in the cluster.

    $ ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@localhost
  4. Test the password-less ssh login. You should be able to ssh to localhost using your hadoop account without providing a password:
    $ ssh localhost
  5. Download a Hadoop distribution from http://hadoop.apache.org using the Hadoop user account. We used Hadoop 0.20.x for this installation:
    # su - hadoop
    $ tar -zxvf hadoop-0.20.x.tar.gz
  6. Change the following configuration files located in the conf folder of the extracted Hadoop distribution. These configuration changes will allow Hadoop to run in pseudo-distributed mode:
    $ vi conf/core-site.xml
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
      </property>
    </configuration>
    $ vi conf/hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
    $ vi conf/mapred-site.xml
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
      </property>
    </configuration>
  7. Format the Hadoop NameNode:
    $ bin/hadoop namenode -format
  8. Start all of the Hadoop HDFS and MapReduce services:
    $ bin/start-all.sh
  9. Verify that all of the services started successfully by looking at the NameNode status page at http://localhost:50070/ and the JobTracker page at http://localhost:50030/ (a quick command-line check is sketched after these steps). You can stop all of the Hadoop services by running the bin/stop-all.sh script.
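
As a quick command-line complement to the status pages in step 9, the jps utility that ships with the JDK lists the running Java processes. In pseudo-distributed mode you should see one process per Hadoop service; the process IDs below are only placeholders:

$ jps
4825 NameNode
4921 DataNode
5038 SecondaryNameNode
5127 JobTracker
5229 TaskTracker
5466 Jps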
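
If the ssh-copy-id utility used in step 3 is not available on your system, the same result can be achieved by appending the public key to the authorized keys file by hand. This is a minimal sketch, assuming the default key location from step 2; the chmod matters because sshd refuses to honor a key file with overly permissive access:

$ cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
$ chmod 600 /home/hadoop/.ssh/authorized_keys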

How it works...

Steps 1 through 4 set up a single node for a password-less login using ssh.

Next, we downloaded a distribution of Hadoop and configured it to run in pseudo-distributed mode. The fs.default.name property was set to a URI that tells Hadoop where to find the HDFS implementation, which runs on our local machine and listens on port 8020. Next, we set the replication factor of HDFS to 1 using the dfs.replication property. Since all of the Hadoop services run on a single node, there is no benefit to replicating blocks; every copy would reside on the same node and provide no additional fault tolerance. We set the value of the last configuration property, mapred.job.tracker, to localhost:8021. The mapred.job.tracker property tells Hadoop where to find the JobTracker.
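
With these properties in place, the hadoop command-line client resolves filesystem paths against the HDFS instance named by fs.default.name. Here is a quick smoke test, assuming the services from step 8 are running (the paths are only examples):

$ bin/hadoop fs -mkdir /user/hadoop
$ bin/hadoop fs -put /etc/hosts /user/hadoop/hosts.txt
$ bin/hadoop fs -ls /user/hadoop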

Finally, we formatted the NameNode and started the Hadoop services. The NameNode only needs to be formatted once, when a new Hadoop cluster is first set up; be careful, because formatting the NameNode of an existing cluster will erase all of the data stored in HDFS.

There's more...

By default, the Hadoop distribution comes configured to run in standalone mode. In standalone mode, there is no need to start any Hadoop service. In addition, input and output folders will be located on the local filesystem, instead of HDFS. To run a MapReduce job in standalone mode, use the configuration files that initially came with the distribution. Create an input folder on the local filesystem and use the Hadoop shell script:

$ mkdir input
$ cp somefiles*.txt input/
$ /path/to/hadoop/bin/hadoop jar myjar.jar input/*.txt output
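
If you do not have a job jar of your own handy, the distribution bundles example jobs such as wordcount. The exact jar name varies with the Hadoop version (for 0.20.x it is something like hadoop-0.20.x-examples.jar in the distribution's root folder), so the glob below is only a convenience:

$ /path/to/hadoop/bin/hadoop jar hadoop-*-examples.jar wordcount input output
$ cat output/part-*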

See also

  • Starting Hadoop in distributed mode