Installing Apache Hadoop

Now that we have seen both the storage and processing parts of a Hadoop cluster, let's get started with the installation of Hadoop. We will use Hadoop 2.2.0 in this chapter.

Note

Hadoop 2.2.0 is not compatible with Hadoop 1.X versions.

We will be setting up a cluster on a single node. Before starting, please make sure that you have the following software installed on your system:

  • JDK 1.7: We need JDK to run Hadoop as it is written in Java
  • ssh-keygen: This is used to generate the SSH keys needed to set up the password-less SSH required by Hadoop

If you don't have ssh-keygen, install it with the following command on RHEL/CentOS-based systems:

yum install openssh-clients
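
If you are on a Debian or Ubuntu-based system instead, ssh-keygen is provided by the openssh-client package, so the equivalent command should be:

apt-get install openssh-client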

Next, we will need to set up password-less SSH on this machine as it is required for Hadoop.

Setting up password-less SSH

In a Hadoop cluster, a command executed on one machine often needs to launch further processes on the other nodes in the cluster. For example, when starting HDFS, a DataNode process is started on each of the machines. This is done automatically by the scripts provided with your Hadoop distribution. Password-less SSH between all the machines in a Hadoop cluster is a mandatory requirement for these scripts to run without any user intervention. The following are the steps for setting up password-less SSH:

  1. Generate your ssh key pair by executing the following command:
    ssh-keygen -t rsa -P ''
    

    The following information is displayed:

    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/anand/.ssh/id_rsa):
    Your identification has been saved in /home/anand/.ssh/id_rsa.
    Your public key has been saved in /home/anand/.ssh/id_rsa.pub.
    The key fingerprint is:
    b7:06:2d:76:ed:df:f9:1d:7e:5f:ed:88:93:54:0f:… anand@localhost.localdomain
    The key's randomart image is:
    +--[ RSA 2048]----+
    |                 |
    |            E .  |
    |             o   |
    |         . .  o  |
    |        S + ..o  |
    |       . = o.   o|
    |          o... .o|
    |         .  oo.+*|
    |            ..ooX|
    +-----------------+
    
  2. Next, we need to append the generated public key to the list of authorized keys for the current user. To do this, execute the following command:
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    
  3. Now, we can check whether password-less SSH is working by connecting to localhost with ssh using the following command:
    ssh localhost
    

    The following output is displayed:

    Last login: Wed Apr 2 09:12:17 2014 from localhost
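
If the ssh command still prompts for a password, the permissions on the ~/.ssh directory are a common cause; a quick fix (assuming the default OpenSSH configuration) is:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys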
    

Since we are able to SSH into localhost without a password, our setup is working, and we can now proceed with the Hadoop setup.

Getting the Hadoop bundle and setting up environment variables

The following are the steps to set up Hadoop:

  1. Download Hadoop 2.2.0 from the Apache website at http://hadoop.apache.org/releases.html#Download.
  2. Untar the archive to the location where you want to install Hadoop using the following commands. We will refer to this location as $HADOOP_HOME:
    tar xzf hadoop-2.2.0.tar.gz
    cd hadoop-2.2.0
    
  3. Next, we need to set up the environment variables and the PATH for Hadoop. Add the following entries to your ~/.bashrc file, making sure that the Java and Hadoop paths match your system:
    export JAVA_HOME=/usr/java/jdk1.7.0_45
    export HADOOP_HOME=/home/anand/opt/hadoop-2.2.0
    export HADOOP_COMMON_HOME=/home/anand/opt/hadoop-2.2.0
    export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
    export HADOOP_YARN_HOME=$HADOOP_COMMON_HOME
    export HADOOP_CONF_DIR=$HADOOP_COMMON_HOME/etc/hadoop
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_COMMON_HOME/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_COMMON_HOME/lib"
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_COMMON_HOME/bin:$HADOOP_COMMON_HOME/sbin
    
  4. Refresh your ~/.bashrc file with the following command:
    source ~/.bashrc
    
  5. Now, let's check whether the paths are properly configured with the following command:
    hadoop version
    

    The following information is displayed:

    Hadoop 2.2.0
    Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
    Compiled by hortonmu on 2013-10-07T06:28Z
    Compiled with protoc 2.5.0
    From source with checksum 79e53ce7994d1628b240f09af91e1af4
    This command was run using /home/anand/opt/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
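
In addition to running hadoop version, you can quickly confirm that the shell resolves the new variables and finds the Hadoop binaries, for example:

echo $HADOOP_HOME
which hadoop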
    

With the paths properly set, we will now set up HDFS on our system.

Setting up HDFS

Please perform the following steps to set up HDFS:

  1. Make directories to hold the NameNode and DataNode data as follows:
    mkdir -p ~/mydata/hdfs/namenode
    mkdir -p ~/mydata/hdfs/datanode
    
  2. Specify the NameNode port in the core-site.xml file at the $HADOOP_CONF_DIR directory by adding the following property inside the <configuration> tag:
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:19000</value>
      <!-- The default port for HDFS is 9000, but we are using 19000 because Storm-YARN uses port 9000 for its application master -->
    </property>
  3. Specify the replication factor and the NameNode and DataNode data directories in the hdfs-site.xml file at $HADOOP_CONF_DIR by adding the following properties inside the <configuration> tag:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <!-- Since we have only one node, we have replication factor=1 -->
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/home/anand/mydata/hdfs/namenode</value>
      <!-- specify absolute path of the namenode directory -->
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/home/anand/mydata/hdfs/datanode</value>
      <!-- specify absolute path of the datanode directory -->
    </property>
  4. Now, we will format the NameNode. This is a one-time process and needs to be done only while setting up HDFS using the following command:
    hdfs namenode -format
    

    The following output is displayed:

    14/04/02 09:03:06 INFO namenode.NameNode: STARTUP_MSG:
    /*********************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 2.2.0
    … …
    14/04/02 09:03:08 INFO namenode.NameNode: SHUTDOWN_MSG:
    /*********************************************************
    SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
    ********************************************************/
    
  5. Now, we are done with the configuration and will start HDFS with the following command:
    start-dfs.sh
    

    The following information is displayed:

    14/04/02 09:27:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Starting namenodes on [localhost]
    localhost: starting namenode, logging to /home/anand/opt/hadoop-2.2.0/logs/hadoop-anand-namenode-localhost.localdomain.out
    localhost: starting datanode, logging to /home/anand/opt/hadoop-2.2.0/logs/hadoop-anand-datanode-localhost.localdomain.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /home/anand/opt/hadoop-2.2.0/logs/hadoop-anand-secondarynamenode-localhost.localdomain.out
    14/04/02 09:27:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    
  6. Now, execute the jps command to see whether all the processes are running fine:
    jps
    

    We will get the following output:

    50275 NameNode
    50547 SecondaryNameNode
    50394 DataNode
    51091 Jps
    

    Here, we can see that all the expected processes are running.

  7. Now, you can check the status of HDFS using the NameNode Web UI by opening http://localhost:50070 in your browser. You should see something similar to the following screenshot:

    The NameNode Web UI

  8. You can interact with HDFS using the hdfs dfs command. Get all the options by running the hdfs dfs command on the console or refer to the documentation at http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/FileSystemShell.html. Most of the commands mirror the filesystem interaction commands that you'll find on any Linux system. For example, to copy a file on HDFS, use the following command:
    hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
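
As a further example, a typical first interaction with a fresh HDFS instance might look like the following (the /user/anand directory and the sample.txt file used here are only placeholders):

hdfs dfs -mkdir -p /user/anand
hdfs dfs -put sample.txt /user/anand/
hdfs dfs -ls /user/anand

This creates a home directory for the current user on HDFS, uploads a local file into it, and lists the directory contents.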
    

Now that HDFS is deployed, we will set up YARN next.

Setting up YARN

The following are the steps to set up YARN:

  1. Create the mapred-site.xml file from mapred-site.xml.template using the following command:
    cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
    
  2. Specify that we are using the YARN framework by adding the following property inside the <configuration> tag of the mapred-site.xml file located in the $HADOOP_CONF_DIR directory:
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  3. Configure the following properties in the yarn-site.xml file, also located in the $HADOOP_CONF_DIR directory. With the values below, the NodeManager offers 4,096 MB of memory to containers and each container is allocated at least 1,024 MB, so at most four containers can run concurrently on this node:
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    
    <property>
      <!-- Minimum amount of memory allocated for containers in MBs.-->
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
    </property>
    
    <property>
      <!--Total memory that can be allocated to containers in MBs. -->
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>4096</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
      <!-- This is the ratio of virtual memory to physical memory allowed when setting memory limits for containers. If you don't have enough RAM, increase this value. -->
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>8</value>
    </property>
  4. Start the YARN processes with the following command:
    start-yarn.sh
    

    The following information is displayed:

    starting yarn daemons
    starting resourcemanager, logging to /home/anand/opt/hadoop-2.2.0/logs/yarn-anand-resourcemanager-localhost.localdomain.out
    localhost: starting nodemanager, logging to /home/anand/opt/hadoop-2.2.0/logs/yarn-anand-nodemanager-localhost.localdomain.out
    
  5. Now, execute the jps command to see whether all the processes are running fine:
    jps
    

    We will get the following output:

    50275 NameNode
    50547 SecondaryNameNode
    50394 DataNode
    51091 Jps
    50813 NodeManager
    50716 ResourceManager
    

    Here, we can see that all the expected processes are running.

  6. Now, you can check the status of YARN using the ResourceManager Web UI by opening http://localhost:8088/cluster in your browser. You should see something similar to the following screenshot:

    The ResourceManager Web UI

  7. You can interact with YARN using the yarn command. Get all the options by running the yarn command on your console, or refer to the documentation at http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html. To get all the applications currently running on YARN, run the following command:
    yarn application -list
    

    The following information is displayed:

    14/04/02 11:41:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/04/02 11:41:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
                    Application-Id    Application-Name    Application-Type      User     Queue             State       Final-State       Progress                       Tracking-URL
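
To verify that YARN can actually schedule and run containers end to end, you can submit one of the example MapReduce jobs bundled with the distribution (the jar location below assumes the default layout of the Hadoop 2.2.0 tarball):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10

If the job completes and prints an estimated value of Pi, HDFS, the ResourceManager, and the NodeManager are working together; while it runs, the application will also appear in the ResourceManager Web UI and in the yarn application -list output. When you want to shut the single-node cluster down, run stop-yarn.sh followed by stop-dfs.sh.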
    

With this, we have completed the deployment of a Hadoop cluster on a single node. Next, we will see how to run Storm topologies on this cluster.
