Installing Apache Hadoop

Now that we have seen both the storage and processing parts of a Hadoop cluster, let's get started with the installation of Hadoop. We will use Hadoop 2.2.0 in this chapter.

Note

Hadoop 2.2.0 is not compatible with Hadoop 1.X versions.

We will be setting up a cluster on a single node. Before starting, please make sure that you have the following software installed on your system:

  • JDK 1.7: We need JDK to run Hadoop as it is written in Java
  • ssh-keygen: This is used to generate the SSH keys needed to set up the password-less SSH required by Hadoop

If you don't have ssh-keygen, install it with the following command on RHEL/CentOS-based systems:

yum install openssh-clients
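
If you are on a Debian or Ubuntu-based system instead, ssh-keygen is provided by the openssh-client package, so the equivalent command should be:

apt-get install openssh-client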

Next, we will need to set up password-less SSH on this machine as it is required for Hadoop.

Setting up password-less SSH

In a Hadoop cluster, a command executed on one machine often needs to launch further processes on the other nodes in the cluster. For example, when starting HDFS, a DataNode process is started on each of the machines. This is done automatically by the scripts provided with your Hadoop distribution. Password-less SSH between all the machines in a Hadoop cluster is a mandatory requirement for these scripts to run without any user intervention. The following are the steps for setting up password-less SSH:

  1. Generate your ssh key pair by executing the following command:
    ssh-keygen -t rsa -P ''
    

    The following information is displayed:

    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/anand/.ssh/id_rsa):
    Your identification has been saved in /home/anand/.ssh/id_rsa.
    Your public key has been saved in /home/anand/.ssh/id_rsa.pub.
    The key fingerprint is:
    b7:06:2d:76:ed:df:f9:1d:7e:5f:ed:88:93:54:0f:… anand@localhost.localdomain
    The key's randomart image is:
    +--[ RSA 2048]----+
    |                 |
    |            E .  |
    |             o   |
    |         . .  o  |
    |        S + ..o  |
    |       . = o.   o|
    |          o... .o|
    |         .  oo.+*|
    |            ..ooX|
    +-----------------+
    
  2. Next, we need to append the generated public key to the list of authorized keys for the current user. To do this, execute the following command:
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    
  3. Now, we can check whether password-less SSH is working by connecting to localhost with ssh using the following command:
    ssh localhost
    

    The following output is displayed:

    Last login: Wed Apr 2 09:12:17 2014 from localhost
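
If the ssh command still prompts for a password, the permissions on the ~/.ssh directory are a common cause; a quick fix (assuming the default OpenSSH configuration) is:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys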
    

Since we are able to SSH into localhost without a password, our setup is working, and we can now proceed with the Hadoop setup.

Getting the Hadoop bundle and setting up environment variables

The following are the steps to set up Hadoop:

  1. Download Hadoop 2.2.0 from the Apache website at http://hadoop.apache.org/releases.html#Download.
  2. Untar the archive to the location where you want to install Hadoop using the following commands. We will refer to this location as $HADOOP_HOME:
    tar xzf hadoop-2.2.0.tar.gz
    cd hadoop-2.2.0
    
  3. Next, we need to set up the environment variables and the PATH for Hadoop. Add the following entries to your ~/.bashrc file, making sure that the Java and Hadoop paths match your system:
    export JAVA_HOME=/usr/java/jdk1.7.0_45
    export HADOOP_HOME=/home/anand/opt/hadoop-2.2.0
    export HADOOP_COMMON_HOME=/home/anand/opt/hadoop-2.2.0
    export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
    export HADOOP_YARN_HOME=$HADOOP_COMMON_HOME
    export HADOOP_CONF_DIR=$HADOOP_COMMON_HOME/etc/hadoop
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_COMMON_HOME/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_COMMON_HOME/lib"
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_COMMON_HOME/bin:$HADOOP_COMMON_HOME/sbin
    
  4. Refresh your ~/.bashrc file with the following command:
    source ~/.bashrc
    
  5. Now, let's check whether the paths are properly configured with the following command:
    hadoop version
    

    The following information is displayed:

    Hadoop 2.2.0
    Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
    Compiled by hortonmu on 2013-10-07T06:28Z
    Compiled with protoc 2.5.0
    From source with checksum 79e53ce7994d1628b240f09af91e1af4
    This command was run using /home/anand/opt/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
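
In addition to running hadoop version, you can quickly confirm that the shell resolves the new variables and finds the Hadoop binaries, for example:

echo $HADOOP_HOME
which hadoop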
    

With the paths properly set, we will now set up HDFS on our system.

Setting up HDFS

Please perform the following steps to set up HDFS:

  1. Make directories to hold the NameNode and DataNode data as follows:
    mkdir -p ~/mydata/hdfs/namenode
    mkdir -p ~/mydata/hdfs/datanode
    
  2. Specify the NameNode port in the core-site.xml file at the $HADOOP_CONF_DIR directory by adding the following property inside the <configuration> tag:
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:19000</value>
      <!-- The default port for HDFS is 9000, but we are using 19000 because Storm-YARN uses port 9000 for its application master -->
    </property>
  3. Specify the replication factor and the NameNode and DataNode data directories in the hdfs-site.xml file at $HADOOP_CONF_DIR by adding the following properties inside the <configuration> tag:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <!-- Since we have only one node, we have replication factor=1 -->
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/home/anand/mydata/hdfs/namenode</value>
      <!-- specify absolute path of the namenode directory -->
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/home/anand/mydata/hdfs/datanode</value>
      <!-- specify absolute path of the datanode directory -->
    </property>
  4. Now, we will format the NameNode. This is a one-time process and needs to be done only while setting up HDFS using the following command:
    hdfs namenode -format
    

    The following output is displayed:

    14/04/02 09:03:06 INFO namenode.NameNode: STARTUP_MSG:
    /*********************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 2.2.0
    … …
    14/04/02 09:03:08 INFO namenode.NameNode: SHUTDOWN_MSG:
    /*********************************************************
    SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
    ********************************************************/
    
  5. Now, we are done with the configuration and will start HDFS with the following command:
    start-dfs.sh
    

    The following information is displayed:

    14/04/02 09:27:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Starting namenodes on [localhost]
    localhost: starting namenode, logging to /home/anand/opt/hadoop-2.2.0/logs/hadoop-anand-namenode-localhost.localdomain.out
    localhost: starting datanode, logging to /home/anand/opt/hadoop-2.2.0/logs/hadoop-anand-datanode-localhost.localdomain.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /home/anand/opt/hadoop-2.2.0/logs/hadoop-anand-secondarynamenode-localhost.localdomain.out
    14/04/02 09:27:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    
  6. Now, execute the jps command to see whether all the processes are running fine:
    jps
    

    We will get the following output:

    50275 NameNode
    50547 SecondaryNameNode
    50394 DataNode
    51091 Jps
    

    Here, we can see that all the expected processes are running.

  7. Now, you can check the status of HDFS using the NameNode Web UI by opening http://localhost:50070 in your browser. You should see something similar to the following screenshot:

    The NameNode Web UI

  8. You can interact with HDFS using the hdfs dfs command. Get all the options by running the hdfs dfs command on the console or refer to the documentation at http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/FileSystemShell.html. Most of the commands mirror the filesystem interaction commands that you'll find on any Linux system. For example, to copy a file on HDFS, use the following command:
    hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
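
As a further example, a typical first interaction with a fresh HDFS instance might look like the following (the /user/anand directory and the sample.txt file used here are only placeholders):

hdfs dfs -mkdir -p /user/anand
hdfs dfs -put sample.txt /user/anand/
hdfs dfs -ls /user/anand

This creates a home directory for the current user on HDFS, uploads a local file into it, and lists the directory contents.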
    

Now that HDFS is deployed, we will set up YARN next.

Setting up YARN

The following are the steps to set up YARN:

  1. Create the mapred-site.xml file from mapred-site.xml.template using the following command:
    cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
    
  2. Specify that we are using the YARN framework by adding the following property inside the <configuration> tag of the mapred-site.xml file located in the $HADOOP_CONF_DIR directory:
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  3. Configure the following properties in the yarn-site.xml file, also located in the $HADOOP_CONF_DIR directory. With the values below, the NodeManager offers 4,096 MB of memory to containers and each container is allocated at least 1,024 MB, so at most four containers can run concurrently on this node:
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    
    <property>
      <!-- Minimum amount of memory allocated for containers in MBs.-->
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
    </property>
    
    <property>
      <!--Total memory that can be allocated to containers in MBs. -->
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>4096</value>
    </property>
    
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
      <!-- This is the ratio of virtual memory to physical memory allowed when setting memory limits for containers. If you don't have enough RAM, increase this value. -->
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>8</value>
    </property>
  4. Start the YARN processes with the following command:
    start-yarn.sh
    

    The following information is displayed:

    starting yarn daemons
    starting resourcemanager, logging to /home/anand/opt/hadoop-2.2.0/logs/yarn-anand-resourcemanager-localhost.localdomain.out
    localhost: starting nodemanager, logging to /home/anand/opt/hadoop-2.2.0/logs/yarn-anand-nodemanager-localhost.localdomain.out
    
  5. Now, execute the jps command to see whether all the processes are running fine:
    jps
    

    We will get the following output:

    50275 NameNode
    50547 SecondaryNameNode
    50394 DataNode
    51091 Jps
    50813 NodeManager
    50716 ResourceManager
    

    Here, we can see that all the expected processes are running.

  6. Now, you can check the status of YARN using the ResourceManager Web UI by opening http://localhost:8088/cluster in your browser. You should see something similar to the following screenshot:

    The ResourceManager Web UI

  7. You can interact with YARN using the yarn command. Get all the options by running the yarn command on your console, or refer to the documentation at http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html. To get all the applications currently running on YARN, run the following command:
    yarn application -list
    

    The following information is displayed:

    14/04/02 11:41:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/04/02 11:41:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
                    Application-Id    Application-Name    Application-Type      User     Queue             State       Final-State       Progress                       Tracking-URL
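
To verify that YARN can actually schedule and run containers end to end, you can submit one of the example MapReduce jobs bundled with the distribution (the jar location below assumes the default layout of the Hadoop 2.2.0 tarball):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10

If the job completes and prints an estimated value of Pi, HDFS, the ResourceManager, and the NodeManager are working together; while it runs, the application will also appear in the ResourceManager Web UI and in the yarn application -list output. When you want to shut the single-node cluster down, run stop-yarn.sh followed by stop-dfs.sh.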
    

With this, we have completed the deployment of a Hadoop cluster on a single node. Next, we will see how to run Storm topologies on this cluster.
