Now that we have seen both the storage and processing parts of a Hadoop cluster, let's get started with the installation of Hadoop. We will use Hadoop 2.2.0 in this chapter.
We will be setting up a cluster on a single node. Before starting, please make sure that you have Java (JDK 1.7 is used in this chapter) and SSH installed on your system.
If you don't have ssh-keygen, install it with the following command:
yum install openssh-clients
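As a quick sanity check before proceeding (an optional verification, not part of the original steps), confirm that a JDK and the SSH tools are available; the exact version string will depend on your system:
java -version          # should report a 1.7.x JDK
which ssh ssh-keygen   # both binaries should be found on the PATH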
Next, we will need to set up password-less SSH on this machine as it is required for Hadoop.
In a Hadoop cluster, a command executed on one of the machines often needs to execute further commands on other nodes in the cluster. For example, when starting HDFS, a DataNode is started on each of the machines. This is done automatically by the scripts provided with your Hadoop distribution. Password-less SSH between all the machines in a Hadoop cluster is a mandatory requirement for these scripts to run without any user intervention. The following are the steps for setting up password-less SSH:
1. Generate an SSH key pair by executing the following command:
ssh-keygen -t rsa -P ''
The following information is displayed:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/anand/.ssh/id_rsa):
Your identification has been saved in /home/anand/.ssh/id_rsa.
Your public key has been saved in /home/anand/.ssh/id_rsa.pub.
The key fingerprint is:
b7:06:2d:76:ed:df:f9:1d:7e:5f:ed:88:93:54:0f:[email protected]
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|  E .            |
|   o             |
|  . . o          |
|  S + ..o        |
|  . = o. o       |
|  o...  .o       |
|  . oo.+*        |
|    ..ooX        |
+-----------------+
2. Copy the newly generated public key to the list of authorized keys using the following command:
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
3. Check the password-less SSH to localhost using the following command:
ssh localhost
The following output is displayed:
Last login: Wed Apr 2 09:12:17 2014 from localhost
Since we are able to SSH into localhost without being prompted for a password, our setup is working, and we can proceed with the Hadoop setup.
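If SSH still prompts for a password at this point, the most common cause is overly permissive file modes on the .ssh directory or the authorized_keys file. The following commands are a general SSH fix, not something specific to Hadoop:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys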
The following are the steps to set up Hadoop:
1. Download the Hadoop 2.2.0 distribution and extract it. The extracted hadoop-2.2.0 directory will be referred to as $HADOOP_HOME:
tar xzf hadoop-2.2.0.tar.gz
cd hadoop-2.2.0
2. Add the following environment variables to your ~/.bashrc file. Please make sure that you provide the paths for Java and Hadoop as per your system:
export JAVA_HOME=/usr/java/jdk1.7.0_45
export HADOOP_HOME=/home/anand/opt/hadoop-2.2.0
export HADOOP_COMMON_HOME=/home/anand/opt/hadoop-2.2.0
export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
export HADOOP_YARN_HOME=$HADOOP_COMMON_HOME
export HADOOP_CONF_DIR=$HADOOP_COMMON_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_COMMON_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_COMMON_HOME/lib"
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_COMMON_HOME/bin:$HADOOP_COMMON_HOME/sbin
3. Refresh the ~/.bashrc file with the following command:
source ~/.bashrc
4. Check your Hadoop installation by running the following command:
hadoop version
The following information is displayed:
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /home/anand/opt/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
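As an additional, optional check that the environment variables from ~/.bashrc are in effect, you can print them and confirm that the hadoop binary on the PATH is the one we just extracted; the paths shown in the comments assume the locations used earlier:
echo $HADOOP_HOME        # /home/anand/opt/hadoop-2.2.0
echo $HADOOP_CONF_DIR    # /home/anand/opt/hadoop-2.2.0/etc/hadoop
which hadoop             # /home/anand/opt/hadoop-2.2.0/bin/hadoop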
With the preceding steps, Hadoop is installed and the paths are properly set. Now we will set up HDFS on our system.
Please perform the following steps to set up HDFS:
1. Create the directories that will hold the NameNode and DataNode data using the following commands:
mkdir -p ~/mydata/hdfs/namenode
mkdir -p ~/mydata/hdfs/datanode
2. Edit the core-site.xml file in the $HADOOP_CONF_DIR directory by adding the following property inside the <configuration> tag:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:19000</value>
  <!-- The default port for HDFS is 9000, but we are using 19000 as Storm-Yarn uses port 9000 for its application master -->
</property>
3. Edit the hdfs-site.xml file in the $HADOOP_CONF_DIR directory by adding the following properties inside the <configuration> tag:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <!-- Since we have only one node, we use a replication factor of 1 -->
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/anand/mydata/hdfs/namenode</value>
  <!-- Specify the absolute path of the namenode directory created earlier -->
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/anand/mydata/hdfs/datanode</value>
  <!-- Specify the absolute path of the datanode directory created earlier -->
</property>
4. Format the NameNode by running the following command:
hdfs namenode -format
The following output is displayed:
14/04/02 09:03:06 INFO namenode.NameNode: STARTUP_MSG:
/*********************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.2.0
…
…
14/04/02 09:03:08 INFO namenode.NameNode: SHUTDOWN_MSG:
/*********************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
********************************************************/
5. Start the HDFS daemons with the following command:
start-dfs.sh
The following information is displayed:
14/04/02 09:27:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/anand/opt/hadoop-2.2.0/logs/hadoop-anand-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/anand/opt/hadoop-2.2.0/logs/hadoop-anand-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/anand/opt/hadoop-2.2.0/logs/hadoop-anand-secondarynamenode-localhost.localdomain.out
14/04/02 09:27:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
6. Run the jps command to see whether all the processes are running fine:
jps
We will get the following output:
50275 NameNode
50547 SecondaryNameNode
50394 DataNode
51091 Jps
Here, we can see that all the expected processes are running.
7. You can check the HDFS web UI by opening http://localhost:50070 in your browser. You should see something similar to the following screenshot:
8. You can interact with HDFS using the hdfs dfs command. Get all the options by running the hdfs dfs command on the console, or refer to the documentation at http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/FileSystemShell.html. Most of the commands mirror the filesystem interaction commands that you'll find on any Linux system. For example, to copy a file on HDFS, use the following command:
hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
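A few more hdfs dfs commands are useful for verifying that the filesystem is usable end to end. The file and directory names below are only illustrative, not part of the original setup:
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -mkdir -p /user/anand            # create a directory on HDFS
hdfs dfs -put /tmp/hello.txt /user/anand  # copy a local file into HDFS
hdfs dfs -ls /user/anand                  # list the directory contents
hdfs dfs -cat /user/anand/hello.txt       # print the file stored on HDFS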
Now that HDFS is deployed, we will set up YARN next.
The following are the steps to set up YARN:
1. Create the mapred-site.xml file from the mapred-site.xml.template file using the following command:
cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
2. Add the following property to the mapred-site.xml file located in the $HADOOP_CONF_DIR directory, inside the <configuration> tag:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
3. Add the following properties to the yarn-site.xml file, inside the <configuration> tag:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <!-- Minimum amount of memory allocated to a container, in MB -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <!-- Total memory that can be allocated to containers on this node, in MB -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <!-- The ratio of virtual memory to physical memory used when setting memory limits for containers. If you don't have enough RAM, increase this value. -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>8</value>
</property>
4. Start the YARN daemons with the following command:
start-yarn.sh
The following information is displayed:
starting yarn daemons
starting resourcemanager, logging to /home/anand/opt/hadoop-2.2.0/logs/yarn-anand-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /home/anand/opt/hadoop-2.2.0/logs/yarn-anand-nodemanager-localhost.localdomain.out
5. Run the jps command again to see whether all the processes are running fine:
jps
We will get the following output:
50275 NameNode
50547 SecondaryNameNode
50394 DataNode
51091 Jps
50813 NodeManager
50716 ResourceManager
Here, we can see that all the expected processes are running.
6. You can check the YARN web UI by opening http://localhost:8088/cluster in your browser. You should see something similar to the following screenshot:
7. You can interact with YARN using the yarn command. Get all the options by running the yarn command on your console, or refer to the documentation at http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html. To get all the applications currently running on YARN, run the following command:
yarn application -list
The following information is displayed:
14/04/02 11:41:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/02 11:41:42 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
Application-Id    Application-Name    Application-Type    User    Queue    State    Final-State    Progress    Tracking-URL
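As an optional end-to-end check (not one of the original steps), you can submit the example MapReduce job that ships with Hadoop and watch it appear in the yarn application -list output and in the web UI. The jar path assumes the $HADOOP_HOME location used earlier in this chapter:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10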
With this, we have completed the deployment of a Hadoop cluster on a single node. Next, we will see how to run Storm topologies on this cluster.