In this section, we'll cover the installation of Hadoop and YARN and their configuration for a single-node cluster setup. We will consider Hadoop as two different components: one is the Hadoop Distributed File System (HDFS); the other is YARN. The YARN components take care of resource allocation and the scheduling of the jobs that run over the data stored in HDFS. We'll also cover most of the configurations that make YARN-based distributed computing more optimized and efficient.
In this chapter, we will cover the following topics:
Let's start with the steps for Hadoop's single-node installation, as it's easy to understand and set up. This way, we can quickly perform simple operations using Hadoop MapReduce and HDFS.
Here are the prerequisites for a Hadoop installation; make sure they are fulfilled before you start working with Hadoop and YARN.
GNU/Linux is supported as both a development and a production platform for Hadoop. The Windows platform is also supported, with some extra configuration. Here, we'll focus on Linux-based platforms, as Hadoop is more widely used on them and runs more efficiently on Linux than on Windows. The following are the steps for a single-node Hadoop installation on Linux systems. If you want to install it on Windows, refer to the Hadoop wiki page for the installation steps.
The following software must be installed before installing Hadoop.
Java must be installed. Confirm whether the Java version is compatible with the Hadoop version that is to be installed by checking the Hadoop wiki page (http://wiki.apache.org/hadoop/HadoopJavaVersions).
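As a quick sanity check (a sketch; the messages are illustrative), you can confirm that a JDK is on the PATH before proceeding:

```shell
# Check for Java before installing Hadoop.
# Note: `java -version` prints to stderr, hence the 2>&1 redirect.
if command -v java >/dev/null 2>&1; then
  java_status="found: $(java -version 2>&1 | head -n 1)"
else
  java_status="missing: install a JDK supported by your Hadoop release"
fi
echo "$java_status"
```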
SSH and SSHD must be installed and running, as they are used by Hadoop scripts to manage remote Hadoop daemons.
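The usual way to satisfy this on a single node is passwordless SSH to localhost. The sketch below assumes the OpenSSH client tools are installed and skips gracefully when they are missing:

```shell
# Passwordless SSH to localhost (sketch): Hadoop's start scripts use ssh
# to launch daemons, so the local account must accept its own key.
if command -v ssh-keygen >/dev/null 2>&1; then
  mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
  # Generate a key only if one does not already exist.
  [ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -q -t rsa -N "" -f "$HOME/.ssh/id_rsa"
  cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
  chmod 600 "$HOME/.ssh/authorized_keys"
  ssh_setup="done"   # verify afterwards with: ssh localhost
else
  ssh_setup="skipped: OpenSSH client tools not installed"
fi
echo "$ssh_setup"
```

After this, `ssh localhost` should log in without a password prompt, which is what the Hadoop start scripts require.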
Now, download the recent stable release of the Hadoop distribution from Apache mirrors and archives using the following command:
$$ wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Note that at the time of writing this book, Hadoop 2.6.0 is the most recent stable release. Now use the following commands:
$$ mkdir -p /opt/yarn
$$ cd /opt/yarn
$$ tar xvzf /root/hadoop-2.6.0.tar.gz
The distribution is now unpacked under the /opt/yarn directory. Next, change the Hadoop environment parameters as per the following configurations.
Set the JAVA_HOME environment variable to the root of the Java installation done earlier:
$$ export JAVA_HOME=/usr/java/latest
Set the Hadoop home to the Hadoop installation directory:
$$ export HADOOP_HOME=/opt/yarn/hadoop-2.6.0
Try running the hadoop command. It should display the command's usage text; this indicates a successful Hadoop configuration.
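A simple way to verify the setup (a sketch; the install path comes from the extraction step above) is to check that the bin/hadoop launcher exists before invoking it:

```shell
# Sanity check: HADOOP_HOME should contain the bin/hadoop launcher.
HADOOP_HOME=${HADOOP_HOME:-/opt/yarn/hadoop-2.6.0}
if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
  hadoop_check="ok: $HADOOP_HOME/bin/hadoop found"
else
  hadoop_check="error: bin/hadoop not found under $HADOOP_HOME"
fi
echo "$hadoop_check"
```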
Now, our Hadoop single-node setup is ready to run in the following modes.
By default, Hadoop runs in standalone mode as a single Java process. This mode is useful for development and debugging.
Hadoop can also run on a single node in pseudo-distributed mode, where each daemon runs as a separate Java process. To run Hadoop in pseudo-distributed mode, follow these configuration instructions. First, edit the $HADOOP_HOME/etc/hadoop/core-site.xml file.
This configuration sets up the NameNode to run on localhost, port 9000. Set the following property for the NameNode:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Next, edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml.
By setting the following property, we set the replication factor of each data block to 1, as a single-node setup has only one DataNode (by default, the replication factor is 3):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Then, format the Hadoop filesystem using this command:
$$ $HADOOP_HOME/bin/hdfs namenode -format
After formatting the filesystem, start the NameNode and DataNode daemons using the next command. By default, you can find the logs under the $HADOOP_HOME/logs directory:
$$ $HADOOP_HOME/sbin/start-dfs.sh
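To confirm that the daemons came up, `jps` (shipped with the JDK) should list NameNode, DataNode, and SecondaryNameNode processes. This sketch degrades gracefully when the JDK is absent:

```shell
# List running Java processes; after start-dfs.sh you should see
# NameNode, DataNode, and SecondaryNameNode entries.
if command -v jps >/dev/null 2>&1; then
  jps_output=$(jps)
else
  jps_output="jps not available: install a JDK first"
fi
echo "$jps_output"
```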
Now, we can see the NameNode UI on the web interface. Open http://localhost:50070/ in your browser.
Create the HDFS directories that are required to run MapReduce jobs:
$$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user
$$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user/{username}
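You can then list the new directories to confirm they exist (a sketch; it requires the HDFS daemons started above and skips when the install is missing):

```shell
# List the HDFS user directories created above.
HADOOP_HOME=${HADOOP_HOME:-/opt/yarn/hadoop-2.6.0}
if [ -x "$HADOOP_HOME/bin/hdfs" ]; then
  "$HADOOP_HOME/bin/hdfs" dfs -ls /user
  ls_check="ran"
else
  ls_check="skipped: hdfs not found under $HADOOP_HOME"
fi
echo "$ls_check"
```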
To run a MapReduce job on YARN in pseudo-distributed mode, you need to start the ResourceManager and NodeManager daemons. First, edit $HADOOP_HOME/etc/hadoop/mapred-site.xml to set YARN as the MapReduce framework:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Then, edit $HADOOP_HOME/etc/hadoop/yarn-site.xml to configure the shuffle auxiliary service:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Now, start the ResourceManager and NodeManager daemons by issuing this command:
$$ $HADOOP_HOME/sbin/start-yarn.sh
By simply navigating to http://localhost:8088/ in your browser, you can see the web interface for the ResourceManager. From here, you can monitor the submitted applications and their status.
To stop the YARN daemons, you need to run the following command:
$$ $HADOOP_HOME/sbin/stop-yarn.sh
This is how we can configure Hadoop and YARN in a single node in standalone and pseudo-distributed modes. Moving forward, we will focus on fully-distributed mode. As the basic configuration remains the same, we only need to do some extra configuration for fully-distributed mode. Single-node setup is mainly used for development and debugging of distributed applications, while fully-distributed mode is used for the production setup.