Chapter 3. YARN Installation

In this chapter, we'll cover the installation of Hadoop and YARN and their configuration for single-node and fully-distributed cluster setups. We will treat Hadoop as two different components: one is the Hadoop Distributed File System (HDFS); the other is YARN. The YARN components take care of resource allocation and the scheduling of jobs that run over the data stored in HDFS. We'll also cover most of the configurations that make YARN's distributed computing more optimized and efficient.

In this chapter, we will cover the following topics:

  • Hadoop and YARN single-node installation
  • Hadoop and YARN fully-distributed mode installation
  • Operating Hadoop and YARN clusters

Single-node installation

Let's start with the steps for Hadoop's single-node installation, as it's easy to understand and set up. This way, we can quickly perform simple operations using Hadoop MapReduce and HDFS.

Prerequisites

Here are the prerequisites for a Hadoop installation; make sure that they are fulfilled before you start working with Hadoop and YARN.

Platform

GNU/Linux is supported as both a development and a production platform for Hadoop. The Windows platform is also supported, with some extra configuration. Here, we'll focus on Linux-based platforms, as Hadoop is more widely used and runs more efficiently on Linux than on Windows. The following steps describe a single-node Hadoop installation on a Linux system. If you want to install it on Windows, refer to the Hadoop wiki page for the installation steps.

Software

Make sure that the following software is installed before installing Hadoop.

Java must be installed. Confirm whether the Java version is compatible with the Hadoop version that is to be installed by checking the Hadoop wiki page (http://wiki.apache.org/hadoop/HadoopJavaVersions).

SSH must be installed and the sshd daemon must be running, as the Hadoop scripts use them to manage remote Hadoop daemons.
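In addition, the Hadoop scripts expect to be able to SSH to the local host without a password prompt. If passwordless SSH is not already set up, the following sketch, based on standard OpenSSH tooling (the key type and file paths are the usual defaults; adjust them if your setup differs), generates a key pair and authorizes it for the local account:

```shell
# Create the .ssh directory if it does not exist yet.
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
# Generate an RSA key pair with an empty passphrase
# (skip this step if you already have one).
ssh-keygen -t rsa -P '' -f "$HOME/.ssh/id_rsa"
# Authorize the new public key for logins to this account.
cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
chmod 0600 "$HOME/.ssh/authorized_keys"
```

You can verify the setup with `ssh localhost`; it should log you in without prompting for a password.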

Now, download the recent stable release of the Hadoop distribution from Apache mirrors and archives using the following command:

$$ wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

Note that at the time of writing this book, Hadoop 2.6.0 is the most recent stable release. Now use the following commands:

$$ mkdir -p /opt/yarn
$$ cd /opt/yarn
$$ tar xvzf /root/hadoop-2.6.0.tar.gz

Starting with the installation

The downloaded distribution is now extracted under the /opt/yarn directory. Next, set the Hadoop environment parameters as per the following configurations.

Set the JAVA_HOME environment variable to the root of the Java installation (adjust the path to match your system):

$$ export JAVA_HOME=/usr/java/latest

Set the Hadoop home to the Hadoop installation directory:

$$ export HADOOP_HOME=/opt/yarn/hadoop-2.6.0

Try running the $HADOOP_HOME/bin/hadoop command. It should display the Hadoop usage documentation; this indicates a successful Hadoop configuration.
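To make these settings survive new shell sessions, and to run the Hadoop scripts without typing full paths, you can add the exports to your shell profile. The paths below assume the layout used in the earlier steps (Java under /usr/java/latest and Hadoop extracted to /opt/yarn/hadoop-2.6.0); adjust them to your system:

```shell
# Assumed locations from the steps above; adjust to your installation.
export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/opt/yarn/hadoop-2.6.0
# Make the hadoop, hdfs, and yarn commands available without full paths.
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Adding these lines to ~/.bashrc (or the equivalent startup file for your shell) applies them to every new session.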

Now, our Hadoop single-node setup is ready to run in the following modes.

The standalone mode (local mode)

By default, Hadoop is configured to run in standalone mode, as a single Java process that works directly on the local filesystem. This mode is useful for development and debugging.
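As a quick smoke test of standalone mode, you can run one of the example jobs bundled with the distribution against the local filesystem. The following sketch follows the standard Hadoop single-node guide; it assumes HADOOP_HOME is set as above and uses the shipped configuration files as sample input:

```shell
# Work inside the Hadoop installation directory.
cd $HADOOP_HOME
# Use the shipped configuration files as sample input.
mkdir -p input
cp etc/hadoop/*.xml input
# Run the bundled grep example: it extracts every string
# matching the given regular expression from the input files.
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
    grep input output 'dfs[a-z.]+'
# The matches are written to the local output directory.
cat output/part-r-00000
```

Because this runs entirely against the local filesystem, no daemons need to be started first.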

The pseudo-distributed mode

Hadoop can also run on a single node in pseudo-distributed mode, where each daemon runs as a separate Java process. To run Hadoop in pseudo-distributed mode, follow these configuration instructions. First, edit $HADOOP_HOME/etc/hadoop/core-site.xml.

This configuration sets up the NameNode to run on localhost port 9000. Set the following property for the NameNode:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Now, edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml.

By setting the following property, we reduce the replication factor of each data block to 1, since a single-node setup has only one DataNode (by default, the replication factor is 3):

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Then, format the Hadoop filesystem using this command:

$$ $HADOOP_HOME/bin/hdfs namenode -format

After formatting the filesystem, start the NameNode and DataNode daemons using the next command. By default, the logs are written under the $HADOOP_HOME/logs directory:

$$ $HADOOP_HOME/sbin/start-dfs.sh

Now, we can see the NameNode UI on the web interface. Hit http://localhost:50070/ in the browser.

Create the HDFS directories that are required to run MapReduce jobs:

$$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user
$$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user/{username}

To run a MapReduce job on YARN in pseudo-distributed mode, you need to start the ResourceManager and NodeManager daemons. First, edit $HADOOP_HOME/etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Then, edit $HADOOP_HOME/etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Now, start the ResourceManager and NodeManager daemons by issuing this command:

$$ $HADOOP_HOME/sbin/start-yarn.sh

By simply navigating to http://localhost:8088/ in your browser, you can see the web interface for the ResourceManager. From here, you can monitor the submitted applications, their status, and the cluster's resource usage.

To stop the YARN daemons, you need to run the following command:

$$ $HADOOP_HOME/sbin/stop-yarn.sh

This is how we can configure Hadoop and YARN on a single node in standalone and pseudo-distributed modes. Moving forward, we will focus on fully-distributed mode. As the basic configuration remains the same, we only need to do some extra configuration for fully-distributed mode. A single-node setup is mainly used for the development and debugging of distributed applications, while fully-distributed mode is used for production setups.
