Installing and configuring Hadoop

We will consider Hadoop installation using two methods: using a TAR file on Ubuntu and using the .rpm file on Red Hat distros. We will have a basic, running Hadoop cluster, and then we will set up an HBase cluster in detail. Execute the following steps:

  1. Download Hadoop v2, the stable version of which is available at http://www.webhostingjams.com/mirror/. If you want a version other than the one we use here, you can always visit http://www.webhostingjams.com/mirror/apache/hadoop/common/ and download the version that you want to install.
  2. We can download Hadoop using the wget command, as follows:
    wget <full link of the file, which you get by right-clicking the download link and copying the link address in your web browser>
    

    Alternatively, we can open this link in a browser and download it. Once this file is downloaded, we need to extract it and copy it to the location where we want to keep it.

    The configuration files that need to be changed can be found in the hadoop-<version>/etc/hadoop directory. We have two options for configuration: we can either keep the files in the directory structure they ship in and edit them in place, or copy them to the /etc directory of the filesystem and create a symbolic link in the original location, as in the sketch that follows. For a start-up configuration, we will keep the files where they are and begin configuring.
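
    A minimal sketch of the symbolic-link option (the paths are illustrative and use the same hadoop-<version> notation as above):

    # move the shipped configuration directory to /etc and link it back
    sudo mv hadoop-<version>/etc/hadoop /etc/hadoop
    ln -s /etc/hadoop hadoop-<version>/etc/hadoop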

  3. Once extracted, we have a folder structure as shown in the next screenshot:
    [Screenshot: Hadoop directory structure, Version 1 (left) versus Version 2 (right)]

The left-hand side of the preceding screenshot shows the directory structure of Hadoop Version 1 and earlier, and the right-hand side shows the directory layout of Version 2 and above.

Let's briefly discuss some important directories from the preceding screenshot, which we need to keep in mind while configuring and working with Hadoop:

  • bin: This directory contains Hadoop binary files
  • conf: This directory contains configuration files
  • lib: This directory contains JAR files or Hadoop precompiled library files
  • sbin: This directory exists in the newer versions of Hadoop and contains scripts to start/stop clusters and scripts for some other administrative tasks
  • etc: This is the directory in the newer versions of Hadoop containing configuration files

Note

For older versions of Hadoop, we need to change the configuration files inside the conf directory and start/stop the cluster using the scripts in the bin directory. In newer versions of Hadoop, the configuration is done in the etc/hadoop directory, and the start/stop processes can be initiated using the scripts found in the sbin directory, as in the example that follows.
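
For example, assuming the current directory is the Hadoop installation directory, the usual invocations are:

bin/start-all.sh     # Hadoop 1.x: start all daemons
bin/stop-all.sh      # Hadoop 1.x: stop all daemons

sbin/start-dfs.sh    # Hadoop 2.x: start the HDFS daemons
sbin/start-yarn.sh   # Hadoop 2.x: start the YARN daemons
sbin/stop-yarn.sh    # Hadoop 2.x: stop the YARN daemons
sbin/stop-dfs.sh     # Hadoop 2.x: stop the HDFS daemons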

Now, let's start building the Apache Hadoop cluster. The following are the steps to install and configure Apache Hadoop:

  1. To download Hadoop, change the directory to the desired location as follows:
    cd ~/Downloads
    
  2. Get Hadoop using the following command:
    wget http://www.carfab.com/apachesoftware/hadoop/common/stable/hadoop-2.2.0.tar.gz
    

    Refer to the following screenshot:

    [Screenshot: downloading the Hadoop archive with wget]
  3. Once this file is downloaded, let's copy it to some location and extract it:
    mv hadoop-2.2.0.tar.gz ~/hbinaries
    cd ~/hbinaries
    tar -xvzf hadoop-2.2.0.tar.gz
    mv hadoop-2.2.0 hadoop
    

    Have a look at the next screenshot:

    [Screenshot: moving and extracting the Hadoop archive]
  4. Once the file is downloaded, extracted, and moved to the desired location, we can start configuring Hadoop. When we list the contents of the extracted directory, we find the following structure; we need to go to the etc folder, where the configuration files lie:
    [Screenshot: the extracted Hadoop directory listing, showing the etc folder]
  5. The following two categories of files are found in etc/hadoop:
    • Default: These are the default files with default values; we need not touch or edit these files:

      core-default.xml: This contains common default properties for the whole cluster

      hdfs-default.xml: This contains HDFS-related default settings

      yarn-default.xml: This contains the YARN framework-related default settings

      mapred-default.xml: This contains MapReduce-related default settings

    • Site-specific files: These are the files where we make changes, and the values in these files will override the values in default files:

      core-site.xml: The values given in this file override the core-default.xml settings

      hdfs-site.xml: The values given in this file override the hdfs-default.xml settings

      yarn-site.xml: The values given in this file override the yarn-default.xml settings

      mapred-site.xml: The values given in this file override the mapred-default.xml settings

We also have some runtime files where we need to make changes, such as hadoop-env.sh, which runs as an initial script during the Hadoop start-up process. In this file, we set runtime variables such as the Java path and heap memory settings that are used when the Java processes are launched.

So, we can think of the *-default files as the files that are shipped with Hadoop by default, with preset parameters and values. If we need to change a parameter value, we add it to the site-specific files such as *-site.xml.

These files are in the XML format; they start and end with a <configuration> tag, and all parameters lie between the two tags. Inside <configuration>, each parameter is written as <property><name>{name of parameter}</name><value>{value of the parameter}</value><description>{explanation of the parameter}</description></property>.

The following is a sample configuration file:

[Screenshot: a sample configuration file]
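
For reference, a configuration file with a single property follows this layout (the property shown here is the NameNode address that we will set in core-site.xml later in this section):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://infinity:9000</value>
    <description>Namenode address</description>
  </property>
</configuration>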

Hadoop runs in three modes. We will not discuss these in detail, only briefly so that we can understand the mode of operations:

  • Standalone mode: This is the default mode; Hadoop can run in it without any configuration changes, just as it is downloaded, extracted, and started, so it has minimal configuration requirements. It does not use a distributed file system but the local file system, and it does not start any Hadoop daemon processes. This mode is not suitable for real workloads; it is only for trying Hadoop out.
  • Pseudo-distributed mode: This can be seen as a cluster on a single machine. In this mode, Hadoop virtually runs as a cluster, but not on many machines. It uses the distributed file system and runs all the Hadoop daemons, each in its own Java process, but on a single machine. This mode can be used to visualize a Hadoop cluster, for testing purposes, code testing, or as a test environment.
  • Fully distributed mode: In this mode, the cluster is spread over many nodes as master and slave nodes. The Hadoop daemons run on separate machines, which is best for a production environment; each machine has its own JVM and runs its own daemons (such as NameNode or DataNode).

Configuring the pseudo-distributed and fully distributed modes is almost the same. Once configured as pseudo-distributed, we can distribute the configuration files to the other nodes, with a few changes, and turn the setup into a fully distributed cluster, as in the sketch that follows.
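
A minimal sketch of pushing the site files to the other nodes, assuming passwordless SSH is set up and the same directory layout exists on every node (the hostnames are just examples and match the slaves file shown later):

for node in datanode1 datanode2 datanode3; do
  scp ~/hbinaries/hadoop/etc/hadoop/*-site.xml ${node}:~/hbinaries/hadoop/etc/hadoop/
done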

Now, let's make the changes in the required files. For this, we can open these files in any text editor such as gedit or nano, or use command-line editors such as vi or vim.

Note

Make the changes and set the values according to your own machine names; here, they are listed as per the current machine's configuration.

Here, we will configure the newer version (v2) of Hadoop, which uses YARN as its MapReduce framework. For the configuration of older versions (v1 or lower) of Hadoop and HBase, follow these links to learn how to configure them step by step:

https://www.youtube.com/watch?v=Mmqv-CvSTaQ

http://helpmetocode.blogspot.com

Some configuration parameters are mentioned in the following sections.

core-site.xml

The core-site.xml configuration file overrides the default values for the core Hadoop properties.

The following is the configuration for Hadoop newer versions:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://infinity:9000</value>
    <description>Namenode address</description>
  </property>
</configuration>

The following is the configuration for Hadoop older versions:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://infinity:9000</value>
    <description>Namenode address</description>
  </property>
</configuration>

The hostname might vary according to your machine name, and you can give any unused port number.
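
Once Hadoop is set up, you can check which value is actually being picked up (assuming the Hadoop bin directory is on your PATH):

hdfs getconf -confKey fs.defaultFS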

hdfs-site.xml

The hdfs-site.xml file contains the site-wide properties that enable/disable features and set various paths and tuning parameters. You will see the following code in this file:

<configuration>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/disk1/namenodemetacopy1,/disk2/namenodemetacopy2</value>
    <description>This is the directory where Hadoop metadata will be stored; it is always better to have more than one copy of the metadata directory for robustness</description>
  </property>

  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <description>This is the block size of Hadoop in bytes (134217728 is 128 MB); choose it according to the data you are hosting</description>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/disk1/datadirectory1,/disk1/datadirectory2</value>
    <description>This parameter defines where the actual data will be stored on the DataNodes. We can give one or more directory paths separated by commas. There should be proper permissions on the directories we specify here.</description>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>5</value>
    <description>This defines the replication factor, that is, how many copies of each block of a file will be kept on the cluster, which determines how reliably the data survives machine failures. For less valuable or very large data we can keep it lower, and for important data we can keep it higher.</description>
  </property>

  <property>
    <name>dfs.namenode.handler.count</name>
    <value>200</value>
    <description>Defines the number of NameNode server threads; a higher value is needed as the number of DataNodes grows.</description>
  </property>

</configuration>
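
Remember to create the directories referenced above and give them the right ownership before starting the cluster; a minimal sketch, assuming the daemons run as a user named hadoop:

# create the metadata and data directories and hand them to the hadoop user
sudo mkdir -p /disk1/namenodemetacopy1 /disk2/namenodemetacopy2 /disk1/datadirectory1 /disk1/datadirectory2
sudo chown -R hadoop:hadoop /disk1 /disk2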

yarn-site.xml

The YARN framework-related configuration is stored in the yarn-site.xml file. Have a look at the following code:

<configuration>

  <property>
    <name>yarn.resourcemanager.address</name>
    <value>infinity:9002</value>
    <description>This is the host and port where the YARN ResourceManager will run</description>
  </property>

  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>infinity:9003</value>
    <description>Yarn scheduler address</description>
  </property>

  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>infinity:9004</value>
    <description>Management interface of resource manager</description>
  </property>

  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>infinity:9005</value>
    <description>resource manager resource tracker interface</description>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    <description>The shuffle handler implementation used by the MapReduce auxiliary service</description>
  </property>

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1000</value>
    <description>Maximum memory, in MB, that the scheduler can allocate to a single container</description>
  </property>

</configuration>
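
Note that the shuffle handler class above only takes effect if the corresponding auxiliary service is also enabled on the NodeManagers; in Hadoop 2.2 and later, this is typically done with the following property:

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Enables the MapReduce shuffle auxiliary service on the NodeManagers</description>
  </property>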

mapred-site.xml

The mapred-site.xml file contains the MapReduce framework-related configurations. Have a look at the following code:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>Framework to be used for map reduce</description>
  </property>
</configuration>
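
If mapred-site.xml is not present in your etc/hadoop directory (in the Hadoop 2.x tarballs it is usually shipped only as a template), it can be created from the template file:

cd ~/hbinaries/hadoop/etc/hadoop
cp mapred-site.xml.template mapred-site.xml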

There are many other configuration parameters that can be found in the *-default.xml files and added to the *-site.xml files to override the default parameters. We will discuss more parameters and their optimal values in Chapter 6, HBase Cluster Maintenance and Troubleshooting, and Chapter 7, Scripting in HBase.

hadoop-env.sh

The hadoop-env.sh file contains the environment and runtime variables, as follows:

export JAVA_HOME=<path of Java here>
export HADOOP_CONF_DIR=<hadoopconf directory path here>
export HADOOP_HEAPSIZE=<amount of memory available to jvm>
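
For example, on a typical Ubuntu machine the values might look like the following (the paths are illustrative and will differ on your system):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_CONF_DIR=/home/hadoop/hbinaries/hadoop/etc/hadoop
export HADOOP_HEAPSIZE=1024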

yarn-env.sh

The yarn-env.sh file contains the YARN framework-related runtime and environment configuration:

export JAVA_HOME=<path of Java here>
export YARN_CONF_DIR=<configuration conf directory of yarn/hadoop>
JAVA_HEAP_MAX=-Xmx1000m

Slaves file

The slaves file contains the list of hosts where DataNode daemons will run. The following are the host names of DataNode machines:

datanode1
datanode2
datanode3

Add the nodes where you want to host DataNode services.

After all these changes are made, verify the paths and parameters in these files. Once verified, we can start the Hadoop services, as in the sketch below. Once all the Hadoop services start successfully, we can move on to configuring HBase.
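
A minimal sketch of bringing up a fresh cluster, assuming the layout used above (run the format step only once, on a brand-new cluster, because it erases any existing HDFS metadata):

cd ~/hbinaries/hadoop
bin/hdfs namenode -format     # only on a brand-new cluster
sbin/start-dfs.sh             # starts the NameNode and DataNodes
sbin/start-yarn.sh            # starts the ResourceManager and NodeManagers
jps                           # verify that the daemons are running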
