We will consider Hadoop installation using two methods: using a TAR file on Ubuntu and using the .rpm
file on Red Hat distros. We will have a basic, running Hadoop cluster, and then we will set up an HBase cluster in detail. Execute the following steps:
We can download the Hadoop TAR file using the wget command, as follows:
wget <full link of the file, which you can get by clicking on the download link and copying the link address in a web browser>
Alternatively, we can open this link in a browser and download the file. Once the file is downloaded, we need to extract it and copy it to the location where we want to keep it.
We can find the configuration files that need to be changed in the hadoop-<version>
directory; specifically, they lie in hadoop-<version>/etc/hadoop
. We have two options for configuration: one in which we keep the files in the directory structure where they already exist and make our changes there, and another in which we copy the files to the /etc
directory of the filesystem and create a symbolic link in the current directory. For the initial configuration, we will keep the files where they are, but a sketch of the symbolic-link option follows.
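The following is a minimal sketch of the symbolic-link option, assuming Hadoop is extracted at ~/hbinaries/hadoop (the path used later in this chapter) and using /etc/hadoop-conf as a hypothetical target directory:
# Copy the shipped configuration directory to /etc (hadoop-conf is a hypothetical name)
sudo cp -r ~/hbinaries/hadoop/etc/hadoop /etc/hadoop-conf
# Replace the original directory with a symbolic link to the copy
mv ~/hbinaries/hadoop/etc/hadoop ~/hbinaries/hadoop/etc/hadoop.orig
ln -s /etc/hadoop-conf ~/hbinaries/hadoop/etc/hadoop
This way, the configuration lives outside the Hadoop directory and survives replacing the Hadoop directory itself.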
The left-hand side of the preceding screenshot shows the older directory structure of Hadoop Version 1 or previous versions, and the right-hand side shows the directory layout of Version 2 or above.
Let's briefly discuss some important directories from the preceding screenshot, which need to be considered while configuring and working:
bin: This directory contains Hadoop binary files
conf: This directory contains configuration files
lib: This directory contains JAR files, or Hadoop precompiled library files
sbin: This directory exists in the newer versions of Hadoop and contains scripts to start/stop clusters and scripts for some other administrative tasks
etc: This is the directory in the newer versions of Hadoop that contains the configuration files
For an older version of Hadoop, we need to change the configuration files inside the conf directory and start/stop the cluster using the files in the bin directory. In the newer versions of Hadoop, the configuration needs to be done in the etc/hadoop directory, and start/stop processes can be initiated using the scripts found in the sbin directory, as the following comparison shows.
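To illustrate the difference, the following shows where the start/stop scripts live in each layout (the script names are those shipped with the stock Apache distributions):
# Hadoop 1.x and earlier: scripts live in bin/
bin/start-all.sh
bin/stop-all.sh
# Hadoop 2.x and above: scripts live in sbin/
sbin/start-dfs.sh
sbin/start-yarn.sh
sbin/stop-dfs.sh
sbin/stop-yarn.sh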
Now, let's start building the Apache Hadoop cluster. The following are the steps to install and configure Apache Hadoop:
cd ~/Downloads
wget http://www.carfab.com/apachesoftware/hadoop/common/stable/hadoop-2.2.0.tar.gz
Refer to the following screenshot:
mv hadoop-2.2.0.tar.gz ~/hbinaries
cd ~/hbinaries
tar -xvzf hadoop-2.2.0.tar.gz
mv hadoop-2.2.0 hadoop
Have a look at the next screenshot:
The following are the important files in the etc/hadoop folder where the configuration files lie:
core-default.xml: This contains common default properties for the whole cluster
hdfs-default.xml: This contains HDFS-related default settings
yarn-default.xml: This contains the YARN framework-related default settings
mapred-default.xml: This contains MapReduce-related default settings
core-site.xml: The values given in this file override the core-default.xml settings
hdfs-site.xml: The values given in this file override the hdfs-default.xml settings
yarn-site.xml: The values given in this file override the yarn-default.xml settings
mapred-site.xml: The values given in this file override the mapred-default.xml settings
We also have some runtime files where we need to make changes, such as the hadoop-env.sh file, which runs as an initial script for the Hadoop start-up process. In this file, we set runtime variables, such as the Java path and heap memory settings, that the Java processes are executed with.
So, we can think of the *-default files as the files shipped with Hadoop by default, with preset parameters and values. If we need to change a parameter's value, we add it to the site-specific files such as *-site.xml.
These files are in the XML format; they start and end with a <configuration> tag, and all parameters lie between these two tags. Inside <configuration>, each parameter takes the following form:
<property>
  <name>{name of the parameter}</name>
  <value>{value of the parameter}</value>
  <description>{explanation of the parameter}</description>
</property>
The following is a sample configuration file:
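Here is a minimal sketch of such a file, using the NameNode address property that we will configure shortly (the hostname infinity is the example host used throughout this chapter):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://infinity:9000</value>
    <description>NameNode address</description>
  </property>
</configuration>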
Hadoop runs in three modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode. We will not discuss these in detail, only briefly, so that we can understand the modes of operation.
Configuring the pseudo-distributed and fully distributed modes is almost the same. Once configured as pseudo-distributed, we can distribute the configuration files to the other nodes, make some changes in them, and thus turn the setup into a fully distributed cluster.
Now, let's make the changes in the required files. For this, we can open the files in any text editor such as gedit or nano, or use command-line editors such as vi or vim.
Here, we will configure the newer version (V2) of Hadoop, which uses YARN as the MapReduce framework. For the configuration of older versions (V1 or lower) of Hadoop and HBase, follow these links to learn how to configure them step by step:
https://www.youtube.com/watch?v=Mmqv-CvSTaQ
http://helpmetocode.blogspot.com
Some configuration parameters are discussed in the following sections.
The core-site.xml
configuration file overrides the default values for the core Hadoop properties.
The following is the configuration for Hadoop newer versions:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://infinity:9000</value>
    <description>NameNode address</description>
  </property>
</configuration>
The following is the configuration for Hadoop older versions:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://infinity:9000</value>
    <description>NameNode address</description>
  </property>
</configuration>
The hostname might vary according to your machine name, and you can give any unused port number.
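If we are not sure whether a port is already in use, a quick check such as the following can help (9000 is the port chosen above; any unused port works):
netstat -an | grep 9000    # no output means nothing is listening on this port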
The hdfs-site.xml
file contains site-wide properties, such as various paths and tuning parameters, that enable or disable HDFS features. You will see the following code in this file:
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/disk1/namenodemetacopy1,/disk2/namenodemetacopy2</value>
    <description>This is the directory where the Hadoop metadata will be stored; it is always better to have more than one copy of the metadata directory for robustness</description>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <description>This is the block size of Hadoop (here, 128 MB), which depends on the data you are hosting</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/disk1/datadirectory1,/disk1/datadirectory2</value>
    <description>This parameter defines where the actual data will be stored on the DataNodes. We can give one or more directory paths separated by commas. There should be proper permissions on the directories we specify here.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>5</value>
    <description>This defines the replication factor, that is, how many copies of each block of a file will be kept on the cluster, and hence how reliably the data is stored if machines fail. For not-so-valuable or huge amounts of data, we can keep it lower; for costly data, we can keep it higher.</description>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>200</value>
    <description>This defines the number of NameNode handler threads; a higher value supports more DataNodes</description>
  </property>
</configuration>
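Note that the metadata and data directories must exist with the right ownership before HDFS starts. The following is a minimal sketch, assuming the Hadoop daemons run as a hypothetical user hduser in group hadoop:
# Create the NameNode metadata and DataNode data directories
sudo mkdir -p /disk1/namenodemetacopy1 /disk2/namenodemetacopy2
sudo mkdir -p /disk1/datadirectory1 /disk1/datadirectory2
# Give ownership to the user that runs the Hadoop daemons (hduser:hadoop is hypothetical)
sudo chown -R hduser:hadoop /disk1/namenodemetacopy1 /disk2/namenodemetacopy2 /disk1/datadirectory1 /disk1/datadirectory2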
The YARN framework-related configuration is stored in the yarn-site.xml
file. Have a look at the following code:
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>infinity:9002</value>
    <description>This is the host and port where the YARN ResourceManager will be hosted</description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>infinity:9003</value>
    <description>YARN scheduler address</description>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>infinity:9004</value>
    <description>Management interface of the ResourceManager</description>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>infinity:9005</value>
    <description>ResourceManager resource-tracker interface, which the NodeManagers report to</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Enables the MapReduce shuffle auxiliary service on the NodeManagers</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    <description>Specifies the shuffle handler class to be used</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1000</value>
    <description>The maximum memory, in MB, that the scheduler can allocate to a single container</description>
  </property>
</configuration>
The mapred-site.xml
file contains the MapReduce framework-related configurations. Have a look at the following code:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>The framework to be used for MapReduce</description>
  </property>
</configuration>
There are many other configuration parameters that can be found in the *-default.xml
files and added to the *-site.xml
files to override the default parameters. We will discuss more parameters and their optimal values in Chapter 6, HBase Cluster Maintenance and Troubleshooting, and Chapter 7, Scripting in HBase.
The hadoop-env.sh
file contains the environment and runtime variables, as follows:
export JAVA_HOME=<path of Java here>
export HADOOP_CONF_DIR=<hadoop conf directory path here>
export HADOOP_HEAPSIZE=<amount of memory available to JVM>
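For example, on an Ubuntu machine with OpenJDK 7 and the directory layout used earlier in this chapter, the values might look as follows (the paths are illustrative; adjust them for your machine):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64    # illustrative JVM path
export HADOOP_CONF_DIR=$HOME/hbinaries/hadoop/etc/hadoop
export HADOOP_HEAPSIZE=1000                           # in MB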
The yarn-env.sh
file contains the YARN framework-related runtime and environment configuration:
export JAVA_HOME=<path of Java here>
export YARN_CONF_DIR=<configuration directory of yarn/hadoop>
JAVA_HEAP_MAX=-Xmx1000m
The slaves
file contains the list of hosts where DataNode daemons will run. The following are the host names of DataNode machines:
datanode1
datanode2
datanode3
Add the nodes where you want to host DataNode services.
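When moving from pseudo-distributed to fully distributed mode, as described earlier, the edited configuration files must reach every node in this list. The following is a minimal sketch using scp, assuming passwordless SSH is set up and the same ~/hbinaries/hadoop layout exists on all machines:
for node in datanode1 datanode2 datanode3; do
  # Copy the edited configuration directory to each DataNode
  scp -r ~/hbinaries/hadoop/etc/hadoop/ $node:~/hbinaries/hadoop/etc/
done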
After all these changes are made, verify the paths and parameters in these files. Once verified, we can move forward and start the Hadoop services. Once all the Hadoop services start successfully, we can move on to configuring HBase.
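The following is a minimal sketch of starting the services and verifying them with the stock scripts (the NameNode must be formatted once, before the very first start only):
cd ~/hbinaries/hadoop
bin/hdfs namenode -format    # first start only: initializes the NameNode metadata
sbin/start-dfs.sh            # starts the NameNode, DataNodes, and SecondaryNameNode
sbin/start-yarn.sh           # starts the ResourceManager and NodeManagers
jps                          # lists the running Java daemons for verification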