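As we have already set up Java, SSH, NTP, DNS, and Hadoop, let's configure HBase. HBase runs in two modes: standalone and distributed, where the distributed mode can be either pseudo-distributed or fully distributed.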
Configuring HBase in standalone mode is simple because few configuration changes are needed. We can just download the TAR file, extract it, and start the daemons. HBase configured in this mode does not use HDFS; it uses file:// paths to store data files on the local file system.
Download the HBase binary from http://apache.mirrors.pair.com/hbase/. Here, we will use the latest compatible version of HBase. Refer to the compatibility table to find out which HBase version to download for a given Hadoop version.
Once the download is complete, extract the TAR file and move it to the location where you want to configure it, such as the ~/hbinaries/hbase path:
tar -xvzf hbase-0.94.18.tar.gz    # extract the tar file
mv hbase-0.94.18 hbase            # rename to hbase
mv hbase ~/hbinaries/hbase        # move to the configuration location
cd ~/hbinaries/hbase              # go to the ~/hbinaries/hbase directory
The following is what the terminal window looks like:
Once extracted and moved to the desired location, HBase can be started as it is. It stores the tmp files, log files, and other files in the default locations defined in the *-default.xml file, mostly in /tmp or the current directory. This is why the standalone mode is not advisable: the contents of the /tmp directory are removed on every reboot, so everything is lost once the system is restarted.
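One way to keep standalone data across reboots is to point hbase.rootdir, and the data directory of the built-in ZooKeeper, at a location outside /tmp. The following is a minimal sketch under stated assumptions rather than part of the setup described here: DATA_DIR and OUT_DIR are illustrative names, and the script writes to a scratch directory so that the result can be reviewed before it is copied into HBase's conf directory.

```shell
#!/bin/sh
# Sketch: write a standalone hbase-site.xml that keeps data out of /tmp.
# DATA_DIR and OUT_DIR are illustrative names (assumptions, not from the text).
DATA_DIR="${DATA_DIR:-$HOME/hbase-data}"   # persistent location for HBase data
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"         # scratch directory for the generated file

cat > "$OUT_DIR/hbase-site.xml" <<EOF
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://$DATA_DIR/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$DATA_DIR/zookeeper</value>
  </property>
</configuration>
EOF

echo "Wrote $OUT_DIR/hbase-site.xml; review it before copying it into the conf directory."
```

Because DATA_DIR is an absolute path, the generated value takes the file:///path form that HBase expects for the local file system.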
This mode can be used to get familiar with the directory structure, start up HBase, and run some basic HBase commands. Now, let's configure HBase in the distributed mode.
We can first configure a pseudo-distributed mode and then distribute the configuration files to the nodes in the cluster on which we want to run RegionServers.
The following are the files that need to be changed or configured:
hbase-site.xml: This is an HBase site-specific file. We can add configurations related to the HBase root and log directories in this file. All default settings from the hbase-default.xml file can be overridden in this file.
hbase-env.sh: We can define runtime variables such as Java path, RegionServer-related options, and JVM settings in this file.
regionservers: This file contains all the node names where we want to host or run our RegionServer daemon.
Now, let's set the values and start configuring.
Let's see the configuration we need to add in the hbase-site.xml file:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://infinity:9000/hbase</value>
    <description>Here, we need to enter the Hadoop NameNode address followed by the name of the hbase directory where HBase files are to be stored.</description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>This parameter decides whether HBase will run in local mode or distributed mode.</description>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/mnt/disk1/tmp</value>
    <description>Using this parameter, we specify the tmp directory for HBase.</description>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>infinity</value>
    <description>Using this parameter, we can specify the addresses of the ZooKeeper host machines.</description>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2081</value>
    <description>Port at which clients can connect to ZooKeeper.</description>
  </property>
</configuration>
We can define runtime parameters in the hbase-env.sh file. This file will include the following lines:
export JAVA_HOME=<path of Java here>
In the preceding line, we mention the Java path.
export HBASE_HEAPSIZE=8000
In the preceding line, we set the heap memory size in MB.
export HBASE_MANAGES_ZK=true
The preceding value must be false in the fully distributed mode when ZooKeeper is installed as separate instances on different machines. For the pseudo-distributed mode, or whenever we want HBase to manage its built-in ZooKeeper, we keep it set to true; if we have a separate ZooKeeper, we set it to false.
In the regionservers file, we list all the servers on which we need to run RegionServers:
infinity
All set, now let's start HBase. This is in the pseudo-distributed mode, so Hadoop must run first. Start the Hadoop processes, and then we will start HBase:
bin/start-hbase.sh
In the preceding command, we run the script that will start all the required processes.
Alternatively, we can start HBase in the following ways:
bin/hbase-daemon.sh start zookeeper
bin/hbase-daemon.sh start master
bin/hbase-daemon.sh start regionserver
This will start HBase in the pseudo-distributed mode with all the processes running on the same server. If we need to make HBase fully distributed, we will have to add the addresses of all the DataNodes on which we want to run HBase RegionServers to the regionservers file.
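After starting the daemons, it is worth confirming that they are actually up. The following is a minimal sketch that inspects the output of jps (the JVM process listing tool shipped with the JDK) for the three process names HBase uses: HQuorumPeer for the HBase-managed ZooKeeper, HMaster, and HRegionServer. The helper name check_daemons is hypothetical and used only for illustration.

```shell
#!/bin/sh
# Sketch: report which of the expected HBase daemons appear in a jps listing.
# check_daemons is a hypothetical helper name, not part of HBase.
check_daemons() {
  listing="$1"   # text in the format printed by jps: "<pid> <main class>"
  missing=""
  for daemon in HQuorumPeer HMaster HRegionServer; do
    # -w requires a whole-word match, so partial class names do not count
    if ! printf '%s\n' "$listing" | grep -qw "$daemon"; then
      missing="$missing $daemon"
    fi
  done
  if [ -n "$missing" ]; then
    echo "not running:$missing"
    return 1
  fi
  echo "all HBase daemons are running"
}

# Query the JVM list only when jps is available on this machine.
if command -v jps >/dev/null 2>&1; then
  check_daemons "$(jps)" || true
fi
```

If a daemon is reported as not running, the corresponding log file under HBase's logs directory is usually the first place to look.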
There is a small difference between the pseudo-distributed and fully distributed modes. First, we will add a few settings to hbase-site.xml and add the hostnames of the DataNodes where we need to host our RegionServers to the regionservers file. In this setup, we will have separate instances of ZooKeeper (an odd number). We start a master server on the NameNode server, RegionServers on the different DataNodes, and the ZooKeeper processes.
There are two ways to start it. If passwordless SSH is configured and the master server is able to connect to all DataNodes, we can use start-hbase.sh, which will start HMaster on the NameNode and RegionServers on the listed DataNodes.
We must first start all the ZooKeeper instances, and then we can run the start-hbase.sh script. The other method is to use hbase-daemon.sh to start HMaster on the NameNode server and a RegionServer on each DataNode, running the script in the same way as we did when starting HBase in the pseudo-distributed mode.
A few settings that we need to add or change in hbase-site.xml are as follows:
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zkhost1:2181,zkhost2:2181,zkhost3:2181</value>
  <description>List of zookeeper instances</description>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/mnt/disk1/zookeeperData</value>
  <description>zk data directory path</description>
</property>
We also need to add the DataNode addresses to the regionservers file, one per line, as follows:
datanode1
datanode2
datanode3
Change the line export HBASE_MANAGES_ZK=true to export HBASE_MANAGES_ZK=false in the hbase-env.sh file. This instructs HBase not to use the built-in ZooKeeper, as separate ZooKeeper instances exist. In the preceding hbase-site.xml configuration, the first parameter tells HBase where the ZooKeeper instances are running, and the second parameter defines the ZooKeeper data directory.
After making these changes, we can restart the cluster, and these settings will be loaded.
Once all these settings are complete, we need to copy the configuration files to other nodes too.
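Copying the configuration files can be scripted by looping over the hosts listed in the regionservers file. The following is a minimal sketch, assuming passwordless SSH is already configured and that HBase lives at the same path on every node; sync_conf and DRY_RUN are hypothetical names introduced for illustration. By default it only prints the scp commands it would run.

```shell
#!/bin/sh
# Sketch: push the HBase conf files to every host listed in regionservers.
# sync_conf and DRY_RUN are illustrative names (assumptions, not HBase tooling).
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the scp commands, 0 = run them

sync_conf() {
  conf_dir="$1"     # local conf directory that holds the edited files
  remote_dir="$2"   # conf directory path on the remote nodes
  while read -r host; do
    # Skip blank lines and comment lines.
    case "$host" in ''|\#*) continue ;; esac
    if [ "$DRY_RUN" = 1 ]; then
      echo "scp $conf_dir/hbase-site.xml $conf_dir/hbase-env.sh $conf_dir/regionservers $host:$remote_dir/"
    else
      # stdin is redirected so scp cannot consume the host list being read
      scp "$conf_dir/hbase-site.xml" "$conf_dir/hbase-env.sh" \
          "$conf_dir/regionservers" "$host:$remote_dir/" </dev/null
    fi
  done < "$conf_dir/regionservers"
}
```

For example, calling sync_conf with the local conf directory and the matching remote path prints one scp command per listed DataNode; setting DRY_RUN to 0 performs the copy.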
Now, let's see how to install and configure ZooKeeper instances. It is advisable to configure an odd number of ZooKeeper instances. Let's assume that we want three servers. Download a copy of ZooKeeper on each of these three servers and make changes in the zoo.cfg file, which lies in ZooKeeper's conf directory.
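As an illustration of what a three-server zoo.cfg typically contains, the following sketch generates one, along with the myid file that each server needs in its data directory. It reuses the zkhost1 to zkhost3 hostnames and the /mnt/disk1/zookeeperData path from the hbase-site.xml settings shown earlier. The timing values and the 2888 and 3888 ports are ZooKeeper's commonly used sample values, and MY_ID and OUT_DIR are illustrative names; treat all of them as assumptions to adapt.

```shell
#!/bin/sh
# Sketch: generate zoo.cfg and myid for one member of a three-node ensemble.
ZK_HOSTS="zkhost1 zkhost2 zkhost3"        # hostnames reused from hbase.zookeeper.quorum
ZK_DATA_DIR="/mnt/disk1/zookeeperData"    # matches hbase.zookeeper.property.dataDir
MY_ID="${MY_ID:-1}"                       # 1, 2, or 3, depending on the server (assumption)
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"        # scratch directory for the generated files

{
  echo "tickTime=2000"      # basic time unit in milliseconds
  echo "initLimit=10"       # ticks a follower may take to connect and sync
  echo "syncLimit=5"        # ticks a follower may lag behind the leader
  echo "dataDir=$ZK_DATA_DIR"
  echo "clientPort=2181"
  id=1
  for host in $ZK_HOSTS; do
    # server.<id>=<host>:<peer port>:<leader election port>
    echo "server.$id=$host:2888:3888"
    id=$((id + 1))
  done
} > "$OUT_DIR/zoo.cfg"

# Each server stores its own id in a file named myid inside dataDir.
echo "$MY_ID" > "$OUT_DIR/myid"
echo "Generated $OUT_DIR/zoo.cfg and $OUT_DIR/myid"
```

The same zoo.cfg goes into the conf directory on all three servers, while the myid file differs per server and belongs in the directory named by dataDir.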