Configuring Apache HBase

As we have already set up Java, SSH, NTP, DNS, and Hadoop, let's configure HBase. HBase runs in these modes:

  • Standalone: This mode uses the local file system, not HDFS, to store tables and data. All the daemon processes run under a single JVM. This mode is suitable for testing purposes, but if you want the real power of HBase, you need to configure it with Hadoop.
  • Distributed: This can be divided into the following two modes:
    • Pseudo-distributed: This mode uses HDFS as the underlying file system. All daemon processes run under a single JVM on a single machine. This is best for testing purposes; it also provides the power of Hadoop and can be configured with fewer resources on a single machine.
    • Fully distributed: This mode also uses HDFS as the underlying file system, but its daemon processes run under different JVMs on different machines. This mode is best for the production environment.

Configuring HBase in the standalone mode

Configuring HBase in the standalone mode is simple since we don't have to make many configuration changes. We can just download the TAR file, extract it, and start the daemons. HBase configured in this mode will not use HDFS; instead, it uses the file:// scheme to store data files on the local file system.

Download the HBase binary from http://apache.mirrors.pair.com/hbase/. Here, we will use the latest compatible version of HBase. Refer to the compatibility table to understand which HBase version we need to download for which Hadoop version.

Note

Use the following command to download the desired HBase binary:

wget http://apache.mirrors.pair.com/hbase/<version to download .tar>

Once the download is complete, extract the TAR file and move it to a location where you want to configure it, such as the ~/hbinaries/hbase path:

tar -xvzf hbase-0.94.18.tar.gz  # extract the tar
mv hbase-0.94.18 hbase          # rename to hbase
mv hbase ~/hbinaries/hbase      # move to the configuration location
cd ~/hbinaries/hbase            # go to the ~/hbinaries/hbase directory


Once extracted and moved to the desired location, we can start the processes as is. HBase will then store the tmp files, log files, and other files in the default locations defined in the *-default.xml files, mostly in /tmp or the current directory. This is why the standalone mode is not advisable: everything is lost once the system is restarted, because the content of the /tmp directory is removed on every reboot.

This mode can be used to get familiar with the directory structure, start up HBase, and run some basic HBase commands. Now, let's configure HBase in the distributed mode.
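A quick way to get familiar with standalone HBase is to start it and run a few commands in the HBase shell; the table name test and column family cf below are just examples:

```shell
bin/start-hbase.sh     # start the standalone HBase instance
bin/hbase shell        # open the HBase shell, then try:
#   create 'test', 'cf'                     -- create a table
#   put 'test', 'row1', 'cf:a', 'value1'    -- insert one cell
#   scan 'test'                             -- read the row back
#   disable 'test'
#   drop 'test'                             -- clean up the test table
#   exit
bin/stop-hbase.sh      # stop HBase when done
```

These shell commands work the same way in all three modes, so they are also useful later for verifying a distributed setup.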

Configuring HBase in the distributed mode

We can configure the pseudo-distributed mode first and then distribute the configuration files to the nodes that we would like to have in the cluster, where we need to run RegionServers. So, let's configure HBase in the distributed mode.

The following are the files that need to be changed or configured:

  • hbase-site.xml: This is an HBase site-specific file. We can add configurations related to the HBase root and log directories in this file. All default settings from the hbase-default.xml file can be overridden in this file.
  • hbase-env.sh: We can define runtime variables such as Java path, RegionServer-related options, and JVM settings in this file.
  • regionservers: This file contains all the node names where we want to host or run our RegionServer daemon.

Now, let's set the values and start configuring.

hbase-site.xml

Let's see the configuration we need to add in the hbase-site.xml file:

<configuration>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://infinity:9000/hbase</value>
    <description>Here, we need to enter the Hadoop NameNode address followed by the directory name where the HBase files are to be stored.</description>
  </property>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>This parameter decides whether HBase will run in local mode or distributed mode.</description>
  </property>

  <property>
    <name>hbase.tmp.dir</name>
    <value>/mnt/disk1/tmp</value>
    <description>Using this parameter, we specify tmp directory for HBase.</description>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>infinity</value>
    <description>Using this parameter, we can specify the addresses of the ZooKeeper host machines.</description>
  </property>

  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
    <description>The port at which clients connect to ZooKeeper (2181 is the ZooKeeper default).</description>
  </property>

</configuration>

hbase-env.sh

We can define runtime parameters in this file. This file will include the following lines:

export JAVA_HOME=<path of Java here>

In the preceding line, we mention the Java path.

export HBASE_HEAPSIZE=8000

In the preceding line, we mention the heap memory size; HBASE_HEAPSIZE is specified in MB.

export HBASE_MANAGES_ZK=true

The preceding variable should be set to false in the fully distributed mode, where ZooKeeper is installed as separate instances on different machines. For the pseudo-distributed mode, or whenever we want HBase to manage its built-in ZooKeeper, we keep it set to true; if we have a separate ZooKeeper, we make it false.

regionservers

In this file, we list all the servers we need to run RegionServers:

infinity

All set; now let's start HBase. As this is the pseudo-distributed mode, Hadoop must be running first. Start the Hadoop processes, and then start HBase:

bin/start-hbase.sh

In the preceding command, we run the script that will start all the required processes.

Alternatively, we can start HBase in the following ways:

  1. Start the ZooKeeper:
    bin/hbase-daemon.sh start zookeeper
    
  2. Start the HMaster:
    bin/hbase-daemon.sh start master
    
  3. Start the RegionServer:
    bin/hbase-daemon.sh start regionserver
    

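To verify that the daemons came up, we can list the running Java processes with jps; the process names below are what the HBase and ZooKeeper daemons report:

```shell
jps
# In the pseudo-distributed mode, the list should include:
#   HMaster        -- the HBase master process
#   HRegionServer  -- the RegionServer process
#   HQuorumPeer    -- the ZooKeeper instance managed by HBase
```

If any process is missing, check the corresponding log file under the logs directory of the HBase installation.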
This will start HBase in the pseudo-distributed mode with all the processes running on the same server. If we need to make HBase fully distributed, we will have to add all the DataNode hostnames to the regionservers file; HBase will then run RegionServers on those nodes.

There is only a small difference between the pseudo-distributed and fully distributed modes. First, we will add the following settings to hbase-site.xml and add the hostnames of the DataNodes where we need to host our RegionServers. In this setup, we will have separate ZooKeeper instances (an odd number of them). We start the master server on the NameNode server, RegionServers on the different DataNodes, and ZooKeeper processes on their own machines.

There are two ways to start the cluster. If passwordless SSH is configured and the master server can connect to all the DataNodes, we can use start-hbase.sh, which will start HMaster on the NameNode and RegionServers on the listed DataNodes.

We must first start all the ZooKeeper instances, and only then run the start-hbase.sh script. The other method is to start the processes individually using hbase-daemon.sh: HMaster on the NameNode server and a RegionServer on each DataNode, calling the script in the same way we did when starting HBase in the pseudo-distributed mode.
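As a sketch, the per-node commands for the fully distributed setup would look like the following; the hostnames are hypothetical, and zkServer.sh is the standard start script shipped with a standalone ZooKeeper installation:

```shell
# On each ZooKeeper host (for example, zkhost1, zkhost2, zkhost3),
# start ZooKeeper first from the ZooKeeper installation directory:
bin/zkServer.sh start

# On the NameNode host, start the HBase master:
bin/hbase-daemon.sh start master

# On every DataNode listed in the regionservers file, start a RegionServer:
bin/hbase-daemon.sh start regionserver
```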

A few settings that we need to add or change in hbase-site.xml are as follows:

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zkhost1:2181,zkhost2:2181,zkhost3:2181</value>
  <description>List of zookeeper instances</description>
</property>

<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/mnt/disk1/zookeeperData</value>
  <description>zk data directory path</description>
</property>

We also need to add the DataNode addresses in the regionservers file, as follows:

datanode1
datanode2
datanode3

Change the line export HBASE_MANAGES_ZK=true to export HBASE_MANAGES_ZK=false in the hbase-env.sh file. This instructs HBase not to use its inbuilt ZooKeeper, as separate ZooKeeper instances exist. In the preceding hbase-site.xml configuration, hbase.zookeeper.quorum tells HBase where the ZooKeeper instances are running, and hbase.zookeeper.property.dataDir defines the ZooKeeper data directory.

After making these changes, we can restart the cluster, and these settings will be loaded.

Once all these settings are complete, we need to copy the configuration files to other nodes too.
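One way to push the configuration to the other nodes is a simple loop over the regionservers file; this is a sketch that assumes passwordless SSH is already set up, and the path under HBASE_HOME matches the layout used earlier:

```shell
#!/bin/sh
# Copy the HBase conf directory to every node listed in conf/regionservers.
HBASE_HOME="$HOME/hbinaries/hbase"
while read -r host; do
  scp -r "$HBASE_HOME/conf" "$host:$HBASE_HOME/"
done < "$HBASE_HOME/conf/regionservers"
```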

Now, let's see how to install and configure the ZooKeeper instances. It is advisable to configure an odd number of ZooKeeper instances. Let's say we want three servers. Download a copy of ZooKeeper on these three servers and make changes in the zoo.cfg file, which lies in ZooKeeper's conf directory.
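A minimal zoo.cfg for such a three-node ensemble might look like the following; the hostnames match the earlier hbase-site.xml example, the dataDir path is the one used there, and each server additionally needs a myid file in its dataDir containing its own server number (1, 2, or 3):

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/mnt/disk1/zookeeperData
    clientPort=2181
    server.1=zkhost1:2888:3888
    server.2=zkhost2:2888:3888
    server.3=zkhost3:2888:3888

The 2888 and 3888 ports are the ZooKeeper defaults for follower connections and leader election, respectively.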
