Prerequisites and capacity planning for HBase

When we start configuring the HBase cluster, we always need to keep some considerations in mind. First things first, before configuring HBase, our Hadoop cluster must be up and running well.

Here, we will discuss parameters ranging from the OS to the network, and from disk to processing and memory. We will also cover the prerequisites for HBase and the factors that affect how the cluster functions. HBase is only as healthy as the Hadoop cluster beneath it, so we first need a healthy, smoothly running Hadoop cluster on top of which we can build a full-fledged HBase cluster.

In this section, we will discuss the considerations for a Hadoop cluster as well as an HBase cluster. In the following chapters, we will walk through the full step-by-step configuration, top-down, first of a Hadoop cluster and then of an HBase cluster, in standalone, pseudo-distributed, and fully distributed modes, using the Apache and Cloudera distributions.

HBase uses a local hostname to report its IP address, so the cluster network and its machines must be forward and backward (reverse) DNS resolvable. Let's now discuss forward and backward (reverse) DNS resolution in brief. Take a look at the following figure:

[Figure: forward and reverse DNS resolution]

The forward DNS resolution

Forward DNS resolution uses the domain name to find the IP address of a machine in the network. This is the usual Domain Name System (DNS) lookup, in which a name server returns the IP address that corresponds to a name.

The reverse DNS resolution

Reverse DNS resolution uses an IP address to find the hostname of a machine. Both resolutions matter when configuring a cluster, because each machine must be able to communicate with the others using either the hostname or the IP address.

Tip

Every machine must be accessible using the hostname as well as the IP address for proper functioning of the HBase cluster.
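The two lookups can be checked together with a few lines of Python. The following is a minimal sketch, not part of HBase: the function name check_dns is hypothetical, and the resolver functions are passed in as parameters only so that the logic can be exercised without a live DNS server.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve hostname -> IP (forward), then IP -> hostname (reverse).

    Returns (ip, reverse_name, consistent). `consistent` is True when the
    reverse lookup returns the original hostname as its primary name or
    as one of its aliases.
    """
    ip = forward(hostname)                       # forward resolution
    primary, aliases, _addresses = reverse(ip)   # reverse resolution
    names = {primary.lower(), *(alias.lower() for alias in aliases)}
    return ip, primary, hostname.lower() in names
```

Calling check_dns(socket.gethostname()) on each node, and then check_dns with the names of the other nodes, gives a quick indication of whether both directions agree. A failed lookup raises an OSError subclass (socket.gaierror or socket.herror), which is itself a sign that resolution is not set up correctly.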

The following figure shows the basic prerequisites for configuring an HBase cluster:

[Figure: basic prerequisites for configuring an HBase cluster]

We will need a Linux distribution for a full-fledged production cluster. If we need support, we can go with enterprise or paid versions of Linux, such as Red Hat Enterprise Linux or a Debian-based distro with enterprise support. If we don't want to invest initially, we can go with free distros such as CentOS, Ubuntu, or any other Linux distro. HBase can be configured on a Mac too, which is another option for building a testing and evaluation setup.

An HBase cluster can be configured on Windows too, although this is not suitable for production. However, people who just want to test and evaluate HBase, and are not in a position to switch to Linux directly, can choose this option. For this, we need to install Cygwin (a software package that provides a Unix-like environment and tools on Windows) and configure the HBase cluster on it.

Java

We need to have Java installed on our system. This is one of the basic requirements, as all the daemon processes run under the JVM. We will discuss installing Java on various platforms such as Ubuntu, Red Hat distributions, and some other Linux distributions in the following chapters, where we will build up a cluster step by step. We must install Oracle Java 6 (formerly Sun Java) or a later version.

We can use RPM packages on Red Hat-based distros and archive (tar) files on other distros.

In fact, Java is a must for Hadoop configuration, and we can configure HBase on top of Hadoop once Hadoop is running fine. The following is what Apache says about Java for Hadoop/HBase:

Hadoop requires Java 7 or a late version of Java 6. It is built and tested on both OpenJDK and Oracle (HotSpot)'s JDK/JRE.

Note

You can find more details on Java requirements at http://wiki.apache.org/hadoop/HadoopJavaVersions.
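To confirm which Java version a node actually runs, we can parse the output of java -version. The sketch below is illustrative only; the function names are hypothetical, and it assumes that the version banner contains a quoted version string, as the Oracle and OpenJDK builds print.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy numbering ("1.6.0_45" -> 6) as well as the newer
    one ("11.0.2" -> 11). Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    first, second = int(match.group(1)), match.group(2)
    if first == 1 and second is not None:
        return int(second)   # legacy "1.x" numbering
    return first


def installed_java_major():
    """Run `java -version`, which prints its banner to stderr, and parse it.

    Raises FileNotFoundError if no `java` executable is on the PATH.
    """
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(result.stderr)
```

A node would then pass the requirement stated above if installed_java_major() returns 6 or more.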

SSH

We can configure SSH for easy communication between the master and the other hosts. If our cluster is inside a secure network, we can configure passwordless SSH. This is not compulsory, but once passwordless SSH is configured, we can use all the HBase scripts, such as start-hbase.sh and others.

SSH (Secure Shell) is a cryptographic network protocol for secure communication, remote login, remote command execution, and other secure network services between two networked computers. It connects a server and a client over an insecure network. We will discuss configuring SSH on various platforms in the upcoming chapters.

The following figure shows how nodes communicate using the SSH protocol. SSH is used only for configuration purposes; HBase does not use it for communication between its daemons:
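Whether passwordless SSH already works for a given host can be tested without risking an interactive prompt. The following sketch is an illustration under assumptions: the helper names are hypothetical, and it relies on the OpenSSH client options BatchMode and ConnectTimeout.

```python
import subprocess


def ssh_check_command(host, timeout_seconds=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                       # never prompt for a password
        "-o", f"ConnectTimeout={timeout_seconds}",   # give up quickly on dead hosts
        host,
        "true",                                      # trivial remote command
    ]


def passwordless_ssh_ok(host):
    """Return True if `host` accepts a non-interactive login for the current user."""
    result = subprocess.run(ssh_check_command(host), capture_output=True)
    return result.returncode == 0
```

Note that in batch mode the check also fails for a host whose key has not yet been accepted into known_hosts, so a False result does not always mean that the key pair itself is missing.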

[Figure: nodes communicating over SSH]

Domain Name Server

HBase uses the local hostname to report its IP address. For this to work, we can use hosts file-based resolution, or, for production servers, set up a Domain Name Server to resolve the machines in the cluster network.

The primary network interface is used by HBase for communication, so we need to configure a hostname for our primary interface.

The following simple steps verify that DNS is configured correctly and help us avoid DNS-related issues during HBase configuration and operation:

  1. Set a hostname for each machine in the cluster.
  2. Check that forward and reverse DNS both resolve, which means that you can reach the nodes using both an IP address and a hostname.
  3. Use a DNS verification tool to verify the correctness.

The following diagram shows where and how we need to change the DNS settings:

[Figure: where and how to change the DNS settings]

To be on the safe side, try to have both hosts file-based and DNS-based resolution in place. Then, if DNS fails at some point, operation is not disrupted and the nodes can still communicate with each other. Everything depends on these two mechanisms, namely, hosts file-based resolution and DNS-based resolution. The network interface that HBase uses for resolution has a default in hbase-default.xml, which you can always override in the hbase-site.xml file to select a specific network interface. You need to set one of the following properties:

  • hbase.master.dns.interface
  • hbase.regionserver.dns.interface

If we configure our DNS-related settings correctly and host resolution is accurate from the beginning, we can avoid a lot of issues that come up with RegionServers, ZooKeeper, and other components.

We can use the following Linux commands to verify that the settings are correct and that the nodes are reachable. Once we are confident that all the DNS-related settings are correct, we can move forward:

dig, nslookup, ping

The loopback IP must be set to 127.0.0.1 instead of 127.0.1.1 for the localhost.
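Some distributions map the machine's own hostname to 127.0.1.1 in the hosts file, which is the situation that the loopback rule above guards against. The following sketch scans hosts file content for such entries; the function names are hypothetical, and the text is expected to follow the usual /etc/hosts layout of an address followed by one or more names.

```python
def hostnames_on_loopback(hosts_text):
    """Return {name: ip} for every name mapped to a 127.x.x.x address.

    `hosts_text` is the content of a hosts file such as /etc/hosts.
    """
    found = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        if ip.startswith("127."):
            for name in names:
                found[name] = ip
    return found


def misplaced_loopback_entries(hosts_text):
    """Names bound to a loopback address other than 127.0.0.1 (for example, 127.0.1.1)."""
    return {name: ip
            for name, ip in hostnames_on_loopback(hosts_text).items()
            if ip != "127.0.0.1"}
```

Passing the content of /etc/hosts to misplaced_loopback_entries should return an empty dictionary on a correctly prepared node.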

Using Network Time Protocol to keep your node on time

The Network Time Protocol (NTP) keeps the machine time synchronized with a time server. In an HBase cluster, all machines must have synchronized clocks. The NTP service might already be installed, in which case we only need to enable it. If it is not available, we can always install it using the package manager of the Linux distro we are using. Take a look at the following diagram:

[Figure: an NTP server keeping the nodes of an HBase cluster on time]

The preceding diagram shows how an NTP server functions with HBase. We will see how to install this service at the configuration stage.
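Before the NTP service is installed, a rough idea of how far a node's clock has drifted can be obtained with a single NTP query. The following is a simplified sketch, not a replacement for an NTP client: it ignores network delay, the function names are hypothetical, and pool.ntp.org is only an assumed default (a cluster would normally point at its own time server).

```python
import socket
import struct
import time

# Seconds between 1900-01-01 (NTP epoch) and 1970-01-01 (Unix epoch).
NTP_EPOCH_OFFSET = 2208988800


def ntp_transmit_time(packet):
    """Decode the server's transmit timestamp (bytes 40-47) as Unix seconds."""
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def rough_clock_offset(server="pool.ntp.org", timeout=5.0):
    """Return (server time - local time) in seconds, ignoring network delay."""
    request = b"\x1b" + 47 * b"\0"   # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))   # 123 is the standard NTP port
        packet, _ = sock.recvfrom(512)
    return ntp_transmit_time(packet) - time.time()
```

Running rough_clock_offset() on each node and comparing the results shows whether the machines agree with each other to within a fraction of a second.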

OS-level changes and tuning up OS for HBase

HBase tends to open a lot of files during operation, so we need to tune some OS settings for better performance and trouble-free operation. The two parameters that we need to change in Linux distros are as follows:

  • nproc: Maximum number of processes that a user can have active at a time
  • nofile (commonly referred to as the ulimit): Maximum number of files that a user can have open at a time

To set the nproc and ulimit values, we need to change them in the limits.conf file found in Linux. To check the limits currently in effect for our user, we can execute the ulimit -a command. The following screenshot shows the existing settings and the changes needed at the OS level:

[Screenshot: existing and required OS-level limit settings]

This file is located in the /etc/security directory (/etc/launchd.conf on OS X) and can be opened and modified in any text editor, or changed from the command line. For a permanent effect, we need to make the change in the limits.conf file and save it.
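Each entry in limits.conf has four fields: a domain (a user or group), a type (soft or hard), an item (such as nofile or nproc), and a value. The following sketch shows what such entries look like and reads them back for one user. The user name hbase and the value 32768 are illustrative assumptions, not recommendations, and the parser is deliberately minimal.

```python
def parse_limits(conf_text, user):
    """Return {(type, item): value} for `user` from limits.conf-style text.

    Each non-comment line has four fields: <domain> <type> <item> <value>.
    Only exact matches on the user name are considered; wildcard ("*") and
    group ("@name") domains are ignored to keep the sketch small.
    """
    limits = {}
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) != 4:
            continue
        domain, limit_type, item, value = fields
        if domain == user:
            limits[(limit_type, item)] = value
    return limits


# Example content; "hbase" and 32768 are illustrative values only.
SAMPLE = """
# <domain> <type> <item>  <value>
hbase      soft   nofile  32768
hbase      hard   nofile  32768
hbase      soft   nproc   32768
hbase      hard   nproc   32768
"""
```

For the sample above, parse_limits(SAMPLE, "hbase") contains four entries, one for each combination of soft or hard with nofile or nproc.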

In a newly installed system, the value for open files is usually 1024, which is not enough, as HBase opens a lot of files during read/write operations and processing, and it also starts a lot of new processes and subprocesses while running. Not configuring these parameters properly might result in a lot of runtime errors such as java.io.IOException (too many open files), OutOfMemoryError, and others; all the frequent exceptions will be discussed in the Troubleshooting the most frequent HBase errors and their explanation section of Chapter 6, HBase Cluster Maintenance and Troubleshooting.

These parameters are not universal and should be set according to the existing system configuration. They also depend on the amount of heap memory available. On a node with a good configuration, we can set a value between 24 K and 65 K, or more if required. However, there is a limit that depends on the system resources; changing these values incorrectly might break the system.

There are two types of limits, hard and soft:

  • Soft limits are the currently enforced limits
  • Hard limits mark the maximum value that cannot be exceeded by setting a soft limit

A soft limit is always less than or equal to the hard limit, so to give HBase the full allowance, we need to set the same value for both. To change the hard limit, we need root access to the system, whereas a process can change its own soft limit up to the hard limit. The hard limit is set by processes with superuser privileges, and it cannot be exceeded by processes running with lower privileges.

Once HBase and Hadoop start functioning, they open more files and start more processes until they reach the OS limit. If we don't set the limits properly, the OS might kill the process, or we will not be able to create a required new process or open a new file for reading or writing, which causes runtime exceptions and affects the cluster. In fact, it might bring down an HBase daemon or node. After changing the values, we need to log in again as that user, or reboot the system, for them to take effect.
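The relationship between the two limits can be observed from inside a running process. The following sketch uses Python's standard resource module on Linux to read the open-file limits of the current process and to raise the soft limit to the hard limit, which does not require root. The helper names are hypothetical, and the change applies only to the current process and its children, not to limits.conf.

```python
import resource


def show(value):
    """Render a limit, mapping RLIM_INFINITY to a readable label."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def raise_soft_to_hard(limit=resource.RLIMIT_NOFILE):
    """Raise this process's soft limit to its hard limit; return (soft, hard)."""
    _soft, hard = resource.getrlimit(limit)
    resource.setrlimit(limit, (hard, hard))   # allowed without root privileges
    return resource.getrlimit(limit)


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft =", show(soft), "hard =", show(hard))
```

Trying to set a soft or hard value above the current hard limit from an unprivileged process raises an error, which is the behavior described above.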
