Chapter 2. Let's Begin with HBase

In the previous chapter, we learned about HBase and its ecosystem. In this chapter, we will discuss HBase and its components in a bit more detail. This chapter will guide you through the prerequisites and assumptions that you need to make when you start using HBase. It will also focus on the requirements for configuring an HBase cluster and the parameters to keep in mind for a healthy, well-performing HBase. You will also get to know the HBase components and their deployment considerations. Let's take a look at the topics that we are going to discuss in this chapter:

  • HFile
  • HBase region
  • Scalability
  • Reading and writing cycle
  • Write-Ahead Logs
  • MemStore
  • Some HBase housekeeping concepts
  • Region operations
  • Capacity planning
  • List of available HBase distributions
  • Prerequisites for HBase

Understanding HBase components in detail

To understand the components of HBase, let's start from the bottom, from HFile to RegionServers, and then progress towards the master. There can be one to n RegionServers, one to n DataNodes, and one to n ZooKeeper nodes. Refer to the following figure:

Understanding HBase components in detail

HFile

HFile is designed after Google's SSTable; it is a reinterpretation of the file format that Google describes in its Bigtable paper. It has been used since HBase v0.20.0; earlier, an alternate file format, MapFile, was used temporarily. An HFile internally consists of HFile blocks, which are its building blocks.

Note

Go through the links https://hbase.apache.org/book/apes03.html and https://issues.apache.org/jira/browse/HBASE-3857, and also go through the PDF files available at the first link, for the actual on-disk representation of an HFile.

Region

Regions are the basic building blocks of RegionServers; they provide distribution, availability, and storage for columns and column families in an HBase cluster. The overall structure is as follows:

  • HBase table: The table representation in HBase
  • Region: A region that constitutes the HBase table
  • Store: A store exists per column family, for every region of each HBase table
  • MemStore: A MemStore exists for each region of the table and for each store
  • Store file: A store file exists for each region of the table and for each MemStore
  • Block: The basic building block of store files
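
To make this hierarchy concrete, here is a minimal sketch using the HBase Java client (assuming an HBase 1.x client library on the classpath; the table name test_table and the column families cf1 and cf2 are only example names). Each column family declared at table-creation time becomes a separate store within every region of the table:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CreateTableExample {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml / hbase-default.xml from the classpath
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                // One table -> one or more regions; each column family -> one store per region
                HTableDescriptor descriptor = new HTableDescriptor(TableName.valueOf("test_table"));
                descriptor.addFamily(new HColumnDescriptor("cf1"));
                descriptor.addFamily(new HColumnDescriptor("cf2"));
                admin.createTable(descriptor);
            }
        }
    }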

On HDFS, the structure looks like the one shown in the following figure:

Region

In the preceding figure, /hbase refers to the HBase directory on HDFS, /table is inside the /hbase directory, and so on. Once we have a running HBase cluster, we can navigate HDFS to see the storage structure of the HBase directory. We can visit the web UI of the Hadoop NameNode and browse to the /hbase directory, which is created when we configure and start the daemon processes such as HMaster and the RegionServers. The name of this directory depends on what we assign to the hbase.rootdir setting in the HBase configuration. By navigating this path, we can understand the logical storage of the HBase root directory on top of HDFS.
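
The same structure can also be inspected programmatically. The following is a small sketch (assuming the Hadoop and HBase configuration files are on the classpath) that reads the hbase.rootdir setting and lists the top-level entries of the HBase root directory on HDFS; the exact entries, such as table directories and WAL directories, vary with the HBase version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ListHBaseRootDir {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // hbase.rootdir points to the HBase directory on HDFS,
            // for example hdfs://namenode:8020/hbase
            Path rootDir = new Path(conf.get("hbase.rootdir"));
            FileSystem fs = rootDir.getFileSystem(conf);
            for (FileStatus status : fs.listStatus(rootDir)) {
                System.out.println(status.getPath().getName());
            }
        }
    }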

HLogs are the files that record all edits before they are saved to the HStore files; these are the HBase Write-Ahead Logs (WALs). Internally, HBase performs logfile rolling on them. There is one HLog file per RegionServer, and write-ahead logging (writing changes to the logfile before performing the actual write) is performed on this logfile for every region on that particular RegionServer. The HLog consists of multiple on-disk files. The following diagram shows how the WAL structure, that is, the HLog files, is stored on HDFS. In this figure, /.logs stands for the .logs directory inside the HBase root directory on HDFS.

Region
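
The write-ahead behavior is also visible from the client side: every mutation is appended to the RegionServer's HLog before it is acknowledged, and the per-mutation durability can be tuned. Here is a minimal sketch, assuming an HBase 1.x client; test_table, cf1, col1, and row1 are only example names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalDurabilityExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("test_table"))) {
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
                // SYNC_WAL (the usual default) appends the edit to the RegionServer's HLog
                // before the write is acknowledged; SKIP_WAL trades durability for speed
                put.setDurability(Durability.SYNC_WAL);
                table.put(put);
            }
        }
    }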

Scalability – understanding the scale up and scale out processes

In the previous chapter, we learned about the HBase scale out process; let's now see what it means and how it's done. Let's discuss scaling out and scaling up, and which is better for what. HBase scales out rather than up; this capability comes from the underlying HDFS filesystem and Hadoop, which form a distributed system that can scale out on the fly simply by adding new machines whenever needed. In HBase, we can always add a new Hadoop DataNode; on these DataNodes, we can host more RegionServers for higher scalability. Refer to the following figure:

Scalability – understanding the scale up and scale out processes

Scale up

You must be aware of the fact that traditional scaling of systems, applications, and databases depends on the capacity of the machine on which they are hosted. This is called vertical scaling, where an application is migrated to a more powerful machine with more memory, processing power, and storage. In this type of scaling, there are only so many powerful servers available; a server cannot keep growing in processing power, or even in memory, as there is always a hardware limit at any given point in time. A particular processor, server, or OS may support only a specific amount of memory and cannot grow beyond that limit. So, these types of systems are not very scale friendly, and scaling vertically is also financially expensive.

In this type of system, there is always a more powerful, centralized machine that is responsible for handling all the operations. With the increase in data size or the processing power requirement, the system struggles, and it is at that time that we need to upgrade the system to a better configuration. Some problems of scale up or vertically scaled systems are as follows:

  • Data migration and software and hardware upgrades
  • Application reconfiguration
  • Reconfiguration overhead

Some benefits of scale-up systems are: less and one-time configuration (until the next upgrade), lower power consumption, less cooling, less space, and a centralized system.

Scale out

On the contrary, scaling out, or scaling horizontally, means adding processing power, memory, and storage by adding machines to the system. Here, servers are not replaced with a more powerful server; instead, a new machine is added to the system when there is a need for more storage, processing power, or main memory. In this kind of system, multiple machines work virtually as a single system to provide large-scale processing power. Let's discuss why we should choose a scale out-based system. Refer to the following figure for a better understanding:

Scale out

A scale out-based system enables us to build a redundant, highly available system. It is cost effective: there is no need to invest in high-end machines, there is no application migration overhead, and servers can be located at many different sites. It is suitable for massively parallel computing, where a number of machines take up the workload evenly. The following figure shows the HBase scaling method:

Scale out

In HBase, we can add new RegionServers on the fly; for this, new DataNodes are added, the RegionServer daemon is started on these DataNodes, and scalability is obtained. In short, we first add a node to the cluster, and then start the DataNode and RegionServer daemons on the newly added node.

Let's talk about HBase communication between daemons (nodes). The different daemons and HBase nodes communicate with each other using Remote Procedure Call (RPC), which enables the HBase components to call built-in functions on one another and to treat these calls as if they were local. This, in turn, enables procedures or subroutines to be executed in a different address space, such as on another computer system. This kind of intercommunication avoids having to rewrite the server code to handle the remote interaction.

The following figure shows the RPC flow:

Scale out

In HBase, HBaseRPC is the class that enables HBase to use RPC among its components. It is based on the Java dynamic proxy pattern: it uses an invoker class that implements InvocationHandler to intercept client-side method calls, and then marshals the method name and arguments and sends them through HBaseClient.
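
The dynamic proxy pattern itself is plain Java. The following is an illustrative sketch of the idea rather than HBase's actual code; the RegionService interface and its method are hypothetical stand-ins for a remote interface such as HRegionInterface. A proxy implementing the interface intercepts every call in an InvocationHandler, which is where a real RPC layer would marshal the method name and arguments and send them over the wire:

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Proxy;
    import java.util.Arrays;

    public class RpcProxyExample {
        // A hypothetical remote interface, standing in for something like HRegionInterface
        interface RegionService {
            String getRegionInfo(String regionName);
        }

        public static void main(String[] args) {
            InvocationHandler invoker = (proxy, method, methodArgs) -> {
                // In real RPC, the method name and arguments would be marshalled here
                // and sent to the server (HBase does this via HBaseClient)
                System.out.println("Calling " + method.getName()
                        + " with args " + Arrays.toString(methodArgs));
                return "stub-response-for-" + methodArgs[0];
            };

            RegionService service = (RegionService) Proxy.newProxyInstance(
                    RegionService.class.getClassLoader(),
                    new Class<?>[]{RegionService.class},
                    invoker);

            // Looks like a local call, but every invocation goes through the handler
            System.out.println(service.getRegionInfo("region-1"));
        }
    }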

The communication between the client and the server using RPC works as follows:

  1. The client contacts ZooKeeper to find out who the active HMaster is and what the location of the root RegionServer is.
  2. Then, the client communicates with the RegionServer using HRegionInterface to read/write the table.
  3. Client applications talk to HMaster using HMasterInterface in order to dynamically create a table, add a column family, and perform other operations.
  4. Then, HMaster communicates with RegionServers using HRegionInterface to open, close, move, split, or flush a region.
  5. The active HMaster's data and the root RegionServer location are cached in ZooKeeper by HMaster.
  6. The RegionServer then reads data from ZooKeeper to get information about log-splitting tasks, and updates this information to report the task status.
  7. The RegionServer then communicates with HMaster using HMasterRegionInterface to convey information such as the RegionServer's load, RegionServer errors, and the RegionServer startup process.

    Sometimes, the RegionServer also communicates with the root region or the meta region, with the help of HRegionInterface, to check the current status of a region or to create a new daughter region during region splitting.

  8. This communication is repeated at a tick time interval or a threshold time interval to keep everything updated.
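
From an application's point of view, all of this lookup and caching happens behind the client API: opening a connection performs the ZooKeeper lookup described in step 1, and subsequent reads and writes go straight to the RegionServer that owns the row. A minimal sketch, assuming an HBase 1.x client and the same hypothetical test_table with column family cf1:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ClientReadExample {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath; the ZooKeeper quorum is taken from there
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("test_table"))) {
                // The client locates the region holding "row1" (via ZooKeeper and the meta table)
                // and then talks to that RegionServer directly
                Get get = new Get(Bytes.toBytes("row1"));
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("col1"));
                System.out.println(value == null ? "not found" : Bytes.toString(value));
            }
        }
    }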