We will discuss this section in a question-and-answer manner, coming to understand HBase with the help of scenarios and conditions.
The following figure shows the flow of HBase: birth, growth, and current status:
It all started in 2006, when a company called Powerset (later acquired by Microsoft) was looking to build a natural language search engine. The person responsible for the development was Jim Kellerman, and there were many other contributors. It was modeled on the Google BigTable paper, which came out in 2006 and described a system running on the Google File System (GFS).
It started with a TAR file containing a random bunch of Java files with the initial HBase code. It was first added to the contrib directory of Hadoop as a small subcomponent and, with dedicated effort to fill the gaps, slowly and steadily grew into a full-fledged project. It was first added with Hadoop 0.1.0 and, as it became more and more feature rich and stable, it was promoted to a Hadoop subproject; then, with more and more development and contributions from the HBase user and developer community, it became one of the top-level projects at Apache.
The following figure shows HBase versions from the beginning until now:
Let's have a look at the year-by-year evolution of HBase's important features:
Features such as recreating the .META. table from file system data, snapshot support, and many more were added. For more information, visit https://issues.apache.org/jira/browse/HBASE to explore more detailed explanations and fuller lists of improvements and new features.
While we wait for new features in v1.0, we can always visit http://hbase.apache.org for the latest releases and features.
Here is the link from where we can download HBase versions:
Let's now discuss HBase and Hadoop compatibility and the features they provide together.
Prior to Hadoop v1.0, when a DataNode crashed, the HBase Write-Ahead Log (the logfiles that record write operations before they are applied to the MemStore) would be lost, and hence the data too. Hadoop v1.0 integrated the append branch into the main core, which increased durability for HBase. Hadoop v1.0 also implemented tolerance of disk failure, making RegionServers more robust.
Hadoop v2.0 integrated high availability for the NameNode, which also makes HBase more reliable and robust in combination with multiple HMaster instances. With this version, upgrading HBase has also become easier because it is independent of HDFS upgrades. Let's see in the following table how recent versions of Hadoop have enhanced HBase in terms of performance, availability, and features:
Criteria | Hadoop v1.0 | Hadoop v2.0 | Hadoop v2.x
---|---|---|---
Features | Durability using | |
Performance | Short-circuit read | |
Miscellaneous features in newer HBase releases include HBase isolation and allocation, online automated repair of table integrity and region consistency problems, dynamic configuration changes, reverse scanning (stop row to start row), and many others; visit https://issues.apache.org/jira/browse/HBASE for the features and advancements of each HBase release.
Here, let's discuss the various components of HBase, and their subcomponents, one by one:
ZooKeeper is a high-performance, centralized, multicoordination service system for distributed applications, which provides distributed synchronization and group services to HBase.
It enables users and developers to focus on the application logic rather than on coordination with the cluster, by providing APIs that developers can use to implement coordination tasks such as master election and managing application and cluster communication.
In HBase, ZooKeeper is used to elect a cluster master in order to keep track of available and online servers, and to keep the metadata of the cluster. ZooKeeper APIs provide:
The following figure shows ZooKeeper:
It was developed at Yahoo Research. The name ZooKeeper arose because projects in the Hadoop ecosystem are named after animals and, in the discussion about naming this technology, this name emerged, as it manages the availability of, and coordination between, the different components (animals) of a distributed system.
ZooKeeper not only simplifies development but also sits on top of the distributed system as an abstraction layer, facilitating better reachability of the system's components. The following figure shows the request and response flow:
Let's consider a scenario wherein a few people want to store items in 10 rooms. In one instance, each person must find their own way to a room in which to keep the items; some of the rooms will be locked, forcing them to move on to other rooms. In the other instance, we allocate representatives who have information about the rooms and their state (open, closed, fit for storing, not fit, and so on). We then send the people with their items to these representatives, who guide each person to the right room, one that is available for storage, so the person can move directly to the specified room and store the item. This not only eases the communication and storage process but also reduces the overhead in the process. The same technique applies in the case of ZooKeeper.
ZooKeeper internally maintains a tree of data nodes called znodes. A znode can be of two types:
ZooKeeper is based on a majority principle: it requires a quorum of servers to be up, where the quorum is a strict majority of the ensemble, that is, floor(n/2) + 1 servers. For a three-node ensemble, this means two nodes must be up and running at any point of time, and for a five-node ensemble, a minimum of three nodes must be up. The majority is also what makes electing the ZooKeeper leader possible. We will discuss more options for configuring and coding ZooKeeper in later chapters.
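The majority rule above can be sketched as a small calculation (the helper names are illustrative, not part of ZooKeeper's API):

```python
def quorum_size(ensemble_size: int) -> int:
    """Minimum number of servers that must be up: a strict majority."""
    return ensemble_size // 2 + 1

def tolerable_failures(ensemble_size: int) -> int:
    """How many servers can fail while the ensemble stays available."""
    return ensemble_size - quorum_size(ensemble_size)

for n in (3, 5, 7):
    print(n, quorum_size(n), tolerable_failures(n))
# prints: 3 2 1 / 5 3 2 / 7 4 3
```

Note that a four-node ensemble needs three servers up and so tolerates only one failure, the same as a three-node ensemble, which is why odd-sized ensembles are the norm.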
HMaster is the component of an HBase cluster that can be thought of as the NameNode of a Hadoop cluster; likewise, it acts as the master for the RegionServers running on different machines. It is responsible for monitoring all RegionServers in an HBase cluster, provides an interface to all HBase metadata for client operations, and handles RegionServer failover and region splits.
There may be more than one instance of HMaster in an HBase cluster to provide High Availability (HA). If we have more than one master, only one is active at a time; at startup, all the masters compete to become the active master of the cluster, and whichever wins becomes the active master. Meanwhile, all other master instances remain passive until the active master crashes, shuts down, or loses its lease from ZooKeeper.
In short, it is a coordination component in an HBase cluster, which also manages and enables us to perform an administrative task on the cluster.
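The compete-then-stand-by behavior can be sketched as a toy model (in the real cluster, the masters race to claim an ephemeral lease in ZooKeeper; the class below is purely illustrative):

```python
# Toy model of active/standby HMaster election. Illustrative only:
# real HBase masters acquire an ephemeral lease via ZooKeeper.
class MasterLease:
    def __init__(self):
        self.active = None

    def try_become_active(self, master: str) -> bool:
        """First master to claim the lease wins; others stay passive."""
        if self.active is None:
            self.active = master
            return True
        return False

    def active_lost(self):
        """Active master crashed, shut down, or lost its lease."""
        self.active = None

lease = MasterLease()
print(lease.try_become_active("master1"))  # True: becomes active
print(lease.try_become_active("master2"))  # False: remains passive
lease.active_lost()
print(lease.try_become_active("master2"))  # True: failover completes
```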
Let's now discuss the flow of starting up the HMaster process:
HMaster exports the following interfaces, which are metadata-based methods that enable us to interact with HBase:
Related to | Facilities
---|---
HBase tables | Creating, deleting, enabling/disabling, and modifying tables
HBase column families | Adding, modifying, and removing columns
HBase table regions | Moving, assigning, and unassigning regions
In HBase, there is a table called .META. (the table name on the file system), which keeps all the information about regions and which HMaster refers to for information about the data. By default, HMaster runs on port 60000 and its HTTP web UI is available on port 60010; both can be changed according to our needs.
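If those defaults clash with other services, the ports can be overridden in hbase-site.xml (the values shown here are just the defaults restated as an example):

```xml
<!-- hbase-site.xml: override the default HMaster ports (example values) -->
<property>
  <name>hbase.master.port</name>
  <value>60000</value>
</property>
<property>
  <name>hbase.master.info.port</name>
  <value>60010</value>
</property>
```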
HMaster functionalities can be summarized as follows:
If the master goes down, the cluster may continue working normally, as clients talk directly to RegionServers, so the cluster can still function steadily. The HBase catalog tables (.META. and -ROOT-) exist as HBase tables and are not stored in the master's resident memory. However, as the master performs critical functions such as RegionServer failover and region splits, these functions will be hampered, and if not taken care of, this will be a huge setback to overall cluster functioning; so, the master must be brought back up as soon as possible.
Now that Hadoop is HA enabled, HBase can also be made highly available by running multiple HMasters, for better availability and robustness.
RegionServers are responsible for holding the actual raw HBase data. Recall that in a Hadoop cluster, the NameNode manages the metadata and the DataNodes hold the raw data; likewise, in an HBase cluster, the HMaster holds the metadata and the RegionServers store the data. As you might guess, a RegionServer runs on top of a DataNode, utilizing the underlying file system, that is, HDFS.
The following figure shows the architecture of RegionServer:
RegionServer performs the following tasks:
The following are the components of RegionServers:
We will discuss more about these components in the next chapter.
The client is responsible for finding the RegionServer that hosts the particular row (data), which it does by querying the catalog tables. Once the region is found, the client contacts the RegionServer directly and performs the data operation. Once fetched, this location information is cached by the client for fast retrieval later. Clients can be written in Java or in any other language using external APIs.
There are two tables that maintain the information about all RegionServers and regions. This is a kind of metadata for the HBase cluster. The following are the two catalog tables that exist in HBase:
At the beginning of the startup process, the .META. table's location is obtained from -ROOT-, from where the actual metadata of the tables is read, and reads/writes continue. So, whenever a client wants to connect to HBase and read from or write to a table, these two tables are consulted, and the information is returned to the client so that it can read and write directly to the RegionServers and the regions of the specific table.
The following is a list of just a few of the many companies that use HBase in production:
The list could go on; for further information, please visit http://wiki.apache.org/hadoop/Hbase/PoweredBy.
Using HBase is not the solution to every problem, though it can solve a lot of problems efficiently. The first thing to think about is the amount of data: if we have a few million rows and only a few reads and writes, we can avoid using it. However, with billions of rows and thousands of read/write operations in a short interval, we can surely think of using HBase.
Let's consider an example: Facebook uses HBase for its real-time messaging infrastructure, and we can imagine how many messages, or rows of data, Facebook receives per second. With that amount of data and I/O, using HBase makes sense. The following list details a few scenarios in which we can consider using HBase:
There are many other cases where HBase can be used beneficially, which are discussed in later chapters.
Let's now discuss some cases where we don't have to use HBase just because everyone else is using it:
The following is the list of some HBase tools that are available in the development world:
For reference, go to http://sourceforge.net/projects/hbaseexplorer/.
For reference, go to http://www.toadworld.com/products/toad-for-cloud-databases/default.aspx.
For reference, go to http://sourceforge.net/projects/haredbhbaseclie/.
For reference, go to https://github.com/NiceSystems/hrider.
For reference, go to https://github.com/sentric/hannibal.
For reference, go to http://sematext.com/spm/.
For reference, go to https://github.com/forcedotcom/phoenix and http://phoenix.apache.org/.
As with almost all systems, HBase has compatibility issues with Hadoop versions, which means not every version of HBase can be used on top of every Hadoop version. The following is the Hadoop-HBase version compatibility matrix that should be kept in mind while configuring HBase on Hadoop (credit: Apache):
Hadoop versions | HBase 0.92.x | HBase 0.94.x | HBase 0.96.0 | HBase 0.98.0
---|---|---|---|---
Hadoop 0.20.205 | Supported | Not supported | Not supported | Not supported
Hadoop 0.22.x | Supported | Not supported | Not supported | Not supported
Hadoop 1.0.0-1.0.2 | Supported | Supported | Not supported | Not supported
Hadoop 1.0.3+ | Supported | Supported | Supported | Not supported
Hadoop 1.1.x | Not tested enough | Supported | Supported | Not supported
Hadoop 0.23.x | Not supported | Supported | Not tested enough | Not supported
Hadoop 2.0.x-alpha | Not supported | Not tested enough | Not supported | Not supported
Hadoop 2.1.0-beta | Not supported | Not tested enough | Supported | Not supported
Hadoop 2.2.0 | Not supported | Not tested enough | Supported | Supported
Hadoop 2.x | Not supported | Not tested enough | Supported | Supported
We can always visit https://hbase.apache.org for more updated version compatibility between HBase and Hadoop.