HBase does not support joins of any kind; the only index it provides is on the row key. HBase schema design therefore favors denormalization with nested entities. A nested entity is simply a column whose name is the unique identifier of the nested entity and whose value is the entire record serialized together. Since HBase allows columns to be defined dynamically, this causes no problem, and it replaces joins with a pattern that scales well. Additionally, with column families, large rows can be partitioned into smaller chunks of data that can be read individually from disk.
The schema or table design must be done in the initial phase. We can add or remove columns on the fly, but the RowKey of the table and its column families need to be designed during the initial schema design phase.
Some points that we might consider while designing a schema are as follows:
Instead of a long name such as employee_salary, we can have a short name such as esal or similar.
Now, let's discuss it further.
As we have already discussed, the HBase data model is quite different from relational database systems. So, let's ask some questions that will help us design it in a better way:
Row key design is one of the most important design considerations, because it governs how data is read from and written to an HBase table. So, to define and design it, we should consider different factors, which are as follows:
Let's consider a scenario-based schema design now.
Suppose we have a scenario in which we need to design a table on a student-course relationship; we will have the following relation types as shown in the following figure:
Now, let's consider some use cases of table requirement design, such as how it is represented in RDBMS and HBase.
We generally have this scenario while designing a Student-Course relationship; so in RDBMS, we know that this can be represented as follows:
Student | Studcourse_Relation | Courses |
---|---|---|
Stud_ID (Primary Key), Stud_Name, Stud_Age, Stud_Sex, Stud_Address | Stud_ID, Course_ID, Type | Course_ID (Primary Key), Course_Title, Note, Instructor_ID |
In HBase, there are no relational constraints, so the same thing can be implemented in many ways. Here is one of them:
So, the Student table and its details might look like the following:
Row_Keys | Column_Family (Student_Details_CF) | |
---|---|---|
 | Column (Student_Details) | Column (courseID) |
Stud_ID | Student_Details:Stud_Name, Student_Details:Stud_Age, Student_Details:Stud_Sex, Student_Details:Stud_Address | CourseID:course_ID |
The Course table looks like the following:
Row_Keys | Column_Family (Course_Details_CF) | |
---|---|---|
 | Column (Course_Details) | Column (StudentID) |
Course_ID | Course_Details:Course_Title, Course_Details:Course_Note, Course_Details:Course_Instructor_ID | StudentID:Stud_ID |
So, here we can set the relation on the basis of Student_ID and Course_ID; for example, we get a student's ID and details from the Student table and then fetch the corresponding courses from the Course table, using the Student_ID stored there.
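Because HBase has no joins, a relation like this one is resolved in application code: read the student row, collect the course IDs stored in it, and then fetch each course row by key. The following is a minimal in-memory sketch of that lookup order. It uses plain Java maps in place of real HBase tables, it assumes that each enrolled course is stored as a column named CourseID:<course_ID>, and the class name, method names, and sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for two HBase tables: row key -> (column -> value).
// This is NOT the HBase client API; it only illustrates the lookup order.
final class StudentCourseLookup {
    final Map<String, Map<String, String>> studentTable = new TreeMap<>();
    final Map<String, Map<String, String>> courseTable = new TreeMap<>();

    // Stores one cell, creating the row on first use (similar in spirit to a Put).
    static void put(Map<String, Map<String, String>> table,
                    String rowKey, String column, String value) {
        table.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Client-side "join": student row -> course IDs -> course rows.
    List<String> courseTitlesFor(String studId) {
        List<String> titles = new ArrayList<>();
        Map<String, String> studentRow = studentTable.get(studId);
        if (studentRow == null) {
            return titles; // unknown student: nothing to resolve
        }
        for (String column : studentRow.keySet()) {
            // Assumption: one column per enrolled course, named CourseID:<course_ID>.
            if (column.startsWith("CourseID:")) {
                String courseId = column.substring("CourseID:".length());
                Map<String, String> courseRow = courseTable.get(courseId);
                if (courseRow != null) {
                    titles.add(courseRow.get("Course_Details:Course_Title"));
                }
            }
        }
        return titles;
    }

    // Builds a small sample data set (all values are made up).
    static StudentCourseLookup sample() {
        StudentCourseLookup db = new StudentCourseLookup();
        put(db.studentTable, "S1", "Student_Details:Stud_Name", "Asha");
        put(db.studentTable, "S1", "CourseID:C101", "");
        put(db.studentTable, "S1", "CourseID:C205", "");
        put(db.courseTable, "C101", "Course_Details:Course_Title", "Databases");
        put(db.courseTable, "C101", "StudentID:S1", "");
        put(db.courseTable, "C205", "Course_Details:Course_Title", "Distributed Systems");
        put(db.courseTable, "C205", "StudentID:S1", "");
        return db;
    }
}
```

Calling StudentCourseLookup.sample().courseTitlesFor("S1") walks the student's course columns in sorted order and returns the matching course titles. With a real cluster, each map lookup would become a read by row key against the corresponding table.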
The second use case we will consider in this situation is how a user performs some task. So, we will see how to design a table and keep track of user activities.
Since we are recording a user activity, we will design the row key accordingly, which will contain a combination of userID, timestamp, and eventID.
User_Events | Event |
---|---|
RowKey (userID + timestamp + eventID) (Primary Key), User_Name, Note | EventID, EventName |
So, the first table, User_Events, uses a row key that combines the user ID, the current timestamp, and the event ID, and it stores the user name and a note about the user or activity. The second table, Event, holds the event details, that is, the set of events such as write, read, delete, update, edit, and so on. Using these two tables, we can fetch all the operations performed by a user.
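One way to picture how the two tables work together is the following in-memory sketch. It builds the userID + timestamp + eventID row key and then lists a user's activity by walking the sorted keys and looking up each event name. The "|" separator, the zero-padded 13-digit timestamp (chosen so that string order matches time order), and all class and method names are assumptions made for illustration; sorted Java maps stand in for the HBase tables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only; not the HBase client API.
final class UserEventKeys {
    // Composite row key. The fixed-width timestamp keeps lexicographic
    // order equal to chronological order within one user.
    static String rowKey(String userId, long timestampMillis, String eventId) {
        return userId + "|" + String.format("%013d", timestampMillis) + "|" + eventId;
    }

    // Returns "timestamp eventName" for every event of one user, oldest first.
    // userEvents models User_Events (row key -> note);
    // eventNames models Event (EventID -> EventName).
    static List<String> activityOf(String userId,
                                   TreeMap<String, String> userEvents,
                                   Map<String, String> eventNames) {
        List<String> out = new ArrayList<>();
        String prefix = userId + "|";
        // Start at the first key >= prefix and stop once the prefix no longer matches.
        for (String key : userEvents.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break;
            }
            String[] parts = key.split("\\|");
            out.add(Long.parseLong(parts[1]) + " " + eventNames.get(parts[2]));
        }
        return out;
    }
}
```

The sketch assumes that user IDs and event IDs never contain the separator character; a real design would need to guarantee that or use fixed-width fields.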
We can have two types of designs while considering a table. They are as follows:
Now, let's consider a use case of blogging data, in which blog entries need to be stored in HBase. In this scenario, a user writes blog entries and each entry is saved to an HBase cell.
So, let's consider the scenario of a blogging website such as http://blogspot.com, http://wordpress.com, or any other blogging website where a user logs in, enters the content, and posts it. Internally, the content is stored in a table either as a column value or as a text file on a file system, and then it's linked to the database.
There are two conditions that we can consider when designing the HBase table, which are as follows:
Now, let's consider the following cases.
In a short-and-wide table design, we have all blogs stored in a single row and column family in an HBase table. Each newly created blog is stored in a dedicated column.
We will refer to the row ID here as User ID and to the column family as BlogEntriesCF; we will represent it as BECF.
Columns are added on the fly whenever a user creates a blog entry, and as columns, we will have a fixed string plus a timestamp attached to it, such as BEntry plus TimeStamp, which we will represent as BT.
RowKey (User_ID) | BECF:BT | BECF:BT | BECF:BT | BECF:BT | BECF:BT | BECF:BT | … |
---|---|---|---|---|---|---|---|
WriterA | HBaseEntry | HadoopEntry | … | … | … | … | … |
WriterB | HadoopEntry | MongoEntry | … | … | … | HiveEntry | … |
WriterC | … | … | … | SqoopEntry | … | HBaseEntry | … |
… | … | … | … | … | … | … | … |
Writer(N) | … | … | … | … | … | Nth | … |
This table grows horizontally with new columns that are added on the fly as the contents are added. In this case, the table grows towards the right-hand side, and not downwards, with new columns being added more quickly.
So, here we can read a whole row to get entries by a specific writer.
In a Tall-Thin table design, the table grows downwards more quickly than towards the right-hand side. Each new blog entry adds a new row (keyed by the user ID with a timestamp attached), and the entry itself is stored in a fixed column family and column. We can visualize this scenario as follows:
RowKey (UserID+TimeStamp) | BlogEntriesCF:Entries |
---|---|
WriterATimeStamp1 | HBaseEntry |
WriterBTimeStamp2 | HadoopEntry |
WriterATimeStamp3 | HadoopEntry |
… | … |
… | … |
… | … |
WriterCTimeStampN | EntryN |
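To make the difference between the two layouts concrete, here is a small in-memory sketch that stores the same blog entries both ways. Sorted Java maps stand in for HBase tables, the timestamp is zero-padded to 13 digits so that keys sort chronologically, and the class and method names are hypothetical; none of this is HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model of the two blog table layouts.
final class BlogLayouts {
    // Short-and-wide: row key = user ID, one column (BEntry + timestamp) per entry.
    final TreeMap<String, TreeMap<String, String>> shortWide = new TreeMap<>();
    // Tall-thin: row key = user ID + timestamp, one fixed column holding the entry.
    final TreeMap<String, String> tallThin = new TreeMap<>();

    // Writes the same entry into both layouts.
    void addEntry(String userId, long timestamp, String entry) {
        String ts = String.format("%013d", timestamp);
        shortWide.computeIfAbsent(userId, k -> new TreeMap<>()).put("BEntry" + ts, entry);
        tallThin.put(userId + ts, entry);
    }

    // Short-and-wide read: a single row lookup returns every entry of the writer.
    List<String> entriesShortWide(String userId) {
        TreeMap<String, String> row = shortWide.get(userId);
        if (row == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(row.values());
    }

    // Tall-thin read: walk the consecutive rows whose key starts with the user ID.
    // Note: without a separator, a user ID that is a prefix of another would also match.
    List<String> entriesTallThin(String userId) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : tallThin.tailMap(userId, true).entrySet()) {
            if (!e.getKey().startsWith(userId)) {
                break;
            }
            out.add(e.getValue());
        }
        return out;
    }
}
```

After adding three entries for two writers, shortWide holds two rows with several columns each, while tallThin holds three rows with one value each; both read methods return the same entries for a given writer, in timestamp order.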
Now, let's see the benefits of both the design patterns:
Tall-Thin | Short Wide |
---|---|
If we query using a row ID, it will skip rows faster | This has to be queried using a column name that will not skip rows or store files |
Not good for atomicity | Better for atomicity |
Better for scalability | Not as good for scalability as Tall-Thin is |
It's best to consider the Tall-Thin design as we know it will help in faster data retrieval by enabling us to read the single column family for user blog entries at once instead of traversing through many rows. Also, since HBase splits take place on rows, data related to a specific user can be found at one region server.
We will talk more about use-case-based schema design and coding in Chapter 10, HBase Use Cases. Now, here are some more tips about row key considerations:
Designing an efficient row key results in faster, optimized scanning, reading, and writing. This is why we need to design the row key, column families, and column names carefully.
Composite keys can be created by combining various fields to form a row key. This can be done as UserID + Separator + DateString + Separator + UserSessionID.
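As a rough illustration of such a composite key, the sketch below joins the three fields with a separator and splits them back apart. The "#" separator and the class and method names are hypothetical choices, and the only safeguard shown is rejecting fields that contain the separator, since otherwise the key could not be parsed unambiguously.

```java
import java.util.regex.Pattern;

// Illustrative helper for a UserID + Separator + DateString + Separator + UserSessionID key.
final class CompositeRowKey {
    static final String SEPARATOR = "#"; // hypothetical choice of separator

    // Joins the fields into one row key string.
    static String compose(String userId, String dateString, String userSessionId) {
        for (String field : new String[] {userId, dateString, userSessionId}) {
            if (field.contains(SEPARATOR)) {
                throw new IllegalArgumentException("field contains separator: " + field);
            }
        }
        return String.join(SEPARATOR, userId, dateString, userSessionId);
    }

    // Splits a row key back into {userId, dateString, userSessionId}.
    static String[] parse(String rowKey) {
        // Limit -1 keeps trailing empty fields so that the count check is reliable.
        String[] parts = rowKey.split(Pattern.quote(SEPARATOR), -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 fields in row key: " + rowKey);
        }
        return parts;
    }
}
```

Because the user ID comes first, all keys of one user sort next to each other, and the date string then orders them within that user, provided the date uses a sortable format such as yyyyMMdd.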
We can set start and stop row keys on an HBase scan to read only a specific range of data. Let's see a scenario where we need to read the data and how to set the start and stop keys in the scan:
To read the data starting from a given userId, set it as the start row:
hbase> scan 'tableToScan', {STARTROW => 'userId'}
In Java, we can specify it as:
Scan s = new Scan(Bytes.toBytes(userID)); // start row = userID
ResultScanner scanner = table.getScanner(s); // table is an open table handle
It's the same in the following cases.
Likewise, we can do different combinations to get specific ranges and the required data.
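The effect of start and stop keys can be modeled with a sorted map, because HBase also keeps rows sorted by key and treats the stop row as exclusive. The sketch below is such a model, not the HBase client API: the class and method names are hypothetical, and it assumes ASCII row keys so that Java string ordering matches byte ordering. It also shows a common way to derive a stop key for a prefix scan, by incrementing the last character of the prefix.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative model of range scans over sorted row keys.
final class ScanRanges {
    // Smallest key that sorts after every key starting with the prefix.
    // Assumes a non-empty prefix whose last character can be incremented.
    static String stopKeyForPrefix(String prefix) {
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Rows in [startRow, stopRow): start inclusive, stop exclusive.
    static NavigableMap<String, String> scan(TreeMap<String, String> table,
                                             String startRow, String stopRow) {
        return table.subMap(startRow, true, stopRow, false);
    }

    // Rows from startRow to the end of the table (start row only, no stop row).
    static NavigableMap<String, String> scanFrom(TreeMap<String, String> table,
                                                 String startRow) {
        return table.tailMap(startRow, true);
    }
}
```

For example, scanning from "user1|" to stopKeyForPrefix("user1|") returns only the rows of user1 and skips the rows of user10, because "user10|…" sorts before "user1|…" and therefore falls outside the range.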
Here, we will list some use cases in the industry that use HBase as a backend to manage their applications and infrastructure.
There are many companies that use HBase in their production environment successfully, such as Trend Micro, eBay, Yahoo!, Facebook, and many other analytical-based companies. Some of the examples where this is being used include communications in Facebook messages, which are maintained in HBase, Trend Micro for security purposes, Nielsen for measurement purposes, Jive Software for enterprise collaboration, OCLC for digital media, Ancestry.com Inc. for DNA matching, and Box Inc. for machine data analysis.
Schema and table management of an HBase table can be done through the alter command. Using this command, we can perform the following operations:
In 0.94 and later versions of HBase, we rename a table using the snapshot feature, as follows:
hbase> disable 'TableToRename'
hbase> snapshot 'TableToRename', 'NewTable'
hbase> clone_snapshot 'NewTable', 'newTableToRename'
hbase> delete_snapshot 'NewTable'
hbase> drop 'TableToRename'
Now, let's discuss some schema-change-related operations.
We can change the number of versions kept for a column family. Suppose a column family keeps the default of 3 versions, and later we realize that we need to keep 4 versions; we can change the versioning as follows:
hbase> alter 'tableToAlter', {NAME => 'ColFam',VERSIONS => 4}
To remove or delete an existing column family, we can do it as follows:
hbase> alter 'tableToAlter', {NAME => 'colFam',METHOD => 'delete'}
If we need to limit the maximum store file size of the table's regions to 256 MB (268435456 bytes), we can set the table attribute MAX_FILESIZE with the following command:
hbase> alter 'tableToAlter', METHOD => 'table_att', MAX_FILESIZE => '268435456'
If we need to add a new column family to an HBase table, we disable the table first and then pass the name of the new column family to alter; the existing column families are left untouched. Suppose we have a table with colFam1, and we need to add colFam2; we can do it as follows:
hbase> disable 'tableToAlter'
hbase> alter 'tableToAlter', {NAME => 'colFam2'}
hbase> enable 'tableToAlter'
We have the option of performing multiple operations in a single command. Suppose we need to change the versions of two column families and delete a third one; we can do it as follows:
hbase> alter 'tableToAlter', {NAME => 'colFam1', VERSIONS => 2}, {NAME => 'colFam2', VERSIONS => 5}, {NAME => 'toDeleteColFam', METHOD => 'delete'}
Likewise, we can perform many operations in a single alter command. We can also see the status of the alter command, as follows:
hbase> alter_status 'tableToAlter'
We will talk more about the shell commands, with more options, in Chapter 6, HBase Cluster Maintenance and Troubleshooting.
There is also a project on GitHub with which we can easily create an XML-based schema.