Schema designing

HBase does not support joins of any kind, and its only index is the one on the row key. Instead, HBase schema design relies on denormalization with nested entities. A nested entity is simply a column whose name is the unique identifier of the nested entity and whose value is the entire record serialized together. Since HBase allows columns to be defined dynamically, such columns can be added at write time without any schema change, which makes this a scalable substitute for joins. Additionally, with column families, large rows can be partitioned into smaller chunks of data that can be read from disk individually.

Table design must be done in the initial phase. We can add or remove columns on the fly, but the row key and the column families of a table need to be designed during the initial schema design phase.

Some points that we might consider while designing a schema are as follows:

  • The row key is the most important aspect of schema design. The row key is the only indexed field, and rows are stored sorted by it, so fetching data by row key is fast and its cost stays predictable as the table grows.
  • Create a composite key by combining multiple values, which can also be used to express a relation between tables.
  • In HBase, schema design revolves around application design.
  • HBase minimizes I/O by storing the data of a row and column family together.
  • Like the row key, column names should be chosen carefully, as they are stored along with every value. So, the design should keep column names short; instead of a long column name such as employee_salary, we can use a name such as esal or similar.
  • We should use row atomicity as a design tool. HBase supports atomicity only at the row level, which means that an update spanning two tables cannot be applied atomically in one go. If we need atomicity across two tables, merge one of the tables into the other as nested entities.
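To get a feel for why short column names matter, the arithmetic can be sketched as follows. This is a rough illustration only: the cell count is a hypothetical figure, the class and method names are made up for this example, and the calculation ignores compression and block encoding, which reduce the real cost.

```java
import java.nio.charset.StandardCharsets;

class ColumnNameOverheadSketch {
    // Bytes spent on the column name across all cells, since the name is stored with every value.
    // Assumption: no compression or encoding is applied.
    static long qualifierBytes(String qualifier, long cellCount) {
        return (long) qualifier.getBytes(StandardCharsets.UTF_8).length * cellCount;
    }

    public static void main(String[] args) {
        long cells = 1_000_000_000L; // hypothetical: one salary cell in each of a billion rows
        long longName = qualifierBytes("employee_salary", cells);
        long shortName = qualifierBytes("esal", cells);
        System.out.println("employee_salary: " + longName + " bytes");
        System.out.println("esal:            " + shortName + " bytes");
        System.out.println("saved:           " + (longName - shortName) + " bytes");
    }
}
```

The saving is 11 bytes per cell, which only becomes significant because it is multiplied by the number of cells in the table.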

Now, let's discuss it further.

As we have already discussed, the HBase data model is quite different from relational database systems. So, let's ask some questions that will help us design it in a better way:

  • What should the row key structure be? What should it contain? Which fields of which tables should it be made up of?
  • How many versions should be kept for each cell?
  • What information should be stored in the cell?
  • How many column families should there be and how should they be named?
  • Although columns can be added to column families on the fly, even at operation time, is it better to decide in advance which columns there will be and what they should be named?
  • How many columns are there for each column family?

Row key design is one of the most important design considerations, as it determines how data is read from and written to an HBase table. So, to define and design it, we should consider the following factors:

  • The number of tables in the design.
  • Indexing is done on the row key only.
  • A table is sorted by row key. Each region is defined by a start and an end row key, and it stores the sorted list of rows in that range.
  • Everything is stored and sorted as a byte array. Internally, an HBase table has no other data types such as string, integer, long, and so on.
  • HBase guarantees atomicity only at the row level, and multirow transactions are not supported; these might be introduced in future versions of HBase (above 0.98).
  • Column families are defined at the time of table creation.
  • Column creation is dynamic; columns can be added or defined at write time too.
  • An HFile is sorted by row key, qualifier, and timestamp.
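Because keys are compared as raw bytes, numeric parts of a row key do not sort numerically unless they are padded to a fixed width. The following is a minimal stand-alone sketch of that byte-wise ordering in plain Java; it does not use the HBase client, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowKeyOrderSketch {
    // Compare two keys byte by byte, treating each byte as unsigned;
    // if one key is a prefix of the other, the shorter key sorts first.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Return the keys in the order a byte-sorted table would hold them
    static List<String> sortAsRowKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((k1, k2) -> compareKeys(
                k1.getBytes(StandardCharsets.UTF_8),
                k2.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    public static void main(String[] args) {
        // Unpadded numbers: "user10" sorts before "user2"
        System.out.println(sortAsRowKeys(Arrays.asList("user2", "user10", "user1")));
        // prints [user1, user10, user2]

        // Zero-padded numbers sort in numeric order
        System.out.println(sortAsRowKeys(Arrays.asList("user02", "user10", "user01")));
        // prints [user01, user02, user10]
    }
}
```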

Let's consider a scenario-based schema design now.

Suppose we have a scenario in which we need to design a table on a student-course relationship; we will have the following relation types as shown in the following figure:

[Figure: Schema designing]

Now, let's consider some use cases of table requirement design, such as how it is represented in RDBMS and HBase.

We commonly encounter this scenario when designing a Student-Course relationship. In an RDBMS, we know that it can be represented as follows:

Student                  Studcourse_Relation   Courses
-----------------------  --------------------  -------------------------
Stud_ID (Primary Key)    Stud_ID               Course_ID (Primary Key)
Stud_Name                Course_ID             Course_Title
Stud_Age                 Type                  Note
Stud_Sex                                       Instructor_ID
Stud_Address

In HBase, there are no relational constraints, so the same thing can be implemented in many ways. Here is one way to implement it.

The Student table might look like the following:

Row_Keys   Column_Family (Student_Details_CF)
           Column (Student_Details)         Column (CourseID)
---------  -------------------------------  ------------------
Stud_ID    Student_Details:Stud_Name        CourseID:course_ID
           Student_Details:Stud_Age
           Student_Details:Stud_Sex
           Student_Details:Stud_Address

The Course table looks like the following:

Row_Keys    Column_Family (Course_Details_CF)
            Column (Course_Details)               Column (StudentID)
----------  ------------------------------------  ------------------
Course_ID   Course_Details:Course_Title           StudentID:Stud_ID
            Course_Details:Course_Note
            Course_Details:Course_Instructor_ID

So, here we can set the relation on the basis of Stud_ID and Course_ID. For example, we can get a student's ID and details from the Student table, and then find that student's courses by looking up the same Stud_ID in the Course table.

The second use case we will consider in this situation is how a user performs some task. So, we will see how to design a table and keep track of user activities.

Since we are recording a user activity, we will design the row key accordingly, which will contain a combination of userID, timestamp, and eventID.

User_Events                                           Event
----------------------------------------------------  ---------
RowKey (userID + timestamp + eventID) (Primary Key)   EventID
User_Name                                             EventName
Note

So, the first table, User_Events, uses a row key that combines the user ID, the current timestamp, and the event ID, and it stores the user name and a note about the user or activity. The second table, Event, holds the event details, that is, the set of events such as write, read, delete, update, edit, and so on. Using these two tables, we can fetch all the operations performed by a user.

Types of table designs

We can have two types of designs while considering a table. They are as follows:

  • Short and Wide: This design pattern can be considered in the following cases:
    • There is a large number of columns
    • There is a small number of rows
  • Tall-Thin: This design pattern can be considered in the following cases:
    • There is a small number of columns
    • There is a large number of rows

Now, let's consider a use case of blogging data, where blog entries need to be stored in HBase. In this scenario, a user writes blog entries, and each entry is saved to an HBase cell.

So, let's consider the scenario of a blogging website such as http://blogspot.com, http://wordpress.com, or any other blogging website where a user logs in, enters the content, and posts it. Internally, the content is stored in a table either as a column value or as a text file on a file system, and then it's linked to the database.

There are two options that we can consider when designing the HBase table, which are as follows:

  • Each row represents a single user (one row per user, with the blog entries as columns)
  • Each row represents a single blog entry, in which case we need to read multiple rows to read the blogs of a single user

Now, let's consider the following cases.

In a short-and-wide table design, we have all blogs stored in a single row and column family in an HBase table. Each newly created blog is stored in a dedicated column.

We will refer to the row ID here as User ID and to the column family as BlogEntriesCF; we will represent it as BECF.

Columns are added on the fly whenever a user creates a blog entry. Each column name is a fixed string with a timestamp appended, such as BEntry plus TimeStamp, which we will represent as BT.

RowKey (User_ID)   BECF (one BECF:BT column per blog entry)
-----------------  ----------------------------------------
WriterA            HBaseEntry, HadoopEntry
WriterB            HadoopEntry, MongoEntry, HiveEntry
WriterC            SqoopEntry, HBaseEntry
Writer(N)          Nth

This table grows horizontally, as new columns are added on the fly whenever content is added. In this case, the table grows towards the right-hand side more quickly than downwards.

So, here we can read a whole row to get entries by a specific writer.

In a Tall-Thin table design, the table grows downwards more quickly than towards the right-hand side. Each new blog entry is added as a new row (whose key is the user ID with a timestamp attached), and the entry itself is stored in a fixed column family and column. We can visualize this scenario as follows:

RowKey (UserID+TimeStamp)   BlogEntriesCF:Entries
--------------------------  ---------------------
WriterATimeStamp1           HBaseEntry
WriterBTimeStamp2           HadoopEntry
WriterATimeStamp3           HadoopEntry
WriterCTimeStampN           EntryN
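The difference between the two layouts can be mimicked in plain Java with sorted maps standing in for a sorted HBase table. This is only an in-memory sketch under stated assumptions: it does not use the HBase client, the class and method names are hypothetical, and a ":" separator is inserted between the user ID and the timestamp for readability.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class TableShapeSketch {
    // Short-wide: row key = user ID, one column (BEntry + timestamp) per blog entry
    final TreeMap<String, TreeMap<String, String>> shortWide = new TreeMap<>();
    // Tall-thin: row key = user ID + ":" + timestamp, one fixed column per row
    final TreeMap<String, String> tallThin = new TreeMap<>();

    // Store the same blog entry in both layouts
    void post(String user, String timestamp, String entry) {
        shortWide.computeIfAbsent(user, u -> new TreeMap<>()).put("BEntry" + timestamp, entry);
        tallThin.put(user + ":" + timestamp, entry);
    }

    // Short-wide read: fetch the single row of the user
    SortedMap<String, String> entriesShortWide(String user) {
        return shortWide.getOrDefault(user, new TreeMap<>());
    }

    // Tall-thin read: range over all row keys that start with "user:"
    SortedMap<String, String> entriesTallThin(String user) {
        return tallThin.subMap(user + ":", user + ";"); // ';' is the character right after ':'
    }

    public static void main(String[] args) {
        TableShapeSketch t = new TableShapeSketch();
        t.post("WriterA", "0001", "HBaseEntry");
        t.post("WriterB", "0002", "HadoopEntry");
        t.post("WriterA", "0003", "HadoopEntry");
        System.out.println("rows (short-wide): " + t.shortWide.size());
        System.out.println("rows (tall-thin):  " + t.tallThin.size());
        System.out.println(t.entriesShortWide("WriterA"));
        System.out.println(t.entriesTallThin("WriterA"));
    }
}
```

Both reads return the same two entries for WriterA; the short-wide layout holds them in one row, while the tall-thin layout holds them in adjacent rows that share a key prefix.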

Benefits of Short Wide and Tall-Thin design patterns

Now, let's see the benefits of both the design patterns:

Tall-Thin                                              Short Wide
-----------------------------------------------------  -----------------------------------------------------
If we query using a row ID, it will skip rows faster   This has to be queried using a column name, which
                                                       will not skip rows or store files
Not good for atomicity                                 Better for atomicity
Better for scalability                                 Not as good for scalability as Tall-Thin is

Tip

It's best to consider the Tall-Thin design, as it helps in faster data retrieval. Because the user ID is the prefix of the row key, all the rows of a user are sorted next to each other, so a short range scan over a single column family returns that user's blog entries without traversing unrelated rows. Also, since HBase splits take place on row boundaries and the rows of a user are adjacent, the data related to a specific user can usually be found at one region server.

We will talk more about use-case-based schema design and coding in Chapter 10, HBase Use Cases. Now, here are some more tips about row key considerations:

  • Avoid monotonically increasing row keys such as sequences or timestamps. All new writes then land on a single region, and during heavy writes this can also stall reads that are served from that region.
  • Always keep the names of column families and columns short, and keep row keys compact. Every stored cell carries its row key, column family name, and column name, so longer names add to the storage size. So, for example, instead of StudentNameColumn, we can name the column SNC, or something of this sort.
  • Row keys can be stored in their binary representation rather than as strings, as this requires less storage space.
  • If we need to read the newest rows first, we can append a reverse timestamp to the row key so that a normal forward scan returns them in that order, for example, row key + (maxTimestamp - current).

Designing an efficient row key results in faster and better-optimized scanning, reading, and writing. This is why we need to consider a good row key, column family, and column name design.
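The reverse-timestamp tip can be illustrated with a small plain-Java sketch. It assumes that maxTimestamp is Long.MAX_VALUE and that the key is kept as a zero-padded string for readability; the class name, method name, and ":" separator are hypothetical, and a sorted set stands in for the sorted table.

```java
import java.util.TreeSet;

class ReverseTimestampKeySketch {
    // Row key = prefix + ":" + (Long.MAX_VALUE - timestamp), zero-padded to 19 digits
    // so that string order matches numeric order. Assumes a non-negative timestamp.
    static String reverseKey(String prefix, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return prefix + ":" + String.format("%019d", reversed);
    }

    public static void main(String[] args) {
        TreeSet<String> rows = new TreeSet<>(); // stands in for the sorted table
        rows.add(reverseKey("WriterA", 1000L));
        rows.add(reverseKey("WriterA", 3000L));
        rows.add(reverseKey("WriterA", 2000L));

        // The first row in sort order is the one with the newest timestamp (3000)
        System.out.println(rows.first());
    }
}
```

In line with the earlier tip about binary representations, a fixed-width 8-byte encoding of the reversed value would take less space than the 19-character string used here.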

Composite key designing

A composite key is a row key made up of several fields joined together. For example, it can take the form UserID + SeparatorCharacter + DateString + SeparatorCharacter + UserSessionID.

We can set the start and stop row keys of an HBase scan to read a specific range of data. Let's see some scenarios in which we need to read data, and how to set the start and stop keys of the scan for each:

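  • To read all the sessions for a given user, we can specify the start row as userId (the scan then runs from that key onwards, so a stop row is needed to end it after that user's rows):
    hbase> scan 'tableToScan', {STARTROW => 'userId'}

    In Java, we can specify it as:

    // table is an open HTable; Bytes is org.apache.hadoop.hbase.util.Bytes
    Scan s = new Scan(Bytes.toBytes(userId));  // the scan starts at the userId row key
    ResultScanner scanner = table.getScanner(s);

    The same approach applies to the following cases.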

  • To find a specific session of a user, we can specify the full row key as the start row key
  • To find all the sessions of a user in a specific date range, we can specify UserID + SeparatorCharacter + DateString as the start and end row keys

Likewise, we can do different combinations to get specific ranges and the required data.
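The start and stop key combinations described above can be tried out without a cluster. The following plain-Java sketch uses a sorted map in place of an HBase table and treats the start key as inclusive and the stop key as exclusive. The class, the method names, the "|" separator, and the sample data are all assumptions made for illustration, and the stop-key helper assumes that the last character of the prefix can be incremented.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class CompositeKeyScanSketch {
    static final char SEP = '|'; // hypothetical separator; must not occur inside the fields

    // UserID + separator + DateString + separator + UserSessionID
    static String rowKey(String userId, String date, String sessionId) {
        return userId + SEP + date + SEP + sessionId;
    }

    // Exclusive stop key for a prefix scan: the prefix with its last character incremented
    static String stopKeyFor(String prefix) {
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    // Range read over the sorted table: start inclusive, stop exclusive
    static SortedMap<String, String> scan(TreeMap<String, String> table, String start, String stop) {
        return table.subMap(start, stop);
    }

    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("user1", "20140101", "s1"), "session data 1");
        table.put(rowKey("user1", "20140215", "s2"), "session data 2");
        table.put(rowKey("user1", "20140301", "s3"), "session data 3");
        table.put(rowKey("user10", "20140101", "s4"), "session data 4");
        return table;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = sampleTable();

        // All sessions of user1: every key that starts with "user1|"
        String prefix = "user1" + SEP;
        System.out.println(scan(table, prefix, stopKeyFor(prefix)).keySet());

        // Sessions of user1 from 20140101 up to, but not including, 20140301
        System.out.println(scan(table, prefix + "20140101", prefix + "20140301").keySet());
    }
}
```

Including the separator in the prefix is what keeps user10 out of the results for user1.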

Real-time use case of schema in an HBase table

Here, we will list some use cases in the industry that use HBase as a backend to manage their applications and infrastructure.

Many companies use HBase successfully in their production environments, such as Trend Micro, eBay, Yahoo!, Facebook, and many other analytics-based companies. Examples include Facebook, which maintains the communications of Facebook messages in HBase, Trend Micro for security purposes, Nielsen for measurement purposes, Jive Software for enterprise collaboration, OCLC for digital media, Ancestry.com Inc. for DNA matching, and Box Inc. for machine data analysis.

Schema change operations

Schema and table management of an HBase table can be done through the alter command. Using this command, we can perform the following operations:

  • Modify column family schema
  • Add column families
  • Remove column families
  • Change table-related settings such as maximum file size, MemStore flush size, read-only, and so on

In 0.94 and later versions of HBase, we rename a table using the snapshot feature, as follows:

hbase> disable 'TableToRename'
hbase> snapshot 'TableToRename', 'NewTable'
hbase> clone_snapshot 'NewTable', 'newTableToRename'
hbase> delete_snapshot 'NewTable'
hbase> drop 'TableToRename'

Now, let's discuss some schema-change-related operations.

We can change the number of versions that a column family keeps. Suppose a column family keeps the default of 3 versions and we later realize that we need 4; we can change the versioning as follows:

hbase> alter 'tableToAlter', {NAME => 'colFam', VERSIONS => 4}

To remove or delete an existing column family, we can do it as follows:

hbase> alter 'tableToAlter', {NAME => 'colFam',METHOD => 'delete'}

If we need to limit the maximum file size to 256 MB, we can use the following command. Note that MAX_FILESIZE is a table-level setting, so it is passed without a column family NAME:

hbase> alter 'tableToAlter', METHOD => 'table_att', MAX_FILESIZE => '268435456'

If we need to add a new column family to an HBase table, we disable the table first and then pass the name of the new column family to the alter command; the existing column families are left untouched. Suppose we have a table with colFam1, and we need to add colFam2; we can do it as follows:

hbase> disable 'tableToAlter'
hbase> alter 'tableToAlter', {NAME => 'colFam2'}
hbase> enable 'tableToAlter'

We have the option of performing multiple operations in a single command. Suppose we need to change the versions of two column families and delete a third one; we can do it as follows:

hbase> alter 'tableToAlter', {NAME => 'colFam1', VERSIONS => 2}, {NAME => 'colFam2', VERSIONS => 5}, {NAME => 'toDeleteColFam', METHOD => 'delete'}

Likewise, we can perform many operations in a single line of alter command. We can also see the status of the alter command, as follows:

hbase> alter_status 'tableToAlter'

We will talk more about the shell command in Chapter 6, HBase Cluster Maintenance and Troubleshooting, with more options.

Tip

We should always keep in mind that changes do not take immediate effect; they take place at the next major compaction. Until then, the old definition remains active.

There exists a project on GitHub using which we can easily create XML-based schema. For more information on the project, visit:
