Calculating the data size stored in HBase

In the case of any database, whether it is an RDBMS or a NoSQL store, we always need to work out the record size in order to plan the storage needed, or to do capacity planning. Even a few extra bytes per record can change the estimated storage size drastically. For example, suppose each record carries one extra byte and we have around one billion records; that single extra byte alone requires around 1 GB of disk space.

Now, let's consider this data size calculation in the case of HBase. Take a table named employee, with fields such as the row key, the column family, the column qualifier, and the value. In HBase, each value is stored fully qualified, which means every column value of a record is stored together with the row key we assign. So, let's now consider the space requirement.

As HBase stores data in the key-value format, let's make an approximation. We will consider the row key to be employee1.

Field | Size/type
Key size | Int (4 bytes)
Value size | Int (4 bytes)
Row size | Short (2 bytes)
Row data (row key) | Byte array
Column family size | Byte (1 byte)
Column family data | Byte array
Column (qualifier) | Byte array
Timestamp | Long (8 bytes)
Key type | Byte (1 byte)
Actual value | Byte array

Let's calculate the fixed-size part of the requirement: 4 + 4 + 2 + 1 + 8 + 1 = 20 bytes. For the remaining parts, we need to add the byte array sizes of the different values, so the total size is: Total = fixed size + variable size.
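As a quick sketch of this arithmetic, the following Python snippet computes the per-cell estimate; the row key, column family, qualifier, and value used here (employee1, cf, name, John) are illustrative choices only, and real HBase storage adds further block and index overhead on top of this figure.

# Rough per-cell (key-value) size estimate for HBase, based on the fixed
# overhead listed above: key size (4) + value size (4) + row size (2) +
# column family size (1) + timestamp (8) + key type (1) = 20 bytes.
FIXED_OVERHEAD = 20

def cell_size(row_key: bytes, col_family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate size of one cell: fixed overhead plus the byte array lengths."""
    variable = len(row_key) + len(col_family) + len(qualifier) + len(value)
    return FIXED_OVERHEAD + variable

# Illustrative example: row key 'employee1', family 'cf', qualifier 'name', value 'John'.
print(cell_size(b"employee1", b"cf", b"name", b"John"))  # 20 + 9 + 2 + 4 + 4 = 39 bytes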

If we assume the variable part (the row key, the column family name, the column qualifier, and the value) comes to roughly another 20 bytes, each cell takes about 40 bytes. With one billion records, the total size will be around 40 bytes * one billion = 40 billion bytes, which is around 40 GB. We can scale this calculation by the number of columns and rows in the HBase table. HBase tables also offer compression, using which we can reduce the storage requirement drastically.
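The same estimate can be scaled to a whole table as a back-of-the-envelope calculation; the 40 bytes per cell below is the approximation from the text, and the single-column assumption is only for illustration.

# Scale the per-cell estimate to a whole table: total = rows * columns * bytes per cell.
BYTES_PER_CELL = 40            # approximate fixed + variable size per cell
ROWS = 1_000_000_000           # one billion records
COLUMNS = 1                    # assume one column per record for this example

total_bytes = ROWS * COLUMNS * BYTES_PER_CELL
print(f"Estimated raw size: {total_bytes / 1e9:.0f} GB")  # around 40 GB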

We can implement compression while creating the table, as follows:

hbase> create 'tableWithCompression', {NAME => 'colFam', COMPRESSION => 'SNAPPY'}

This applies the Snappy compression algorithm to the records inserted into the HBase table. There are other compression algorithms we can use besides Snappy, such as LZF, LZO, and ZLIB.

Some benchmark figures for these algorithms follow, and the choice of algorithm should be made accordingly. Have a look at the following table:

Algorithm | I/O performance | Compression ratio achieved
ZLIB | Degraded | Best compression, around 45 percent to 50 percent
LZO | Around 4 percent to 6 percent | Around 41 percent to 45 percent
LZF | Around 20 percent to 22 percent | Around 38 percent to 40 percent
Snappy | Around 24 percent to 28 percent | Around 38 percent to 41 percent

Also, compression depends on the type of data present in the table, so the compression ratio should be evaluated accordingly. If we need more compression and can accept lower performance, we can always go with ZLIB, and if we need performance with average compression, we can choose Snappy or whichever algorithm suits the data in our table.
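As a rough illustration of this trade-off, the sketch below applies the midpoints of the ratio ranges quoted above to the ~40 GB raw estimate; it assumes the quoted percentage is the fraction of space saved, which may not match how your own data actually compresses.

# Apply the approximate compression figures to the ~40 GB raw estimate,
# treating each percentage as the rough fraction of space saved (an assumption).
RAW_GB = 40
savings = {"ZLIB": 0.48, "LZO": 0.43, "LZF": 0.39, "Snappy": 0.40}  # midpoints of the quoted ranges

for algo, saved in savings.items():
    print(f"{algo:7s}: about {RAW_GB * (1 - saved):.0f} GB on disk")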
