HBase housekeeping

As data is added to HBase, it is flushed to the store as immutable files. Each region contains one store per column family, and each store holds these files, which are ordered by row key. Because the files are immutable, new data always produces additional files instead of updating an existing one, so the number of files keeps growing. With many files, I/O becomes slower, and reads and writes lag as a result. To overcome this problem, HBase uses compaction; let's look into it now. Refer to the following figure for a better understanding:

Figure: HBase housekeeping

Compaction

As the name suggests, compaction makes files more compact and, hence, more efficient to read. Whenever a MemStore fills up, it is flushed and a new HFile is created, so the number of HFiles grows as data is written, and with it the I/O overhead. To minimize this, HFiles are periodically merged into a single HFile. If these files are not merged in time, there will be a huge overhead on the system. Compaction is nothing but the merging of two or more HFiles using an N-way merge, which is possible because each HFile is already sorted. Once the files are merged, the new file is loaded and the older files are discarded.
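The merge step can be illustrated with a minimal Python sketch. It is a toy model, not HBase code: each HFile is assumed to be a list of (row key, value) pairs already sorted by row key, and the function name compact is hypothetical.

```python
import heapq

def compact(hfiles):
    """Merge several row-key-sorted HFiles into one sorted list.

    Each HFile is modeled as a list of (row_key, value) tuples;
    this representation is an assumption made for illustration.
    """
    # heapq.merge performs an N-way merge over inputs that are already sorted
    return list(heapq.merge(*hfiles, key=lambda cell: cell[0]))

hfile_a = [("row1", "a"), ("row4", "d")]
hfile_b = [("row2", "b"), ("row5", "e")]
hfile_c = [("row3", "c")]

merged = compact([hfile_a, hfile_b, hfile_c])
print([row for row, _ in merged])
# prints ['row1', 'row2', 'row3', 'row4', 'row5']
```

Because every input is already sorted, the merge never has to re-sort the data; it only repeatedly picks the smallest head element among the inputs.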

There are different types of compactions; let's look at them now.

Minor compaction

Minor compaction takes place on multiple HFiles in an HStore. In this type of compaction, a number of adjacent HFiles are picked up, merged, and rewritten into a single larger HFile. Deleted or expired records are not removed in the process; they are still present in the resulting HFile. The files to be merged are chosen heuristically. Minor compaction affects HBase performance and, therefore, there is a limit on the number of files merged at once; by default, it is 10.
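A minor compaction can be sketched as follows. This is a simplified model under stated assumptions: cells are (row key, timestamp, value) tuples, a value of None stands in for a delete marker, and the caller picks the run of adjacent files, since the real selection heuristics are not described here. The names minor_compact and TOMBSTONE are hypothetical.

```python
import heapq

TOMBSTONE = None  # hypothetical stand-in for a delete marker

def minor_compact(hfiles, start, count, max_files=10):
    """Merge `count` adjacent HFiles beginning at index `start`.

    Each HFile is a list of (row_key, timestamp, value) tuples sorted by
    (row_key, timestamp). Delete markers are carried over unchanged.
    """
    count = min(count, max_files)  # respect the cap on files per compaction
    picked = hfiles[start:start + count]
    merged = list(heapq.merge(*picked, key=lambda c: (c[0], c[1])))
    # The merged file takes the place of the files it was built from
    return hfiles[:start] + [merged] + hfiles[start + count:]

store = [
    [("r1", 1, "a"), ("r3", 1, "c")],
    [("r2", 2, "b"), ("r3", 2, TOMBSTONE)],
    [("r4", 3, "d")],
]
after = minor_compact(store, start=0, count=2)
print(len(after))  # prints 2
print(after[0])
```

Note that the delete marker for r3 and the value it hides both survive in the merged file, which matches the behavior described above.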

Major compaction

Major compaction folds all the HFiles of a store together to form a single HFile. In this type of compaction, deleted and expired records are discarded, and only the active, non-deleted records are kept. Major compaction is not a region merge; it works at the HStore level, merging all the HFiles of a column family. It can also be triggered on an entire table. It is a time-consuming and expensive operation that affects performance, so on large clusters it is generally triggered manually, at a time when there are fewer requests to the cluster. Refer to the following figure:

Figure: Major compaction
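The difference from a minor compaction can be seen in a small Python sketch. It is an illustration under assumptions, not HBase's implementation: cells are (row key, timestamp, value) tuples, None marks a delete, a delete marker is assumed to hide every cell of the same row with an equal or older timestamp, and expiry is modeled with a single ttl value. The name major_compact is hypothetical.

```python
import heapq

TOMBSTONE = None  # hypothetical stand-in for a delete marker

def major_compact(hfiles, now, ttl):
    """Merge all HFiles of a store into one, dropping deleted and expired cells."""
    merged = list(heapq.merge(*hfiles, key=lambda c: (c[0], c[1])))

    # First pass: remember the newest delete marker seen for each row
    deleted_at = {}
    for row, ts, value in merged:
        if value is TOMBSTONE:
            deleted_at[row] = max(ts, deleted_at.get(row, ts))

    # Second pass: keep only live, unexpired cells
    result = []
    for row, ts, value in merged:
        if value is TOMBSTONE:
            continue  # the marker itself is discarded
        if row in deleted_at and ts <= deleted_at[row]:
            continue  # cell is hidden by a delete marker
        if now - ts > ttl:
            continue  # cell has expired
        result.append((row, ts, value))
    return [result]  # the store now holds a single HFile

store = [
    [("r1", 100, "a"), ("r3", 100, "c")],
    [("r2", 10, "old"), ("r3", 110, TOMBSTONE)],
    [("r3", 120, "c2"), ("r4", 105, "d")],
]
print(major_compact(store, now=130, ttl=60))
# prints [[('r1', 100, 'a'), ('r3', 120, 'c2'), ('r4', 105, 'd')]]
```

The expired cell for r2, the deleted cell for r3, and the delete marker itself are all gone from the output, which is why a major compaction reclaims space that a minor compaction cannot.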

Region split

Region split is done by RegionServers. Once a region on a RegionServer becomes overloaded or exceeds the configured size threshold (256 MB by default in older HBase releases; newer releases use a much larger default), it is split into new regions. The flow of region splitting is shown in the following figure:

Figure: Region split

The following is the flow of region splitting, as illustrated in the preceding figure:

  1. The region to be split is taken offline by the RegionServer.
  2. The region is split into two daughter regions.
  3. Information about the new daughter regions is recorded in the .META. table.
  4. The new daughter regions are opened and made available.
  5. The region split information is passed to HMaster for an update.
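The five steps above can be traced in a toy Python model. Everything here is an assumption made for illustration: the .META. table is a plain dictionary, a region is a dictionary with start, end, rows, and online fields, the split point is the middle row key, and the daughter names are made up. The function split_region is hypothetical.

```python
def split_region(meta, region_name):
    """Split one region into two daughters and return their names."""
    region = meta[region_name]
    rows = region["rows"]  # row keys, already sorted

    # Step 1: take the parent region offline
    region["online"] = False

    # Step 2: split at the middle row key (an assumed split policy)
    mid = len(rows) // 2
    split_key = rows[mid]
    daughters = {
        region_name + "_a": {"start": region["start"], "end": split_key,
                             "rows": rows[:mid], "online": False},
        region_name + "_b": {"start": split_key, "end": region["end"],
                             "rows": rows[mid:], "online": False},
    }

    # Step 3: record the daughters in the catalog in place of the parent
    del meta[region_name]
    meta.update(daughters)

    # Step 4: open the daughters
    for name in daughters:
        meta[name]["online"] = True

    # Step 5: the returned names stand in for the report sent to HMaster
    return list(daughters)

meta = {"users": {"start": "a", "end": "z",
                  "rows": ["b", "d", "f", "h"], "online": True}}
print(split_region(meta, "users"))  # prints ['users_a', 'users_b']
print(meta["users_b"]["start"])     # prints f
```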

Region assignment

This is one of the main tasks of HMaster. Let's see how it works:

  1. HMaster calls AssignmentManager for region assignment.
  2. AssignmentManager looks up the current region assignments in the .META. table. If a region's assignment is correct and valid, it is kept; if it is invalid or incorrect, LoadBalancerFactory creates a DefaultLoadBalancer.
  3. DefaultLoadBalancer then assigns the region to a RegionServer.
  4. The new assignment is recorded in the .META. table.
  5. Once this is done successfully, the assigned region is opened and made available by the corresponding RegionServer.
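The decision made in the steps above can be sketched in Python. This is a simplified model, not the AssignmentManager API: the .META. table is assumed to be a dictionary mapping region names to server names, and a least-loaded choice stands in for whatever policy DefaultLoadBalancer actually applies. The function assign_region is hypothetical.

```python
def assign_region(meta, region, live_servers):
    """Return the server that should host `region`, updating `meta` if needed."""
    current = meta.get(region)
    if current in live_servers:
        return current  # assignment is valid, so keep it

    # Assumed balancing policy: pick the live server hosting the fewest regions
    load = {server: 0 for server in live_servers}
    for hosted_on in meta.values():
        if hosted_on in load:
            load[hosted_on] += 1
    target = min(sorted(load), key=load.get)  # sorted() makes ties deterministic

    meta[region] = target  # record the new assignment in the catalog
    return target

meta = {"r1": "rs1", "r2": "rs1", "r3": "rs9"}
live = ["rs1", "rs2"]
print(assign_region(meta, "r1", live))  # prints rs1
print(assign_region(meta, "r3", live))  # prints rs2
```

In the example, r1 keeps its assignment because rs1 is alive, while r3 points at a server that is not in the live list and is therefore moved to the least-loaded live server.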

Region merge

New regions are created whenever a region crosses the size threshold, so the number of regions can grow large, which carries a high cost in memory, I/O, and throughput. When the number of regions on a RegionServer reaches its limit (usually about 100 per RegionServer), regions are merged to bring the count down. The process flows as follows:

  1. The client initiates the region merge by sending a region merge RPC request to HMaster.
  2. HMaster moves the regions to be merged onto the same RegionServer and sends it a request for the region merge operation.
  3. The RegionServer takes the regions to be merged offline and merges them.
  4. The metadata of the merged regions is deleted from the .META. table, and the metadata of the new merged region is added to the .META. table.
  5. The resulting region is then brought online and made available, and HMaster is updated with the new region information.
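The merge flow above can be modeled with a short Python sketch. As before, this is a toy under assumptions: the .META. table is a dictionary, regions are dictionaries with start, end, server, rows, and online fields, and only adjacent regions are merged. The function merge_regions is hypothetical.

```python
def merge_regions(meta, left, right, merged_name):
    """Merge two adjacent regions and return the merged region's record."""
    a, b = meta[left], meta[right]
    if a["end"] != b["start"]:
        raise ValueError("regions are not adjacent")

    # Step 2: bring both regions onto the same server
    b["server"] = a["server"]

    # Step 3: take both regions offline and merge them
    a["online"] = b["online"] = False
    merged = {"start": a["start"], "end": b["end"], "server": a["server"],
              "rows": a["rows"] + b["rows"], "online": False}

    # Step 4: replace the old catalog entries with the merged one
    del meta[left]
    del meta[right]
    meta[merged_name] = merged

    # Step 5: bring the merged region online
    merged["online"] = True
    return merged

meta = {
    "t_a": {"start": "a", "end": "m", "server": "rs1", "rows": ["b", "c"], "online": True},
    "t_b": {"start": "m", "end": "z", "server": "rs2", "rows": ["n", "q"], "online": True},
}
result = merge_regions(meta, "t_a", "t_b", "t_ab")
print(sorted(meta))    # prints ['t_ab']
print(result["rows"])  # prints ['b', 'c', 'n', 'q']
```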

RegionServer failovers

When a RegionServer fails, the regions on that server go offline and are not available for reads or writes. Once HMaster detects the failure, it marks the assignments of those regions as invalid. It then initiates their assignment to other RegionServers, following the same steps we discussed in the region assignment process.
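A failover can be sketched as a bulk reassignment. This Python example is a simplified illustration, not HBase's recovery logic: the .META. table is assumed to be a dictionary mapping region names to server names, and regions from the failed server are handed out round-robin to the survivors, which is an assumed policy. The function handle_failover is hypothetical.

```python
from itertools import cycle

def handle_failover(meta, failed_server, live_servers):
    """Reassign every region of `failed_server`; return the regions that moved."""
    survivors = sorted(s for s in live_servers if s != failed_server)
    if not survivors:
        raise RuntimeError("no RegionServer left to take over the regions")

    targets = cycle(survivors)  # round-robin over surviving servers (assumed policy)
    moved = {}
    for region in sorted(meta):
        if meta[region] == failed_server:
            meta[region] = next(targets)  # the old assignment is invalid; replace it
            moved[region] = meta[region]
    return moved

meta = {"r1": "rs1", "r2": "rs2", "r3": "rs1", "r4": "rs3"}
print(handle_failover(meta, "rs1", ["rs1", "rs2", "rs3"]))
# prints {'r1': 'rs2', 'r3': 'rs3'}
```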

All other information on regions and operations on them will be discussed in Chapter 6, HBase Cluster Maintenance and Troubleshooting.
