As data is added to HBase, it is flushed to the store as immutable files. Each store holds one column family of a region, and a region consists of these row-key-ordered files. Because the files are immutable, new data produces additional files rather than updating a single file in place. With many files, I/O becomes slower, which causes lag in reads and writes and slows overall operation. To overcome this problem, HBase uses compaction; let's look into it now. Refer to the following figure for a better understanding:
As the name suggests, compaction makes files more compact and, hence, more efficient to read. When new data is written to HBase, a new HFile is generated each time the MemStore fills up, and a growing number of HFiles increases the I/O overhead. To minimize this, the HFiles are periodically merged into one HFile. If these files are not merged in time, there will be a huge overhead on the system. Compaction is nothing but the merging of two or more HFiles using an N-way merge; since HFiles are already in sorted order, no re-sorting is needed. Once the files are merged, the new file is loaded and the older files are discarded or deleted.
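The merge step can be illustrated with a minimal sketch. It is not HBase code: an HFile is modelled here as a plain list of hypothetical `(row_key, timestamp, value)` tuples, sorted by row key and then by newest timestamp first, and the function name `compact` is chosen only for illustration.

```python
import heapq

def compact(hfiles):
    """Merge several sorted HFile-like lists into one sorted list.

    Each "HFile" is a list of (row_key, timestamp, value) tuples that is
    already sorted by row key, then by newest timestamp first.
    """
    # heapq.merge performs an N-way merge and never re-sorts its inputs.
    return list(heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1])))

# Three small, already sorted "HFiles" (illustrative data only).
hfile_a = [("row1", 5, "a"), ("row3", 2, "c")]
hfile_b = [("row1", 9, "a2"), ("row2", 4, "b")]
hfile_c = [("row4", 1, "d")]

merged = compact([hfile_a, hfile_b, hfile_c])
```

Because every input is already ordered, the merge only ever compares the current head of each file, which is why compaction can stream through large files without loading them fully into memory.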
There are different types of compactions; let's look at them now.
Minor compaction takes place on multiple HFiles in an HStore. In this type of compaction, a number of adjacent HFiles are picked up, merged, and rewritten into a single larger HFile. Deleted or expired records are not removed during this process; they are still present in the resulting HFile. The files to be merged in a minor compaction are chosen heuristically. Minor compaction affects HBase performance and, therefore, there is a limit on the number of files to be merged; by default, it is 10.
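To make the idea of a heuristic choice concrete, here is a simplified sketch of how a run of adjacent files might be selected. It is only loosely modelled on ratio-based selection and is not HBase's actual policy: the function name and the `min_files` and `ratio` parameters are assumptions made for this example, while `max_files=10` mirrors the default limit mentioned above.

```python
def select_for_minor_compaction(sizes, max_files=10, min_files=3, ratio=1.2):
    """Return (start, end) indexes of the adjacent files to merge, or None.

    `sizes` lists HFile sizes ordered from oldest to newest.
    The rule is an illustrative heuristic, not HBase's real policy.
    """
    start = 0
    # Skip old files that are much larger than all newer files combined,
    # since rewriting them would cost a lot of I/O for little gain.
    while start < len(sizes) and sizes[start] > ratio * sum(sizes[start + 1:]):
        start += 1
    # Never merge more than max_files adjacent files in one compaction.
    end = min(start + max_files, len(sizes))
    if end - start < min_files:
        return None  # too few candidates to be worth compacting
    return start, end

# One large old file followed by four small recent flushes.
selection = select_for_minor_compaction([1000, 12, 10, 8, 9])
```

In this example the large old file is skipped and only the four small files are selected, which keeps the amount of rewritten data low.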
Major compaction folds all the HFiles together to form a single HFile. In this type of compaction, the deleted and expired records are discarded, and only the active, non-deleted records are kept. Generally, it is manually triggered on large clusters. Major compaction is not a region merge; it operates at the HStore level, merging all the HFiles of a column family. It can also be triggered on an entire table. It is a time-consuming and expensive operation that also affects performance, so it should be triggered when there are fewer requests to the cluster. Refer to the following figure:
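The difference from a minor compaction, namely that deleted and expired records are dropped, can be sketched as follows. This is a simplified model rather than HBase code: cells are hypothetical `(row_key, timestamp, value)` tuples, a `value` of `None` stands in for a delete marker that masks the whole row up to its timestamp, and `now` and `ttl` are assumed parameters in the same time unit as the timestamps.

```python
import heapq

def major_compact(hfiles, now, ttl):
    """Merge all HFiles of a store, dropping deleted and expired cells.

    Each HFile is a list of (row_key, timestamp, value) tuples sorted by
    row key, then newest timestamp first. value=None is a delete marker.
    """
    # Order: row key, newest first, delete markers before puts on a tie.
    order = lambda c: (c[0], -c[1], c[2] is not None)
    deleted_at = {}  # row_key -> timestamp of its newest delete marker
    result = []
    for row, ts, value in heapq.merge(*hfiles, key=order):
        if value is None:
            deleted_at.setdefault(row, ts)
            continue  # the delete marker itself is discarded
        if row in deleted_at and ts <= deleted_at[row]:
            continue  # masked by a delete marker
        if now - ts > ttl:
            continue  # older than the time-to-live, so expired
        result.append((row, ts, value))
    return result

hfile_old = [("row1", 10, "v1"), ("row2", 20, "v2"), ("row3", 5, "stale")]
hfile_new = [("row1", 50, None), ("row2", 60, "v2b")]

compacted = major_compact([hfile_old, hfile_new], now=100, ttl=90)
```

Here `row1` disappears because a newer delete marker masks it, `row3` disappears because it has expired, and both versions of `row2` survive in the single output file.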
Region split is done by RegionServers. In a RegionServer, once a region becomes overloaded or exceeds the threshold size of 256 MB, it is split into new regions. The flow of region splitting is shown in the following figure:
The following is the flow of region splitting, as illustrated in the preceding figure:
.META. table.
Region assignment is one of the main tasks of HMaster. Let's see how it works:
.META. table. If the region assignment is correct and valid, it keeps the region; if the region assignment is invalid or incorrect, LoadBalancerFactory creates DefaultLoadBalancer.
.META. table.
New regions are created whenever the region-size threshold is crossed, so the number of regions can grow large, which may carry a high cost in memory, I/O, and throughput. When the number of regions on a RegionServer reaches its maximum threshold (usually, 100 per RegionServer), a region merge is initiated by the RegionServer. This process flows as follows:
.META. table, and the new merged region's metadata is updated/added to the .META. table.
When a RegionServer fails, the regions on that server go offline and are not available for reads or writes. Once HMaster detects the failure, the assignments of those regions are marked invalid. Reassignment of the regions to other RegionServers is then initiated, following the same steps we discussed in the region assignment process.
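The reassignment after a failure can be pictured with a small sketch. It is an illustration only: the assignment map, the server and region names, and the least-loaded placement rule are all assumptions made for this example and do not reflect how HMaster's load balancer actually chooses a target.

```python
def reassign_regions(assignments, failed_server):
    """Return a new assignment map without the failed RegionServer.

    `assignments` maps a server name to the list of regions it hosts.
    Regions of the failed server go to the least-loaded survivor, which
    is a placement rule assumed here purely for illustration.
    """
    survivors = {s: list(r) for s, r in assignments.items() if s != failed_server}
    if not survivors:
        raise RuntimeError("no live RegionServer left to host regions")
    for region in assignments.get(failed_server, []):
        # Pick the survivor with the fewest regions; break ties by name.
        target = min(survivors, key=lambda s: (len(survivors[s]), s))
        survivors[target].append(region)
    return survivors

cluster = {"rs1": ["r1", "r2"], "rs2": ["r3"], "rs3": ["r4", "r5", "r6"]}
after_failure = reassign_regions(cluster, "rs1")
```

The original map is left untouched, so the sketch can be rerun with a different failed server to see how the regions would be redistributed.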
All other information on regions and operations on them will be discussed in Chapter 6, HBase Cluster Maintenance and Troubleshooting.