Let's look at how read and write operations take place on HBase tables. In HBase, the client does not write data to an HFile directly. The data is first written to the WAL and then to the MemStore, which an HStore keeps in main memory, and it is flushed to an HFile later. Refer to the following figure:
Write-Ahead Logs (WALs) provide data reliability and reside on HDFS; each RegionServer hosts a single WAL. If a RegionServer crashes before its MemStore has been flushed, the WAL is replayed to restore the data on a new RegionServer. A write operation is therefore considered successful only once the data has been written to both the WAL and the MemStore.
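To make the ordering concrete, here is a minimal sketch of the write path and of recovery by replaying the log. It is a toy model in plain Python, not HBase client code: the class `ToyRegionServer`, its methods, and the list standing in for the WAL on HDFS are all made up for illustration.

```python
# Toy model of the write path: append to the WAL first, then update the MemStore.
# All names here are illustrative assumptions, not HBase APIs.

class ToyRegionServer:
    def __init__(self, wal):
        self.wal = wal        # stands in for the WAL on HDFS (survives a crash)
        self.memstore = {}    # in-memory write buffer (lost on a crash)

    def put(self, row, value):
        self.wal.append((row, value))   # 1. durable log entry
        self.memstore[row] = value      # 2. in-memory update
        return True                     # acknowledged only after both steps

    @classmethod
    def recover(cls, wal):
        """Rebuild the MemStore on a new server by replaying the surviving WAL."""
        server = cls(wal)
        for row, value in wal:
            server.memstore[row] = value
        return server


shared_wal = []                 # pretend this list lives on HDFS
rs1 = ToyRegionServer(shared_wal)
rs1.put("row1", "a")
rs1.put("row2", "b")
rs1.put("row1", "c")            # a later write to the same row wins

del rs1                         # simulate a crash: the MemStore is gone
rs2 = ToyRegionServer.recover(shared_wal)
print(rs2.memstore)
```

Because every write reaches the log before it is acknowledged, replaying the log in order is enough to rebuild the unflushed state on the new server.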
MemStore acts as an in-memory write buffer with a default flush size of 64 MB. Once the data in a MemStore reaches this threshold, or once all the MemStores on a RegionServer together occupy the configured share of the heap (40 percent by default), it is flushed to a new HFile on HDFS for persistence. The 64 MB HFile size is not related to the HDFS block size; Hadoop internally manages block allocation and storage, and HBase plays no role in replicating blocks or dividing HFiles into blocks. Each column family might have many HFiles, but an HFile belongs to exactly one column family.
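The flush behavior can be sketched in a few lines. This is a simplified model, assuming a hypothetical `ToyStore` class that counts bytes naively and uses a tiny threshold in place of 64 MB; it does not model the heap-wide limit or HDFS.

```python
# Toy model of MemStore flushing. Names are illustrative, not HBase APIs.

class ToyStore:
    """One store = one column family: a MemStore plus a list of immutable 'HFiles'."""

    def __init__(self, flush_size_bytes):
        self.flush_size_bytes = flush_size_bytes
        self.memstore = {}
        self.memstore_bytes = 0
        self.hfiles = []        # each flush appends one new sorted file

    def put(self, row, value):
        # Size accounting is simplified: each write adds its key and value lengths.
        self.memstore[row] = value
        self.memstore_bytes += len(row) + len(value)
        if self.memstore_bytes >= self.flush_size_bytes:
            self.flush()

    def flush(self):
        # Write the buffered rows out in sorted key order, then start a fresh buffer.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.memstore_bytes = 0


store = ToyStore(flush_size_bytes=20)   # tiny threshold so the example flushes quickly
for i in range(5):
    store.put(f"row{i}", "value")       # each put adds 4 + 5 = 9 bytes

print(len(store.hfiles))   # number of files created by flushes
print(store.memstore)      # rows still waiting in memory
```

Each flush produces a new file rather than modifying an old one, which is why a column family accumulates several HFiles over time.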
Now, let's take a look at the process flow of reading from HBase. The read starts when the client initiates a read request; the client finds out which RegionServer hosts the region that contains the row and sends the request to that RegionServer. The RegionServer first tries to read from the MemStore; on a hit, the read completes there. On a miss, it checks the block cache. If the data is not in the block cache either, it reaches out to the HFile, and the part of the HFile that contains the required row is loaded into memory. So, MemStore and block cache provide fast, in-memory access to data for performance purposes, and HFile provides persistent data that is read on demand.
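The lookup order described above can be sketched as a single function. This is a deliberately simplified model: the function `read_row` and the dictionaries are hypothetical, and rows are cached individually here, whereas the real block cache holds whole HFile blocks.

```python
# Toy model of the read lookup order: MemStore, then block cache, then HFiles.
# Names and data structures are illustrative assumptions, not HBase APIs.

def read_row(row, memstore, block_cache, hfiles):
    """Return (value, source) for a row, or (None, "not found")."""
    if row in memstore:
        return memstore[row], "memstore"
    if row in block_cache:
        return block_cache[row], "block cache"
    # Search the newest HFile first so the most recently flushed value wins.
    for hfile in reversed(hfiles):
        if row in hfile:
            block_cache[row] = hfile[row]   # keep it in memory for later reads
            return hfile[row], "hfile"
    return None, "not found"


memstore = {"row3": "c"}
block_cache = {}
hfiles = [{"row1": "a-old", "row2": "b"}, {"row1": "a"}]   # oldest file first

first = read_row("row3", memstore, block_cache, hfiles)
second = read_row("row1", memstore, block_cache, hfiles)
third = read_row("row1", memstore, block_cache, hfiles)
missing = read_row("row9", memstore, block_cache, hfiles)
print(first, second, third, missing)
```

The second read of `row1` is served from the cache, which is the effect the block cache is meant to have on repeated reads.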
Block cache follows the least recently used (LRU) algorithm: when the cache is full, the blocks that have gone unused for the longest time are evicted first. Every RegionServer has a single block cache that keeps recently accessed data from HFiles in main memory, which reduces the number of disk seeks and, with it, the data access time.
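As an illustration of LRU eviction, here is a minimal sketch built on Python's `collections.OrderedDict`. The class `ToyLruCache` and the block names are made up for the example, and capacity is counted in entries rather than bytes; this is not how HBase implements its cache.

```python
from collections import OrderedDict

# Toy LRU cache. The name and the entry-count capacity are illustrative assumptions.

class ToyLruCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()     # oldest entry first, newest entry last

    def get(self, key):
        if key not in self.blocks:
            return None                 # cache miss
        self.blocks.move_to_end(key)    # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used entry


cache = ToyLruCache(capacity=2)
cache.put("block-a", "A")
cache.put("block-b", "B")
cache.get("block-a")            # touch block-a, so block-b is now the oldest
cache.put("block-c", "C")       # cache is full: block-b is evicted
print(list(cache.blocks))
```

Reading `block-a` before inserting `block-c` is what saves it from eviction: the entry that is dropped is the one that has gone longest without being used.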