Chapter 4. Optimizing the HBase/Hadoop Cluster

Different workloads have different characteristics, so experiment with different tuning options before finalizing a configuration. Because HBase runs on top of Hadoop, we cannot achieve optimum HBase performance without first optimizing Hadoop. So, we will look at the optimization parameters of Hadoop first and then continue with optimizing HBase.

In this chapter, we will discuss the following topics:

  • Hadoop and HBase cluster types
  • Hardware requirements
  • Capacity planning
  • Hardware, network, and operating system considerations
  • The optimization of different components in a cluster configuration
  • Different configuration files in Hadoop/HBase

Setup types for Hadoop and HBase clusters

Later in this chapter, we will look at the Hadoop files and the parameters in them through which optimization can be performed. First, let's see examples of Hadoop/HBase cluster types. When we configure a Hadoop/HBase cluster, we can have the following types of clusters, according to how they will be used:

  • Standalone: This cluster type is suitable for development work, where a single machine hosts all the daemon processes, or where one machine runs many virtual machines. This type of cluster is good for evaluation and testing purposes.
  • Small: This cluster type has up to 20 nodes, with the various processes running on different machines. It is good for small production deployments with modest data and processing requirements.
  • Medium: This cluster type can have 20 to 1,000 nodes with high availability (HA), a ZooKeeper ensemble of three to five nodes, and dedicated DataNodes, which is better suited to full-fledged production clusters.
  • Large: This cluster type can have 1,000 or more nodes with huge storage capacity and many machines providing high dataset-processing power, which is best for large-scale setups and processing-heavy clusters.

The hardware requirements for these clusters depend on the user scenario. It is best to have more machines with an average amount of storage (GBs to TBs) attached and a more-than-average amount of primary memory (8 GB to 128 GB) for DataNodes. Assigning a huge amount of that memory to the heap is not advisable, however, as lengthy garbage-collection pauses can hurt performance.
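To keep garbage-collection pauses in check, the JVM heap is usually capped well below the physical memory of the machine. As a minimal sketch, this can be done in hbase-env.sh; the specific values here are illustrative, not recommendations:

```shell
# hbase-env.sh -- illustrative values; tune for your workload.
# Keep the RegionServer heap moderate (for example, 8 GB) even on
# machines with far more RAM, because very large heaps lead to
# long garbage-collection pauses.
export HBASE_HEAPSIZE=8G

# Optionally log GC details so that long pauses can be spotted:
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails"
```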

Let's see an example cluster for the following components:

  • NameNode/HMaster: This is one of the most critical components of a Hadoop cluster, so it should run on the best and most robust machine available. It does not fail as frequently as DataNodes do, but if the NameNode fails, the whole cluster goes down.

    The recommended hardware configuration for NameNode is as follows:

    • 16 GB to 64 GB of memory
    • 2 x (8 to 24)-core processors
    • A ~1 TB SATA disk at 7,200 RPM (Solid State Drives are preferred), plus one network-mounted disk for a secondary backup of the metadata
    • 2 x 1 Gb Ethernet controllers

    Because all the metadata is cached in main memory for faster access, the main memory should be of good speed and quality. More memory enables a bigger namespace on the NameNode, which means a larger number of files can be hosted by the cluster.

    Not much storage is required on the NameNode, so less disk space is needed. We must have at least two disk locations for the metadata so that two copies exist: one on a disk attached to the machine and another on a network-mounted disk. The metadata is loaded into main memory, and a persistent copy of it, along with the edit logs, is kept on disk.
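    In Hadoop, the two metadata locations are listed as a comma-separated value of the dfs.name.dir property (dfs.namenode.name.dir in Hadoop 2) in hdfs-site.xml. The paths below are examples only:

```xml
<!-- hdfs-site.xml: two metadata directories, one local and one
     network-mounted (NFS); the paths are illustrative. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>
```

    The NameNode writes its image and edit logs to every listed directory, so losing the local disk still leaves an intact copy on the network mount.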

  • JobTracker: This can run on the NameNode machine or be hosted separately; the hardware requirements are the same as for the NameNode. Since it only schedules and distributes jobs, it does not need much storage or processing power.
  • DataNode/TaskTracker/RegionServers: The actual data resides on these nodes, so they need ample storage and processing power. With a large number of machines, average storage, memory, and processing power per node are sufficient; with fewer machines in the cluster, each node needs to be more powerful, as recommended previously:
    • 24 GB to 128 GB of memory
    • 2 x (8 to 16/24)-core processors
    • 8 (2 x 4) x 2 TB disks at 7,200 RPM
    • 2 x 1 Gb Ethernet controllers
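    The per-node processing power above is what ultimately bounds the number of concurrent MapReduce tasks a TaskTracker runs. As a sketch, under MRv1 the slot counts are set in mapred-site.xml; the values below are assumptions sized loosely to the core counts listed, not recommendations:

```xml
<!-- mapred-site.xml: task slots per TaskTracker (MRv1);
     the values are illustrative. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>6</value>
</property>
```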