Recommendations for CDH cluster configuration

The hardware specification varies with the amount of data to be stored and the processing power required. The following configuration is recommended:

  • 1 to 4 TB hard disks
  • Two processors with 8 to 24 cores each, running at 2 to 2.5 GHz or faster
  • 64 to 512 GB of memory
  • Bonded Gigabit Ethernet or 10 Gigabit Ethernet

Now, let's look at these hardware components in more detail:

  • CPU: The workload a node can handle depends on this component. A medium-clock-speed CPU in a dual-socket configuration is recommended for DataNodes. Why medium? Because the cost of a setup built on high-end processors rises quickly, it is usually cheaper to run more machines with mid-range CPUs than fewer machines with high-end ones. Hence the recommendation of 8 to 24 core processors at a medium clock speed, which also keeps power consumption lower.
  • Power: Power consumption is also worth considering when configuring a Hadoop cluster, because it rises with high-end processors or a larger number of machines, and with it the cost of air cooling and the supporting environment. There must be a constant power supply with failover so that operations are uninterrupted, along with proper air conditioning for the cluster environment.
  • RAM: We need only enough memory to keep the processors busy processing rather than waiting for data to be brought into main memory. So, 8 GB to 48 GB of RAM is adequate for a machine in the cluster. HBase tends to use a lot of memory and keeps files in main memory (if in-memory tables are enabled), so for clusters that also host HBase, we should budget more memory than for Hadoop-only clusters. If caching is enabled in HBase, it tries to keep the entire table in main memory, so depending on which components (DataNode, RegionServer, and TaskTracker) a machine hosts, we might have to add or reduce RAM. Whatever amount we specify is a global resource, serving not only the Hadoop/HBase heaps but also the operating system. If the system has spare memory, we can raise the heap sizes as required.
  • Disk: Use high-speed (7200 RPM) SATA drives. The amount of disk storage depends on how much data the cluster must hold and on the number of machines. It is not advisable to load a single machine with a huge amount of disk space, because if it fails there is a large overhead in re-replicating its blocks. Machines can use locally attached disks, network-attached disks, or both: if a machine with only local disks fails, its data is unavailable until re-replicated, whereas a network-attached disk can be reattached to a newly configured machine and its blocks reported back to the NameNode. As an example, each machine might have one to four disks, with capacities calculated from the data storage requirement; Solid State Drives (SSDs), where available, also boost throughput considerably.
  • Network: Hadoop/HBase transfers data between nodes while running tasks, accessing data, or writing to the cluster, so a high-speed network between the nodes, together with a high-speed network switch, is advisable. For a small or medium cluster, a 1 Gb/s network is enough; for bigger clusters, a 10 Gb/s network is preferable. The network load also depends on the type of analytical computation running in the cluster. Some operations, such as sorting and shuffling, transfer a lot of data between the nodes, so bandwidth matters a lot; if adequate bandwidth is not available, there will be more timeouts and issues such as RegionServer failures, ZooKeeper timeouts, bad connection errors, and no route to host errors. For smaller clusters, a single switch gives better performance; for bigger clusters, we can use multiple fast switches.
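The RAM guidance above translates into the heap settings that Hadoop and HBase read at startup, `HADOOP_HEAPSIZE` in `hadoop-env.sh` and `HBASE_HEAPSIZE` in `hbase-env.sh` (both in MB). The values below are illustrative assumptions for a machine with plenty of RAM, not prescriptions; tune them to the components each machine actually hosts:

```shell
# hadoop-env.sh -- heap for Hadoop daemons on this node (value in MB;
# 4000 here is an illustrative figure, not a recommendation)
export HADOOP_HEAPSIZE=4000

# hbase-env.sh -- RegionServer heap (MB). Leave spare RAM for the OS
# page cache, which HBase benefits from alongside its own heap.
export HBASE_HEAPSIZE=8000
```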
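A quick way to sanity-check whether a candidate drive delivers the sequential write throughput expected of a 7200 RPM SATA disk is a `dd` test. `TESTDIR` below is a placeholder assumption; point it at the mount of the disk under test rather than `/tmp`:

```shell
# Sequential-write sanity check for a candidate DataNode disk.
# TESTDIR is a placeholder: set it to the mount point of the disk under test.
TESTDIR=/tmp
# oflag=dsync forces synchronous writes so the page cache does not
# inflate the result; dd prints the throughput on its final summary line.
dd if=/dev/zero of="$TESTDIR/ddtest" bs=1M count=256 oflag=dsync 2>&1 | tail -n 1
rm -f "$TESTDIR/ddtest"
```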
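The disk guidance above implies a simple sizing calculation: raw data times the HDFS replication factor (3 by default), plus headroom for intermediate and shuffle data, divided by the usable capacity per node. The figures below (100 TB raw, 25% headroom, four 2 TB disks per node) are illustrative assumptions only:

```shell
# Back-of-the-envelope node count for a given storage requirement.
# All figures below are illustrative assumptions, not recommendations.
RAW_TB=100          # raw data to be stored
REPLICATION=3       # HDFS default replication factor
HEADROOM_PCT=25     # extra space for shuffle/intermediate data
DISKS_PER_NODE=4    # one to four disks per machine, as discussed above
DISK_TB=2

usable_per_node=$((DISKS_PER_NODE * DISK_TB))
needed_tb=$((RAW_TB * REPLICATION * (100 + HEADROOM_PCT) / 100))
# Ceiling division so the node count is rounded up, never down.
nodes=$(( (needed_tb + usable_per_node - 1) / usable_per_node ))
echo "Need ~${needed_tb} TB usable, i.e. at least ${nodes} nodes"
```

With these inputs the sketch reports roughly 375 TB of required usable space across at least 47 nodes; substitute your own storage requirement and per-node disk layout.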