Performing capacity planning

Hadoop and HBase were designed to run on commodity hardware, so a cluster can be built from hundreds of inexpensive machines. As the data becomes more valuable or important, it is worth investing in better machines to make cluster operation more robust.

There are two common scenarios: one in which we have many low-end machines, and another in which we have fewer machines for the cluster. In the first scenario, we can set a higher replication factor, since we have many machines with storage and memory; with more replicas, data remains available even if machines fail frequently. Even in this scenario, we must use a well-configured machine to host the NameNode, because it is a crucial component of the cluster, and we need a proper backup plan for its metadata. In the second scenario, where we have fewer machines, each machine must be well configured.
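As a minimal sketch of the first scenario, the following Java snippet raises the replication factor through Hadoop's client API; the target path and the factor of 4 are illustrative assumptions, not values from this chapter. The same default can also be set cluster-wide through the dfs.replication property in hdfs-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created through this client;
        // four copies survive up to three simultaneous node failures.
        conf.setInt("dfs.replication", 4);
        FileSystem fs = FileSystem.get(conf);

        // Raise the replication of data that already exists on HDFS
        // (the path here is a hypothetical example).
        fs.setReplication(new Path("/data/important"), (short) 4);
    }
}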

The following table shows typical configuration requirements for the machines in a cluster; the details also depend on the use case and the user's storage requirements:

CPU: 8 to 24 cores.

Number of nodes:

  • One primary node (NameNode)
  • One HMaster
  • One secondary node (SecondaryNameNode)
  • One backup HMaster
  • One JobTracker
  • Five to n nodes running the TaskTracker and RegionServer daemons

The number of nodes depends on the attached storage and the amount of data to be stored. It is always better to have more nodes, each with an average amount of storage attached.

Memory: 8 GB to 128 GB per machine, depending on the kind of processing needed; the more primary memory available, the faster the processing. It also depends on which daemon processes need to run on each node.

Storage: Depends on the number of nodes and the amount of data to be stored. It is better to have a larger number of nodes with less disk storage each, for example, 10 nodes with 2 TB of storage attached to each.

Network: A fiber-optic network is ideal, but a copper-cable network with fast switches will also work. It is always better to spread the machines of a cluster across more than one rack. Generally, links of 1 Gbps to 10 Gbps give better network transfers and reduce network congestion.

For a better configuration, we can host these HBase/Hadoop daemons on separate machines. It all depends on the user's use case and the type and amount of data to be stored and processed on the cluster. The requirement can be estimated as follows: if the replication factor is 3 and 2 TB of data is to be stored, we need around 6 TB for the replicated data, with the rest reserved for processing and intermediate temporary files, so we should have around 10 TB of raw storage available in the cluster. This can be provided by either 5 DataNodes with 2 TB each or 10 DataNodes with 1 TB each, depending on user preference.
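The following minimal Java sketch reproduces this arithmetic. The 40 percent allowance for processing and intermediate temporary files is an assumption chosen to match the ballpark figures above, not a fixed rule:

public class ClusterSizing {
    public static void main(String[] args) {
        double rawDataTb = 2.0;       // user data to be stored
        int replicationFactor = 3;    // copies kept by HDFS

        // Space consumed by the replicated data: 3 x 2 TB = 6 TB.
        double replicatedTb = rawDataTb * replicationFactor;

        // Assumption: reserve roughly 40% of raw capacity for
        // processing and intermediate temporary files, which
        // reproduces the ~10 TB figure in the text.
        double tempFraction = 0.4;
        double totalTb = replicatedTb / (1 - tempFraction);

        int nodes = 5;                       // e.g. 5 DataNodes...
        double perNodeTb = totalTb / nodes;  // ...with 2 TB each

        System.out.printf("Replicated data: %.1f TB%n", replicatedTb);
        System.out.printf("Total raw storage needed: %.1f TB%n", totalTb);
        System.out.printf("Per-node storage (%d nodes): %.1f TB%n",
                nodes, perNodeTb);
    }
}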
