Capacity planning

Suppose we have around 2 TB of data with a replication factor of 3. The replicated data alone occupies 3 * 2 = 6 TB, and we still need roughly 2 TB of extra space on top of that. So, for 2 TB of data, we can build a cluster of 4 to 8 DataNodes totaling 8 TB of disk storage.

This extra space is needed for the intermediate and temporary files generated during read/write operations and MapReduce jobs. If the input data is huge and the MapReduce code processes all of it, the temporary and intermediate result files themselves require substantial HDFS storage; without enough disk, we will see many failing tasks and blacklisted nodes. It is advisable to have 25 to 50 percent more storage than the original data size (without the replication factor) on the cluster; at a minimum, 25 percent more than the whole data size is needed to run MapReduce jobs without many failures.

So, we can apply an approximate formula, as follows (not a universal formula, but useful for estimating the storage required):

T = S * R * 1.25 (the factor of 1.25 approximately accounts for intermediate files)

Here, S is the size of data to store on HDFS, R is the replication factor, and T is the total space.

This applies to clusters where we need to run MapReduce jobs frequently. For clusters where we don't run jobs but only store, read, and write files, we can use a formula such as S * R plus some extra amount of disk space.

So, for example, if we have 2 TB of data, the total space is T = (2 * 3) + 6 * (1/4) = 7.5 TB (approximately).
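
As a quick sanity check, the following Python sketch (an illustration, not from the original text; the function name and parameters are made up for this example) applies the approximate formulas above:

def required_hdfs_storage_tb(data_tb, replication=3, overhead=0.25):
    # Approximate HDFS storage (in TB) for data_tb of raw data.
    # overhead is the extra fraction reserved for intermediate and
    # temporary files on MapReduce-heavy clusters (the 1.25 factor above);
    # use a smaller value for clusters that only store and read/write files.
    return data_tb * replication * (1 + overhead)

# Worked example from the text: 2 TB of data, replication factor of 3.
print(required_hdfs_storage_tb(2))                 # 7.5 TB (MapReduce-heavy cluster)
print(required_hdfs_storage_tb(2, overhead=0.0))   # 6.0 TB (storage-only cluster)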

Thus, if we assign about 8 TB to this cluster (four nodes with 2 TB each, or five nodes with 1.5 TB each), we can ensure the cluster functions properly and remains manageable as the amount of data grows; we simply add more nodes or storage as the data grows. Note that this size is the space available to HDFS, not the node's total disk, so if each node allocates 2 TB to HDFS, it should have at least 2.5 or 3 TB of disk overall for the system.

We can calculate the number of DataNodes in a cluster as follows: Number of DataNodes = total storage required / disk space allocated to HDFS per node, where the per-node allocation excludes the disk reserved for the system (for other resources).

So, if 100 TB needs to be stored and each node has 5 TB of disk attached, the number of DataNodes will be 100 / 5 = 20 (at least).
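
To illustrate, here is a small Python sketch (again illustrative; the function and the reserved_per_node_tb parameter are assumptions, not from the text) that computes the minimum number of DataNodes while keeping some disk aside on each node for the operating system and other resources:

import math

def datanodes_needed(total_storage_tb, disk_per_node_tb, reserved_per_node_tb=0.5):
    # Minimum number of DataNodes for the required HDFS storage.
    # reserved_per_node_tb approximates the disk kept aside on each node
    # for the operating system, logs, and other non-HDFS use, so only the
    # remainder counts towards HDFS capacity.
    usable_per_node = disk_per_node_tb - reserved_per_node_tb
    return math.ceil(total_storage_tb / usable_per_node)

# Example from the text: 100 TB required, 5 TB of disk attached per node.
print(datanodes_needed(100, 5, reserved_per_node_tb=0))    # 20 nodes
print(datanodes_needed(100, 5))                            # 23 nodes once OS space is reserved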

If compression is enabled, the storage requirement may reduce. There is no universal formula; however, the aforementioned formula can be used for an approximate estimation.
