Hadoop Distributed File System - HDFS

HDFS forms the underlying basis of all Hadoop installations. Files, or more generally data, are stored in HDFS and accessed by the nodes of the Hadoop cluster.

HDFS performs two main functions:

  • Namespace: Provides the namespace that holds cluster metadata, that is, the location of data in the Hadoop cluster
  • Data storage: Acts as storage for data used in the Hadoop cluster

The filesystem is termed distributed because the data is stored in chunks across multiple servers. An intuitive understanding of HDFS can be gained from a simple example. Consider a large book that consists of Chapters A-Z. In an ordinary filesystem, the entire book would be stored as a single file on one disk. In HDFS, the book would be split into smaller chunks, say one chunk for Chapters A-H, another for Chapters I-P, and a third for Chapters Q-Z. These chunks are then stored on separate racks (or bookshelves, in this analogy). Further, each chunk is replicated, such that three copies of each group of chapters exist in the cluster.

Suppose, further, that the entire book is 1 GB in size; each of the three parts would then be approximately 350 MB:

A bookshelf analogy for HDFS

Storing the book in this manner achieves a few important objectives:

  • Since the book has been split into three parts by groups of chapters and each part has been replicated three times, a process can read the book in parallel by fetching the parts from different servers. This reduces I/O contention and is a natural application of parallelism. As the sketch after this list shows, a client can even ask HDFS which servers hold each part.
  • If any rack is unavailable, we can retrieve its chapters from another rack, as multiple copies of each chapter are available on different racks.
  • If a given task requires access to only a single chapter, for example, Chapter B, only the file corresponding to Chapters A-H needs to be read. Since that file is a third the size of the entire book, the time to access and read it is correspondingly smaller.
  • Other benefits, such as selective access rights to different chapter groups, would also be possible with such a model.
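To make the analogy concrete, the following is a minimal sketch, using the Hadoop FileSystem Java API, of how a client can ask the cluster which servers hold each piece of a file. The path /books/big_book.txt is a hypothetical stand-in for our 1 GB book, and the cluster address is assumed to be available from the configuration files on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from the core-site.xml on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file standing in for the 1 GB book
            Path book = new Path("/books/big_book.txt");
            FileStatus status = fs.getFileStatus(book);

            // Ask the NameNode which DataNodes hold each block
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }

Each BlockLocation here corresponds to one chunk of the file, and the hosts listed for it are the servers holding that chunk's replicas; this is exactly the information that makes parallel reads and rack-level fault tolerance possible.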

This may be an oversimplified analogy of the actual HDFS functionality, but it conveys the basic principle of the technology: large files are subdivided into blocks (chunks) and spread across multiple servers in a highly available, redundant configuration.
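Both the block size and the replication factor are configurable, either cluster-wide or per file. What follows is a minimal sketch of writing a file with an explicit block size and replication factor; the path is hypothetical, and the values shown (128 MB blocks, three replicas) simply match the Hadoop 2 defaults for the dfs.blocksize and dfs.replication properties:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical destination path for our "book"
            Path out = new Path("/books/big_book.txt");

            // Per-file override of the cluster defaults:
            // 128 MB blocks, 3 replicas of each block
            FSDataOutputStream stream = fs.create(
                    out,
                    true,                 // overwrite if the file exists
                    4096,                 // I/O buffer size in bytes
                    (short) 3,            // replication factor
                    128L * 1024 * 1024);  // block size in bytes
            stream.writeUTF("Chapter A ...");
            stream.close();
            fs.close();
        }
    }

We'll now look at the actual HDFS architecture in a bit more detail.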

The HDFS backend of Hadoop consists of:

  • NameNode: This can be considered the master node. The NameNode holds the cluster metadata and knows which data is stored in which location; in short, it holds the namespace. It keeps the entire namespace in RAM and, when a request arrives, reports which servers hold the data required for the task. In Hadoop 2, there can be more than one NameNode, for example, an active/standby pair for high availability. A Secondary NameNode can also be created as a helper to the primary. It is not a failover backup, but a node that periodically merges the NameNode's edit log into its filesystem image, keeping the cluster metadata checkpoint up to date.
  • DataNode: The DataNodes are the individual servers responsible for storing chunks of the data and for performing compute operations when new requests arrive. These are typically commodity servers, less powerful in terms of resources and capacity than the NameNode that holds the cluster metadata. The sketch below shows what this division of labor looks like from a client's point of view.
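From the client's perspective, the split between NameNode and DataNodes is transparent. In the minimal read sketch below, again assuming the hypothetical /books/big_book.txt path, the open() call consults the NameNode for the file's block list, and the returned stream then pulls the bytes directly from the DataNodes that hold each block:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadThroughHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // open() asks the NameNode for the block metadata; the
            // stream then reads each block from a DataNode replica
            FSDataInputStream in = fs.open(new Path("/books/big_book.txt"));
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                // process buffer[0..bytesRead) here
            }
            in.close();
            fs.close();
        }
    }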