Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is a filesystem spread across multiple servers and is designed to run on low-cost commodity hardware. HDFS follows a write-once, read-many philosophy and was designed for large-scale batch processing of large to very large files.

Files are divided into blocks, with a typical block size of 128 MB. A file on HDFS is therefore sliced into 128 MB chunks (the blocks), which are distributed across different data nodes. Files in HDFS normally range from gigabytes to terabytes in size.
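As a quick illustration (the file path here is hypothetical), the fsck utility reports how a file stored in HDFS has been split into blocks and which data nodes hold each block:

hdfs fsck /user/hadoop/bigfile.csv -files -blocks -locations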

HDFS was designed for batch processing rather than low-latency interactive queries from users. It is not meant for files whose data changes frequently; new data is typically appended to existing files or added in new files. Large numbers of small files are detrimental to HDFS operations, so small files are often combined into a single larger file for performance reasons.
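For example, assuming append is enabled on the cluster, several small local files (the file names here are hypothetical) can be appended into one larger HDFS file instead of being stored individually:

hdfs dfs -appendToFile part1.csv part2.csv /user/hadoop/combined.csv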

The following diagram shows the general HDFS architecture. The NameNode controls the file namespace and is the gatekeeper for file operations across the nodes. It tracks where file blocks are located and handles client requests to read and update files:

HDFS architecture. Source: Apache Software Foundation

HDFS assumes hardware will fail and files will get lost or corrupted. One of the ways it addresses this is by keeping multiple copies of each file distributed across nodes; no single node holds all the copies of a file. The number of copies is configurable, but the default setting is three. HDFS uses some intelligence in determining the location and distribution of file blocks to balance fault tolerance and performance.
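As a sketch, the replication factor of an existing file (the path here is hypothetical) can be checked or changed with the setrep command; the -w flag waits until the requested replication level is reached. The cluster-wide default is controlled by the dfs.replication property:

hdfs dfs -setrep -w 3 /user/hadoop/bigfile.csv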

If a node fails and file blocks are lost, HDFS automatically creates replacement copies of those blocks from the remaining copies on other nodes and distributes them across the normally operating nodes. The NameNode keeps track of all these operations. File blocks are also moved around for performance optimization purposes. The actual duplication of file blocks is handled by the data nodes in communication with each other.
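One way to observe this behavior, assuming administrative access to the cluster, is to run the dfsadmin report, which lists live and dead data nodes, or fsck, which flags under-replicated and missing blocks:

hdfs dfsadmin -report
hdfs fsck /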

To get an idea of how this works, we will describe the high-level process of writing a file. A client, which could be a command-line prompt on the MasterNode or some other software that connects to the cluster, requests a write or read of a file. This request goes to the NameNode, which returns the list of node identifiers where the file blocks can be retrieved or written. When a file is written, the client writes the file blocks once to the nodes in the list, and the data nodes then replicate them the configured number of times by communicating among themselves.

HDFS file commands are very similar to Linux filesystem commands. The following example instructs HDFS to take the local file called lots_o_data.csv and distribute it across the HDFS cluster in the namespace /user/hadoop/datafolder:

hdfs dfs -put lots_o_data.csv /user/hadoop/datafolder/lots_o_data.csv

The following example would copy the same file from its distributed form in HDFS to the local (non-HDFS single server) file directory:

hdfs dfs -get /user/hadoop/datafolder/lots_o_data.csv lots_o_data.csv

The NameNode abstracts the true distributed location of the pieces of a file. This allows clients to address files using familiar file and folder nomenclature, as if each file were stored in one piece on a single drive. The namespace automatically grows to incorporate new nodes, which add more storage space.
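For instance, the directory populated with the put command above can be listed just like a local folder, even though its contents are physically spread across many nodes:

hdfs dfs -ls /user/hadoop/datafolder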

Files can be of any type and do not have to be in a standard format. You can have unstructured files, such as Word documents, as well as highly structured files, such as relational database table extracts. HDFS is indifferent to the level of structure in the data that resides in the files.

Many of the tools that interact with HDFS for analysis purposes do rely on structure. The structure can be applied as needed just before a file is read, which is known as schema-on-read, or it can be defined within the file itself as it is written, which is called schema-on-write.

Schema-on-write is how relational databases, such as Oracle or Microsoft SQL Server, operate. This is a key differentiator between HDFS and traditional database systems: HDFS can be used to apply multiple structures to the same dataset (schema-on-read). The downside is that the application reading the dataset must know and apply the structure in order to use it for analytics. There are some open source projects meant to address this downside for datasets that need a clear structure. We will discuss the two major ones next.
