Chapter 2. HDFS

In this chapter we will cover:

  • Reading and writing data to HDFS
  • Compressing data using LZO
  • Reading and writing data to SequenceFiles
  • Using Apache Avro to serialize data
  • Using Apache Thrift to serialize data
  • Using Protocol Buffers to serialize data
  • Setting the replication factor for HDFS
  • Setting the block size for HDFS

Introduction

The Hadoop Distributed File System (HDFS) is a fault-tolerant distributed filesystem designed to run on commodity, "off-the-shelf" hardware. It is optimized for streaming reads of large files, favoring high I/O throughput over low latency. In addition, HDFS uses a simple data-consistency model in which a file, once written and closed, cannot be modified.

HDFS treats disk failure as an eventuality and guards against it with block replication, which copies data across nodes in the cluster. HDFS uses a much larger block size than desktop filesystems; the default HDFS block size is 64 MB. When a file is placed into HDFS, it is split into one or more data blocks, which are distributed to nodes in the cluster. In addition, copies of those data blocks are made and distributed to other nodes so the data remains available in the event of a disk failure. The number of copies HDFS makes of each data block is controlled by the replication factor setting. The default replication factor is 3, meaning three replicas of each data block are distributed across the nodes in the cluster.
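
The replication factor and block size are normally set cluster-wide through the dfs.replication and dfs.block.size properties in hdfs-site.xml (the last two recipes in this chapter cover these settings), but they can also be supplied per file through the FileSystem API. The following is a minimal sketch, assuming a reachable cluster; the path /data/example.txt and the chosen values are hypothetical, not part of any recipe in this chapter:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSettingsExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml/hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file; these values override the cluster defaults
            // (dfs.replication = 3, dfs.block.size = 64 MB) for this file only
            Path path = new Path("/data/example.txt");
            short replication = 2;
            long blockSize = 128L * 1024 * 1024;   // 128 MB
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);

            FSDataOutputStream out =
                fs.create(path, true, bufferSize, replication, blockSize);
            out.writeBytes("sample record\n");
            out.close();

            // The replication factor of an existing file can be changed later
            fs.setReplication(path, (short) 3);
            fs.close();
        }
    }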

Finally, applications using HDFS can achieve high throughput because the Hadoop framework was designed to move computation to the data. In other words, applications can run on the nodes where the data resides instead of moving the data to the application. This concept is known as data locality.

HDFS consists of three services:

  • NameNode: Maintains a catalog of all block locations in the cluster.

  • Secondary NameNode: Periodically synchronizes with the NameNode's block index. During synchronization, the Secondary NameNode retrieves the current NameNode image and edit logs, merges them, and sends the merged image back to the NameNode. The Secondary NameNode is not a "hot" backup of the NameNode and cannot be used in the event of a NameNode failure.

  • DataNode: Manages the data blocks it receives from the NameNode. It is unaware of any other DataNodes in the cluster and communicates only with the NameNode.

This chapter will use the FileSystem API, MapReduce, and advanced serialization libraries to efficiently write and store data in HDFS.
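
As a preview of the first recipe, the following sketch uses the FileSystem API to write a small text file to HDFS and read it back. The path /tmp/hello.txt is hypothetical, and the code assumes a Configuration that resolves to the cluster's NameNode (for instance, via core-site.xml on the classpath):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small text file to HDFS (hypothetical path)
            Path path = new Path("/tmp/hello.txt");
            FSDataOutputStream out = fs.create(path, true);
            out.writeBytes("hello, HDFS\n");
            out.close();

            // Read the file back, line by line
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
            fs.close();
        }
    }

The same calls work whether the code runs standalone or inside a MapReduce job, since both read the cluster location from the loaded Configuration.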

Note

Hadoop version 0.20.x does not support the append operation.
