Chapter 2. HDFS and MapReduce

We now have a basic understanding of the Apache Hadoop architecture and its inner workings. In this chapter, we will dive deeper into the two major components of Apache Hadoop: HDFS and MapReduce. We will cover the following topics:

  • Essentials of the Hadoop Distributed File System
  • The read/write operational flow in HDFS
  • Exploring HDFS commands
  • Getting acquainted with MapReduce

Essentials of HDFS

HDFS is a distributed filesystem designed to run on a cluster of industry-standard hardware. Its architecture imposes no specific need for high-end machines. HDFS is a highly fault-tolerant system and can handle node failures in a cluster without loss of data. The primary design goal of HDFS is to serve large data files efficiently, and it achieves this efficiency and high throughput in data transfer by enabling streaming access to the data in the filesystem.

The following are the important features of HDFS:

  • Fault tolerance: Many computers working together as a cluster are bound to have hardware failures. Hardware failures such as disk failures, network connectivity issues, and RAM failures could disrupt processing and cause major downtime. This could lead to data loss as well as slippage of critical SLAs. HDFS is designed to withstand such hardware failures by detecting faults and taking recovery actions as required.

    The data in HDFS is split across the machines in the cluster as chunks of data called blocks. These blocks are replicated across multiple machines in the cluster for redundancy. So, even if a node becomes completely unusable and shuts down, processing can continue using the replicas of its data held on other nodes.

  • Streaming data: Streaming access enables data to be transferred as a steady, continuous stream. If data from a file in HDFS needs to be processed, HDFS starts sending the data as it reads the file rather than waiting for the entire file to be read. The client consuming this data can begin processing it immediately as the stream arrives from HDFS, which keeps the end-to-end processing pipeline fast.
  • Large data store: HDFS is used to store large volumes of data. HDFS functions best when the individual files stored are large, rather than a large number of small files. File sizes in most Hadoop clusters range from gigabytes to terabytes, and storage scales linearly as more nodes are added to the cluster.
  • Portable: HDFS is a highly portable system. Since it is written in Java, any machine or operating system that can run Java should be able to run HDFS. Even at the hardware layer, HDFS is flexible and runs on most commonly available hardware platforms. Most production-level clusters are set up on commodity hardware.
  • Easy interface: The HDFS command-line interface is very similar to that of any Linux/Unix system, and the commands are similar in most cases. So, if one is comfortable performing basic file actions in Linux/Unix, using HDFS commands should be very easy, as the sample session after this list shows.
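
The following is a short sample session illustrating the similarity; the paths, the filename, and the hduser user are illustrative assumptions, and on some Hadoop versions the hadoop fs command is used in place of hdfs dfs:

    # List the contents of the HDFS root directory (analogous to ls)
    hdfs dfs -ls /

    # Create a directory and upload a local file into it
    hdfs dfs -mkdir -p /user/hduser/data
    hdfs dfs -put sales.log /user/hduser/data/

    # Print the contents of the uploaded file (analogous to cat)
    hdfs dfs -cat /user/hduser/data/sales.log

    # Report the blocks, replicas, and their locations for the file
    hdfs fsck /user/hduser/data/sales.log -files -blocks -locations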

The following two daemons are responsible for operations on HDFS:

  • Namenode
  • Datanode

In Chapter 1, Getting Started with Apache Hadoop, we already covered the details of how the namenode and datanode daemons work together to store files in the cluster. These daemons talk to each other via TCP/IP.
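
A quick way to confirm that these daemons are running on a node is the jps utility that ships with the JDK, which lists the running Java processes. The following output is only an example; the process IDs and the daemons listed will vary with your cluster layout (on a multi-node cluster, the namenode and datanode daemons typically run on different machines):

    $ jps
    2801 NameNode
    2954 SecondaryNameNode
    3122 DataNode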

Configuring HDFS

All HDFS-related configuration is done by adding or updating properties in the hdfs-site.xml file, which is found in the conf folder under the Hadoop installation folder.

The following are some of the commonly configured properties in the hdfs-site.xml file (a quick way to verify their effective values is shown after the list):

  • dfs.namenode.servicerpc-address: This specifies the dedicated RPC address of the namenode in the cluster. Services/daemons such as the secondary namenode and datanode daemons use this address to connect to the namenode daemon whenever they need to communicate with it. This property is shown in the following code snippet:
    <property>
      <name>dfs.namenode.servicerpc-address</name>
      <value>node1.hcluster:8022</value>
    </property>
  • dfs.namenode.http-address: This specifies the HTTP address of the namenode web interface, which can be used to monitor the namenode daemon from a browser. This property is shown in the following code snippet:
    <property>
      <name>dfs.namenode.http-address</name>
      <value>node1.hcluster:50070</value>
    </property>
  • dfs.replication: This specifies the replication factor for data block replication across the datanode daemons. The default value is 3, as shown in the following code snippet:
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
  • dfs.blocksize: This specifies the HDFS block size in bytes. In the following example, it is set to 134,217,728 bytes, that is, 128 MB:
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>
    </property>
  • fs.permissions.umask-mode: This specifies the umask value that will be used when creating files and directories in HDFS. With the default umask of 022, directories are created with permissions 755 and files with permissions 644 (HDFS files carry no execute bit, so the base permissions are 777 for directories and 666 for files). This property is shown in the following code snippet:
    <property>
      <name>fs.permissions.umask-mode</name>
      <value>022</value>
    </property>
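
After editing hdfs-site.xml, the affected daemons have to be restarted for the changes to take effect. The effective values can then be verified with the hdfs getconf utility; the following is a minimal check using the property names configured above, and the exact output will depend on your configuration:

    # Print the effective value of individual properties
    hdfs getconf -confKey dfs.replication
    hdfs getconf -confKey dfs.blocksize

    # Print the namenode host(s) known to this configuration
    hdfs getconf -namenodes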

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
