Hadoop data storage formats

Since Hadoop involves storing very large-scale data, it is essential to select a storage type that is appropriate for your use cases. There are several formats in which data can be stored in Hadoop, and the choice of the optimal storage format depends on your requirements in terms of read/write I/O speed, how well the files can be compressed and decompressed on demand, and how easily the files can be split, since the data will eventually be stored as blocks.

Some of the popular and commonly used storage formats are as follows:

  • Text/CSV: These are plain text files containing comma-separated values; they can be opened in spreadsheet tools such as Excel, but they are stored as plain text. Since each line holds exactly one record, the files are naturally trivial to split into blocks of data at line boundaries (a minimal sketch follows this list).
  • Avro: Avro was developed to make sharing data across heterogeneous systems more efficient. It stores both the schema and the actual data in a single compact binary file using data serialization. Avro uses JSON to describe the schema and a binary format for the data, and serializes them together into a single Avro Object Container File. Many languages, such as Python, Scala, and C/C++, have native APIs that can read Avro files; consequently, the format is very portable and well suited for cross-platform data exchange (see the Avro sketch after this list).
  • Parquet: Parquet is a columnar data storage format. This can improve performance, sometimes significantly, by permitting data to be stored and accessed on a per-column basis. Intuitively, if you were working with a 1 GB file with 100 columns and 1 million rows and wanted to query only one of the 100 columns, reading just that individual column would be far more efficient than reading the entire file (see the Parquet sketch after this list).
  • ORCFiles: ORC stands for Optimized Row Columnar. In a sense, it is a further layer of optimization over purely columnar formats such as Parquet. ORCFiles store data not only by column but also by groups of rows known as stripes. A file with data in tabular format can thus be split into multiple smaller stripes, where each stripe comprises a subset of the rows from the original file. By splitting data in this manner, a task that requires access to only a small subsection of the data can read just the specific stripes that hold it (see the ORC sketch after this list).
  • SequenceFiles: In SequenceFiles, data is represented as key-value pairs and stored in a binary serialized format. The serialization yields a compact binary representation that not only reduces the data size but also improves I/O. Hadoop, and more concretely HDFS, is inefficient when there are many files of a small size, such as audio files. SequenceFiles address this problem by allowing multiple small files to be packed into a single unit, the SequenceFile. They are also splittable, and therefore well suited for parallel operations and efficient as inputs to MapReduce jobs.
  • HDFS Snapshots: HDFS Snapshots allow users to preserve data at a given point in time in a read-only form. Users can create snapshots (in essence, a replica of the data as it exists at that point in time) in HDFS, so that the data can be retrieved at a later stage as and when needed. This ensures that data can be recovered in the event of file corruption or any other failure that affects its availability; in that regard, a snapshot can also be considered a form of backup. Snapshots are stored in a .snapshot directory under the path where they were created (see the snapshot sketch after this list).
  • Handling of node failures: Large Hadoop clusters can contain tens of thousands of nodes, so server failures on any given day are likely. So that the NameNode is aware of the status of all nodes in the cluster, the DataNodes send it a periodic heartbeat. If the NameNode stops receiving heartbeats from a server, it marks that server as failed and re-replicates the data that was local to it onto other healthy nodes, restoring the configured replication factor (a toy sketch of this heartbeat mechanism appears below).
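
The following sketch illustrates the line-per-record property of CSV mentioned above; the file name and columns are invented for the example, and it uses only Python's standard csv module.

```python
import csv

# Hypothetical example data; each row becomes one line in the file.
rows = [
    {"id": 1, "name": "alice", "score": 91},
    {"id": 2, "name": "bob", "score": 78},
]

with open("example.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Reading back: one line per record, so a block boundary can fall on any
# newline and each split remains independently parseable.
with open("example.csv", newline="") as f:
    for record in csv.DictReader(f):
        print(record)
```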
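
Next, a minimal sketch of writing and reading an Avro Object Container File using the third-party fastavro package; the schema, field names, and file name are invented for illustration.

```python
from fastavro import parse_schema, reader, writer

# JSON-style schema; Avro stores this alongside the binary-encoded records.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# Write schema + data into a single Avro Object Container File.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Any Avro-aware reader (in any language) can recover both the schema
# and the records from the same file.
with open("users.avro", "rb") as inp:
    for record in reader(inp):
        print(record)
```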
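
The column-pruning benefit of Parquet can be sketched with pyarrow; the table, column names, and file name are invented, and a real file would of course be far larger.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table; a real file might have 100 columns and millions of rows.
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "IN"],
    "spend": [10.5, 3.2, 7.9],
})
pq.write_table(table, "events.parquet")

# Columnar layout: only the requested column is read from disk.
spend_only = pq.read_table("events.parquet", columns=["spend"])
print(spend_only)
```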
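
A similar sketch applies to ORC, assuming a reasonably recent pyarrow build with ORC support; again, the table and file name are invented.

```python
import pyarrow as pa
from pyarrow import orc

table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "IN"],
    "spend": [10.5, 3.2, 7.9],
})

# ORC groups rows into stripes and stores each column separately within a
# stripe, so readers can skip both unneeded rows and unneeded columns.
orc.write_table(table, "events.orc")

spend_only = orc.read_table("events.orc", columns=["spend"])
print(spend_only)
```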
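
A small sketch of the HDFS snapshot workflow, driven from Python via the standard hdfs command-line tools; the directory path and snapshot name are hypothetical, and it assumes the hdfs client is on the PATH and that you have permission to make the directory snapshottable.

```python
import subprocess

DATA_DIR = "/data/warehouse"        # hypothetical HDFS directory
SNAPSHOT_NAME = "daily-2021-01-01"  # hypothetical snapshot name

def hdfs(*args):
    """Run an hdfs CLI command and fail loudly if it errors."""
    subprocess.run(["hdfs", *args], check=True)

# An administrator must first mark the directory as snapshottable.
hdfs("dfsadmin", "-allowSnapshot", DATA_DIR)

# Create a read-only, point-in-time snapshot of the directory.
hdfs("dfs", "-createSnapshot", DATA_DIR, SNAPSHOT_NAME)

# Snapshots appear under the hidden .snapshot directory and can be read
# (or copied back out) like normal files if recovery is needed.
hdfs("dfs", "-ls", f"{DATA_DIR}/.snapshot/{SNAPSHOT_NAME}")
```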
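
Finally, a toy sketch of the heartbeat-and-timeout idea behind node-failure handling; this is not Hadoop's actual NameNode implementation (real clusters use configurable heartbeat and recheck intervals), just the concept of marking a node dead when its heartbeats stop and re-replicating its blocks.

```python
import time

HEARTBEAT_TIMEOUT_S = 30  # illustrative value, not Hadoop's default

class ToyNameNode:
    def __init__(self):
        self.last_heartbeat = {}   # DataNode id -> last heartbeat time
        self.block_locations = {}  # DataNode id -> set of block ids

    def receive_heartbeat(self, datanode_id):
        self.last_heartbeat[datanode_id] = time.time()

    def check_liveness(self):
        now = time.time()
        for node, last_seen in list(self.last_heartbeat.items()):
            if now - last_seen > HEARTBEAT_TIMEOUT_S:
                # Node stopped heartbeating: mark it dead and
                # re-replicate its blocks onto healthy nodes.
                dead_blocks = self.block_locations.pop(node, set())
                del self.last_heartbeat[node]
                self.re_replicate(dead_blocks)

    def re_replicate(self, blocks):
        # Placeholder: a real NameNode would pick target DataNodes so that
        # each block regains its configured replication factor.
        print(f"re-replicating {len(blocks)} blocks")
```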