HDFS and formats

Hadoop stores data in blocks of 64, 128, or 256 MB. Hadoop also recognizes many of the common file formats and handles them accordingly when they are stored. It supports compression, but whether a compressed file can still be split and read with random seeks depends on the codec: some compression formats are splittable, while others are not. Hadoop has a number of default compression codecs, which fall into the following categories (a short example follows this list):

  • File-based: This is similar to how you compress individual files on your desktop. The codec compresses the whole file as a single unit, regardless of the file format it contains. Some of the resulting formats support splitting while others don't, but most of them can be persisted in Hadoop.
  • Block-based: As we know, data in Hadoop is stored in blocks, and this codec compresses each block separately, which makes it possible to split the data at block boundaries.
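
To make the file-based case concrete, here is a minimal Java sketch (assuming Hadoop's client libraries are on the classpath) that streams an existing HDFS file through the GzipCodec to produce a whole-file compressed copy; the /data/events.log input path is purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CompressToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Gzip compresses the whole file and is not splittable;
            // BZip2Codec would produce a splittable (but slower) result.
            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

            Path in = new Path("/data/events.log");                                // illustrative input
            Path out = new Path("/data/events.log" + codec.getDefaultExtension()); // -> .gz

            try (java.io.InputStream is = fs.open(in);
                 java.io.OutputStream os = codec.createOutputStream(fs.create(out))) {
                IOUtils.copyBytes(is, os, conf); // stream the bytes through the codec
            }
        }
    }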

However, compression increases CPU utilization and can therefore degrade performance. Hadoop can store a variety of traditional file formats, but it also has several file formats designed specifically for it, as shown here:

  • Text storage (CSV--Comma Separated Values, TSV--Tab Separated Values, JSON--JavaScript Object Notation, and so on): Text files where each record is stored on a line, with a delimiter at the end to demarcate the record. You can also use well-formed JSON as a record and store it in Hadoop. When using these formats, it's common to apply one of the compression codecs described earlier, as plain text has no compression of its own.
  • Avro: It's a file format with built-in serialization and deserialization capability. It allows storing simple and complex objects, abstracts many of the complexities away from you, and comes with many tools that make managing this data easy. It also supports block-level compression and is one of the most popular Hadoop file formats (see the Avro sketch after this list).
  • Sequence File: It is designed for MapReduce, so Hadoop's support for it is quite extensive. Each record is an encoded key-value pair, and the format supports block-level compression (see the SequenceFile sketch after this list).
  • Columnar file formats: As the name suggests, these formats partition data both horizontally (by row) and vertically (by column) in Hadoop for easy access to subsets of the data (for example, the values of one column across all records). If you plan to query the data and want to slice and dice it, these formats can be quite handy compared with row-only storage:
    • Parquet: It is the most widely used columnar file format
    • RCFile (Record Columnar File): It is the first columnar file format in Hadoop, originally developed for Hive, and it has good compression and performance.
    • ORC File (Optimized Row Columnar file): This is a compressed and optimized successor to RCFile; it compresses and performs better than RCFile does
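
As a rough illustration of Avro's built-in serialization and block-level compression, here is a minimal Java sketch; the Avro library is assumed to be on the classpath, and the User schema, its fields, and the users.avro output file are made up for the example:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriteExample {
        public static void main(String[] args) throws Exception {
            // Illustrative record schema with two fields
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"long\"},"
              + "{\"name\":\"name\",\"type\":\"string\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 1L);
            user.put("name", "alice");

            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.setCodec(CodecFactory.deflateCodec(6)); // block-level compression
                writer.create(schema, new File("users.avro")); // schema is embedded in the file
                writer.append(user);
            }
        }
    }

Because the schema travels inside the file, any reader can deserialize the records without external metadata, which is a large part of Avro's convenience in Hadoop.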

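Similarly, a SequenceFile can be written directly with Hadoop's own API. The following sketch assumes a reachable HDFS (or local) filesystem; the /data/events.seq path and the key/value types are illustrative, and block-level compression is enabled with the default codec:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class SequenceFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/data/events.seq"); // illustrative output path

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
                // Append key-value records; blocks are compressed as they fill up
                for (int i = 0; i < 100; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }
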
Choose the format that best suits your use case and its requirements. When making the selection, be sure to consider important aspects such as how you want to read the data and how fast you need to read it (performance).
