Storage options

We have come across multiple ways of storing data in previous chapters. To choose among them sensibly, it helps to understand how each option works and what it is suited for.

Hadoop lets us load data in its natural form using basic HDFS commands, and the loaded data can then be exposed through Hive or Impala views. We saw this in action in earlier chapters. In addition, we used Serializer/Deserializer (SerDe) adaptors to tell the views how to interpret the data.
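As a rough illustration of this flow, the sketch below builds the HDFS upload command and a Hive statement that exposes a comma-delimited file. It assumes the "view" is realised as a Hive external table over delimited text; the helper names, paths, table name and columns are hypothetical and are not taken from the earlier chapters. The functions only assemble the command and the statement; they do not run them.

```python
def hdfs_put_command(local_path, hdfs_dir):
    """Return the HDFS shell command that copies a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


def external_table_ddl(table, columns, hdfs_dir, delimiter=","):
    """Build a Hive CREATE EXTERNAL TABLE statement over delimited text files.

    columns is a list of (name, hive_type) pairs; all values here are
    illustrative assumptions, not a schema from the book.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'"
    )


if __name__ == "__main__":
    # Hypothetical file and directory names.
    print(" ".join(hdfs_put_command("sales.csv", "/data/raw/sales")))
    print(external_table_ddl(
        "raw_sales",
        [("id", "INT"), ("city", "STRING"), ("amount", "DOUBLE")],
        "/data/raw/sales",
    ))
```

Because the table is external, dropping it in Hive leaves the underlying raw files in HDFS untouched, which suits data that is kept in its natural form.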

When handling RAW files, which contain data in its most natural form, the files are loaded into Hadoop for further processing. If the data in these RAW files is in a standard format, such as a CSV or TAB-delimited file, it is easy to visualize by simply creating a view over it. However, RAW files may also hold data in a non-standard format that requires further processing, and hence additional storage formats need to be defined. The major formats natively supported at the Hadoop layer are plain text files (CSV/TAB-delimited files), sequence files, Avro files, and Parquet files.

The choice of format should be driven by the purpose of the data being stored. A quick comparison of the options and their purposes is given here:

Text Format: Easy to load data into, at the expense of poorer compression and higher query overhead.

Sequence File Format: A binary key-value format, generally used to pack many small files into a larger one and to transfer data between MapReduce jobs.

Avro Format: A binary format that provides schema-based data storage, supports block compression, and reduces I/O for faster and more efficient queries.

Parquet Format: Another binary format, which stores data in a column-oriented layout and is generally useful for queries on specific columns.
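To see why a column-oriented layout helps queries on specific columns, consider the simplified in-memory sketch below. It is not the actual on-disk encoding of Avro or Parquet; the sample records and function names are made up for illustration, and the count of values touched is only a stand-in for the I/O a real engine would perform.

```python
# Hypothetical sample records, stored row by row.
rows = [
    {"id": 1, "city": "Pune", "amount": 120.0},
    {"id": 2, "city": "Delhi", "amount": 75.5},
    {"id": 3, "city": "Pune", "amount": 42.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_amount_row_layout(records):
    """Row layout: every full record is read even though one field is used."""
    total, values_touched = 0.0, 0
    for record in records:
        values_touched += len(record)
        total += record["amount"]
    return total, values_touched


def sum_amount_column_layout(columns):
    """Column layout: only the 'amount' column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


if __name__ == "__main__":
    print(sum_amount_row_layout(rows))
    print(sum_amount_column_layout(to_columns(rows)))
```

Both functions return the same total, but the row layout touches every field of every record while the column layout touches only the values of the one column being queried.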

These are the formats supported by HDFS in general; however, depending on where the data is stored, the storage patterns vary as well. To better understand this statement, let us consider the two NoSQL data stores that we came across in this book, namely HBase and Elasticsearch. While both belong to the NoSQL family of data stores, each employs a different way of storing and handling data. HBase is natively a non-indexed data store that runs over HDFS and stores data in column families, while Elasticsearch is an indexed data store that stores data as JSON documents and needs a direct storage mechanism for efficient queries.
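The difference between the two storage patterns can be sketched with plain Python structures. This is only a conceptual model under simplifying assumptions: it uses no HBase or Elasticsearch client API, and the row keys, column families, fields and helper names are all hypothetical.

```python
import json

# HBase-style model: row key -> column family -> qualifier -> value.
# Lookups go through the row key; nothing else is indexed.
hbase_table = {
    "user#1001": {"profile": {"name": "Asha", "city": "Pune"}, "stats": {"logins": "42"}},
    "user#1002": {"profile": {"name": "Ravi", "city": "Delhi"}, "stats": {"logins": "7"}},
    "user#1003": {"profile": {"name": "Meera", "city": "Pune"}, "stats": {"logins": "15"}},
}


def hbase_get(table, row_key, family):
    """Fetch all cells of one column family for a given row key."""
    return table[row_key][family]


# Elasticsearch-style model: each record is a JSON document, and an
# inverted index maps field values back to document ids.
documents = {
    "1001": json.dumps({"name": "Asha", "city": "Pune", "logins": 42}),
    "1002": json.dumps({"name": "Ravi", "city": "Delhi", "logins": 7}),
    "1003": json.dumps({"name": "Meera", "city": "Pune", "logins": 15}),
}


def build_inverted_index(docs, field):
    """Map each lower-cased value of a field to the ids of matching documents."""
    index = {}
    for doc_id, raw in docs.items():
        value = json.loads(raw)[field]
        index.setdefault(str(value).lower(), set()).add(doc_id)
    return index


if __name__ == "__main__":
    print(hbase_get(hbase_table, "user#1001", "profile"))
    city_index = build_inverted_index(documents, "city")
    print(sorted(city_index["pune"]))
```

In the first model, finding every user in a given city would mean scanning all rows, whereas in the second the index answers that question directly at the cost of building and maintaining the index.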
