Parquet

Apache Parquet is a columnar storage format in which the structure of the data is incorporated into the file itself. It is available to any project in the Hadoop ecosystem and is a key format for analytics. It was designed to meet the goals of interoperability, space efficiency, and query efficiency. Parquet files can be stored in HDFS as well as non-HDFS filesystems.


Columnar storage works well for analytics because the data is stored and arranged by table columns instead of rows. Analytics use cases typically select a handful of columns and perform aggregation functions on the values, such as sum, average, or standard deviation. When the data is stored in columns, reading the requested data is faster and requires less disk input/output (I/O).

The data is typically stored in an ordered form, making it easy to locate the needed sections, and only the data for the selected columns is read. In contrast, a row-oriented format typically requires the entire row to be read in order to get at the needed column values. Columnar storage is great for analytics but not all sunshine and roses; it is a poor fit for transactional use cases, where whole rows are written and read one at a time.
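The column pruning described above can be seen directly when reading a Parquet file. The following is a minimal sketch using the pyarrow library (not part of the original text); the file name and column names are hypothetical.

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Read only the columns the query needs; the remaining columns are never
# fetched from disk, which is where the columnar I/O saving comes from.
table = pq.read_table("sales.parquet", columns=["region", "amount"])

# Aggregate over one of the projected columns.
print(pc.mean(table["amount"]))
```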

Parquet is intended to support very efficient compression and encoding schemes. It allows a different compression scheme for each column in the dataset. It was also designed to be extensible, allowing more encodings to be added as they are invented.
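One way to see per-column compression in practice is through pyarrow, which accepts a per-column compression mapping when writing; the data and file name below are made up for illustration.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "comment": ["ok", "great", "needs work"],
})

# Each column chunk gets the codec that suits its data: a lightweight codec
# for the numeric column, a heavier one for the text column.
pq.write_table(
    table,
    "events.parquet",
    compression={"user_id": "snappy", "comment": "gzip"},
)
```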

A Parquet file is organized into nested components. A row group contains all of the data for a horizontal division of the dataset; within a row group, the values for each column are stored next to each other to minimize disk I/O requirements.

The set of values for a single column within a row group is called a column chunk. The column chunk is further divided into pages. A footer at the tail end of the file stores the overall metadata for the file and its column chunks. It is placed at the end so that the write operation that creates the file can be done in a single pass; this is necessary because the full metadata is not known until all of the data has been written to the file. The following diagram represents the divisions within the Parquet file:
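This layout can be observed from the footer metadata. The following sketch uses pyarrow (the file name is hypothetical) to walk from the file-level metadata down to a single column chunk:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")

# File-level metadata, read from the footer.
meta = pf.metadata
print(meta.num_rows, meta.num_row_groups)

# Drill down into the first row group and its first column chunk.
rg = meta.row_group(0)
chunk = rg.column(0)
print(chunk.path_in_schema, chunk.compression, chunk.total_compressed_size)
```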

Parquet file representation. Source: Apache Parquet documentation

The file metadata holds the information necessary to locate and interpret the data properly. There are three types of metadata, which match up with the key nested components: file metadata, column (chunk) metadata, and page header metadata. The following diagram shows the logical relationships of the metadata parts:

Parquet file metadata detail. Source: Apache Parquet documentation

The data types are intentionally simple to give the frameworks that use the format more flexibility; see the Type enum in the previous diagram for the supported types. Strings are stored as binary with the ConvertedType set to UTF8.
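As a rough way to see the simple physical types and the annotations layered on top of them, one can inspect a file's Parquet-level schema with pyarrow; the file name here is again hypothetical.

```python
import pyarrow.parquet as pq

schema = pq.ParquetFile("events.parquet").schema  # the Parquet-level schema

for i in range(len(schema)):
    col = schema.column(i)
    # physical_type is one of the simple Parquet types (e.g. BYTE_ARRAY);
    # logical_type records annotations such as String (UTF8) on top of it.
    print(col.name, col.physical_type, col.logical_type)
```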

The recommended row group size is 512 MB to 1 GB, with the HDFS block size set to match so that an entire row group can be read with a single file block retrieval. Parquet files conventionally use the .parquet extension (the shorter .parq is also seen).
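When writing with pyarrow, the row group size is controlled through the row_group_size parameter, which is counted in rows rather than bytes, so the row count is chosen to approximate the target on-disk size; the data and figures below are purely illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})  # placeholder data

# row_group_size caps the number of rows per row group; pick a count whose
# encoded size lands near the recommended 512 MB to 1 GB range for your data.
pq.write_table(table, "events.parquet", row_group_size=1_000_000)
```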
