Understanding columnar storage

With Apache Spark V2.0, columnar storage was introduced. Many on-disk formats, such as Parquet, and relational databases, such as IBM DB2 BLU or dashDB, already support it, so adding it to Apache Spark was an obvious choice. So what is it all about? Consider the following figure:

If we now transpose, we get the following column-based layout:

In contrast to row-based layouts, where the fields of an individual record are stored close together in memory, columnar storage places the values of the same column from different records close together in memory. This changes performance significantly. Not only can columnar data such as Parquet be read faster by an order of magnitude, but columnar storage also helps when indexing individual columns and in projection operations, because only the columns that are actually needed have to be touched.
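The difference between the two layouts can be illustrated with a small sketch in plain Python. This is not how Apache Spark implements columnar storage internally; the sample records and the helper name to_columns are hypothetical and chosen only for illustration. The sketch transposes a list of records into one list per column and then performs the same projection on both layouts.

```python
# Hypothetical sample records; field names and values are illustrative only.
rows = [
    {"id": 1, "name": "alice", "age": 34},
    {"id": 2, "name": "bob", "age": 27},
    {"id": 3, "name": "carol", "age": 45},
]


def to_columns(records):
    """Transpose a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Projection on the row layout has to visit every record and pick one field.
ages_from_rows = [record["age"] for record in rows]

# Projection on the column layout is a single lookup; the values of the
# column are already stored next to each other.
ages_from_columns = columns["age"]

print(columns)
print(ages_from_rows == ages_from_columns)
```

In the column-based form, a query that needs only the age column never looks at id or name, which is the intuition behind the projection benefit described above.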
