Dataset

Dataset was introduced in Spark 1.6. It combines the RDD and DataFrame abstractions, bringing together the compile-time type safety and object-oriented programming style of RDDs with the advances of DataFrames. A Dataset is therefore an immutable, strongly typed collection of objects that uses a schema to describe its data. It relies on Tungsten, Spark's efficient off-heap storage mechanism, and its queries are turned into optimized plans by Spark's Catalyst optimizer before they are executed.

Datasets also introduced the concept of encoders. Encoders act as translators between JVM objects and Spark's internal binary format, which is where the tabular, schema-described representation of the data is stored. Encoders allow operations to be performed directly on serialized data. Spark comes with various built-in encoders, along with an encoder API for JavaBeans. Encoders also allow individual attributes to be accessed without deserializing the entire object, which reduces serialization effort and load.
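
The idea of reading one attribute without deserializing the whole object can be illustrated with a small, Spark-free sketch. The following Java example is only a simplified analogy: the class names, the `Person` type, and the byte layout are all assumptions made for illustration, and they are not Spark's encoder API or Tungsten's actual binary format. It shows that when the position of each field is known from a schema, a single value can be read straight from the serialized bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Simplified analogy only: NOT Spark's encoder API or Tungsten's real row format.
class FixedLayoutSketch {

    // Hypothetical object type used for this illustration.
    static class Person {
        final long id;
        final int age;
        final String name;

        Person(long id, int age, String name) {
            this.id = id;
            this.age = age;
            this.name = name;
        }
    }

    // Assumed layout ("schema") of one serialized row:
    // bytes 0-7: id, bytes 8-11: age, bytes 12-15: name length, then UTF-8 name bytes.
    static final int ID_OFFSET = 0;
    static final int AGE_OFFSET = 8;
    static final int NAME_LEN_OFFSET = 12;
    static final int NAME_OFFSET = 16;

    // "Encoder" direction: object -> binary row.
    public static ByteBuffer encode(Person p) {
        byte[] name = p.name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(NAME_OFFSET + name.length);
        buf.putLong(ID_OFFSET, p.id);
        buf.putInt(AGE_OFFSET, p.age);
        buf.putInt(NAME_LEN_OFFSET, name.length);
        buf.position(NAME_OFFSET);
        buf.put(name);
        buf.rewind();
        return buf;
    }

    // Full deserialization: binary row -> object (touches every field).
    public static Person decode(ByteBuffer buf) {
        byte[] name = new byte[buf.getInt(NAME_LEN_OFFSET)];
        ByteBuffer view = buf.duplicate(); // leave the caller's position untouched
        view.position(NAME_OFFSET);
        view.get(name);
        return new Person(buf.getLong(ID_OFFSET), buf.getInt(AGE_OFFSET),
                new String(name, StandardCharsets.UTF_8));
    }

    // Attribute access on serialized data: reads 4 bytes at a known offset,
    // without building a Person or decoding the name.
    public static int readAge(ByteBuffer buf) {
        return buf.getInt(AGE_OFFSET);
    }

    public static void main(String[] args) {
        ByteBuffer row = encode(new Person(1L, 36, "Ada"));
        System.out.println(readAge(row));      // prints 36
        System.out.println(decode(row).name);  // prints Ada
    }
}
```

In this sketch, a filter such as "age greater than 30" would only need `readAge`, so no `Person` object or name string is ever created for rows that are merely inspected. That is the kind of saving the paragraph above attributes to encoders.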
