Using Datasets

This API was introduced in Apache Spark 1.6 as an experimental feature and finally became a first-class citizen in Apache Spark 2.0. It is basically a strongly typed version of DataFrames.

DataFrames are kept for backward compatibility and are not going to be deprecated, for two reasons. First, since Apache Spark 2.0 a DataFrame is nothing else but a Dataset where the type is set to Row. This means that you lose strong, static typing and fall back to dynamic typing. This is also the second reason why DataFrames are going to stay: dynamically typed languages such as Python or R are not capable of using Datasets because there is no concept of strong, static types in those languages.
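To make the first point concrete, here is a minimal sketch (assuming a SparkSession named spark, as in the examples that follow; the value names are illustrative) showing that DataFrame in Spark 2.x is just a type alias for Dataset[Row], so field access on a DataFrame is resolved dynamically at runtime:

import org.apache.spark.sql.{DataFrame, Row}

// In Spark 2.x the package object org.apache.spark.sql declares:
//   type DataFrame = Dataset[Row]
val untyped: DataFrame = spark.range(3).toDF("id")   // elements are generic Row objects
val firstRow: Row = untyped.first()
val id = firstRow.getAs[Long]("id")                  // field lookup happens at runtime, not at compile time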

So what are Datasets exactly? Let's create one:

import spark.implicits._

case class Person(id: Long, name: String)
val caseClassDS = Seq(Person(1, "Name1"), Person(2, "Name2")).toDS()
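As a quick sanity check (a sketch using the caseClassDS value we just created), the element type Person is known to the compiler, so typed operations and compile-time checks are available:

// p is statically typed as Person, so p.name is checked at compile time
val upperNames = caseClassDS.map(p => p.name.toUpperCase)
upperNames.show()
// caseClassDS.map(p => p.salary)   // would not compile: Person has no field salary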

As you can see, we define a case class in order to determine the types of the objects stored in the Dataset. This means that we have a strong, static type here that is already known at compile time, so no dynamic type inference takes place. This allows for a lot of further performance optimization and also adds compile-time type safety to your applications. We'll cover the performance optimization aspect in more detail in Chapter 3, The Catalyst Optimizer, and Chapter 4, Project Tungsten.

As you have seen before, DataFrames can be created from an RDD containing Row objects, and these also have a schema. Note that the difference between Datasets and DataFrames is that Row objects are not statically typed, as the schema can be created at runtime by passing StructType objects to the constructor (refer to the last section on DataFrames). As mentioned before, a DataFrame-equivalent Dataset would contain only elements of the Row type. However, we can do better: we can define a static case class matching the schema of our client data stored in client.json:

case class Client(
  age: Long,
  countryCode: String,
  familyName: String,
  id: String,
  name: String
)

Now we can re-read our client.json file, but this time we convert it to a Dataset:

val ds = spark.read.json("/Users/romeokienzler/Documents/romeo/Dropbox/arbeit/spark/sparkbuch/mywork/chapter2/client.json").as[Client]

Now we can use it similarly to DataFrames, but under the hood, typed objects are used to represent the data.
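For example (a sketch based on the ds value and the Client fields defined above), filters and maps now operate on Client objects rather than on generic Row objects:

// Each element is a Client, so field access is checked by the compiler
val adults = ds.filter(client => client.age >= 18)
val names  = adults.map(client => client.familyName + ", " + client.name)
names.show()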

If we print the schema, we get the same result as for the formerly used DataFrame.
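A sketch of the call and its expected output, assuming client.json contains exactly the fields declared in the Client case class:

ds.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- countryCode: string (nullable = true)
//  |-- familyName: string (nullable = true)
//  |-- id: string (nullable = true)
//  |-- name: string (nullable = true)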
