Complex data types – arrays, maps, and structs

So far, all the elements in our DataFrames have been of simple types. DataFrames support three additional collection types: arrays, maps, and structs.

Structs

The first compound type that we will look at is the struct. A struct is similar to a case class: it stores a set of key-value pairs, with a fixed set of keys. If we convert an RDD of a case class containing nested case classes to a DataFrame, Spark will convert the nested objects to a struct.

Let's imagine that we want to serialize Lord of the Rings characters. We might use the following object model:

case class Weapon(name: String, weaponType: String)
case class LotrCharacter(name: String, weapon: Weapon)

We want to create a DataFrame of LotrCharacter instances. Let's create some dummy data:

scala> val characters = List(
  LotrCharacter("Gandalf", Weapon("Glamdring", "sword")),
  LotrCharacter("Frodo", Weapon("Sting", "dagger")),
  LotrCharacter("Aragorn", Weapon("Anduril", "sword"))
)
characters: List[LotrCharacter] = List(LotrCharacter...

scala> val charactersDF = sc.parallelize(characters).toDF
charactersDF: org.apache.spark.sql.DataFrame = [name: string, weapon: struct<name:string,weaponType:string>]

scala> charactersDF.printSchema
root
 |-- name: string (nullable = true)
 |-- weapon: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- weaponType: string (nullable = true)

scala> charactersDF.show
+-------+-----------------+
|   name|           weapon|
+-------+-----------------+
|Gandalf|[Glamdring,sword]|
|  Frodo|   [Sting,dagger]|
|Aragorn|  [Anduril,sword]|
+-------+-----------------+

The weapon attribute in the case class was converted to a struct column in the DataFrame. To extract sub-fields from a struct, we can pass the field name to the column's .apply method:

scala> val weaponTypeColumn = charactersDF("weapon")("weaponType")
weaponTypeColumn: org.apache.spark.sql.Column = weapon[weaponType]

We can use this derived column just as we would any other column. For instance, let's filter our DataFrame to only contain characters who wield a sword:

scala> charactersDF.filter { weaponTypeColumn === "sword" }.show
+-------+-----------------+
|   name|           weapon|
+-------+-----------------+
|Gandalf|[Glamdring,sword]|
|Aragorn|  [Anduril,sword]|
+-------+-----------------+
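
The same sub-field can also be reached with the column's getField method, which takes the field name as an argument. The snippet below is an equivalent sketch; the exact rendering of the column may vary between Spark versions:

scala> val weaponTypeColumn2 = charactersDF("weapon").getField("weaponType")
weaponTypeColumn2: org.apache.spark.sql.Column = weapon[weaponType]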

Arrays

Let's return to the earlier example, and assume that, besides height, weight, and age measurements, we also have phone numbers for our patients. Each patient might have zero, one, or more phone numbers. We will define a new case class and new dummy data:

scala> case class PatientNumbers(
  patientId: Int, phoneNumbers: List[String])
defined class PatientNumbers

scala> val numbers = List(
  PatientNumbers(1, List("07929123456")),
  PatientNumbers(2, List("07929432167", "07929234578")),
  PatientNumbers(3, List.empty),
  PatientNumbers(4, List("07927357862"))
)

scala> val numbersDF = sc.parallelize(numbers).toDF
numbersDF: org.apache.spark.sql.DataFrame = [patientId: int, phoneNumbers: array<string>]

The List[String] in our case class gets translated to the array<string> data type:

scala> numbersDF.printSchema
root
 |-- patientId: integer (nullable = false)
 |-- phoneNumbers: array (nullable = true)
 |    |-- element: string (containsNull = true)

As with structs, we can construct a column that extracts a specific index in the array. For instance, we can select the first element of each array:

scala> val bestNumberColumn = numbersDF("phoneNumbers")(0)
bestNumberColumn: org.apache.spark.sql.Column = phoneNumbers[0]

scala> numbersDF.withColumn("bestNumber", bestNumberColumn).show
+---------+--------------------+-----------+
|patientId|        phoneNumbers| bestNumber|
+---------+--------------------+-----------+
|        1|   List(07929123456)|07929123456|
|        2|List(07929432167,...|07929432167|
|        3|              List()|       null|
|        4|   List(07927357862)|07927357862|
+---------+--------------------+-----------+
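
Spark also provides functions that operate on an array column as a whole. For instance, assuming Spark 1.5 or later, the size function in org.apache.spark.sql.functions counts the elements in each array. The following is a minimal sketch under that assumption:

scala> import org.apache.spark.sql.functions.size
import org.apache.spark.sql.functions.size

scala> numbersDF.withColumn("numberCount", size(numbersDF("phoneNumbers"))).show
+---------+--------------------+-----------+
|patientId|        phoneNumbers|numberCount|
+---------+--------------------+-----------+
|        1|   List(07929123456)|          1|
|        2|List(07929432167,...|          2|
|        3|              List()|          0|
|        4|   List(07927357862)|          1|
+---------+--------------------+-----------+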

Maps

The last compound data type is the map. Maps are similar to structs in that they store key-value pairs, but the set of keys is not fixed when the DataFrame is created: a map column can thus store arbitrary key-value pairs.

Scala maps will be converted to DataFrame maps when the DataFrame is constructed. They can then be queried in a manner similar to structs.
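
For instance, we might store each patient's allergies as a map from allergy name to severity. The PatientAllergies class and the data below are purely illustrative (a minimal sketch, not part of the earlier dataset):

scala> case class PatientAllergies(
  patientId: Int, allergies: Map[String, String])
defined class PatientAllergies

scala> val allergies = List(
  PatientAllergies(1, Map("penicillin" -> "severe")),
  PatientAllergies(2, Map.empty)
)

scala> val allergiesDF = sc.parallelize(allergies).toDF
allergiesDF: org.apache.spark.sql.DataFrame = [patientId: int, allergies: map<string,string>]

scala> val penicillinColumn = allergiesDF("allergies")("penicillin")
penicillinColumn: org.apache.spark.sql.Column = allergies[penicillin]

We can then use this column to filter or transform the DataFrame, just as we did with structs and arrays; looking up a key that is absent from a row's map yields null for that row.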
