Encoders

Spark 2.x supports a different way of defining schemas for complex data types: encoders. First, let's look at a simple example.

The Encoders object must be imported before you can use it:

import org.apache.spark.sql.Encoders

Let's look at a simple example of defining a tuple as a data type to be used in the dataset APIs:


scala> Encoders.product[(Integer, String)].schema.printTreeString
root
|-- _1: integer (nullable = true)
|-- _2: string (nullable = true)
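
An encoder built this way can also be passed explicitly to the Dataset APIs. The following is a minimal sketch, assuming a SparkSession named spark is already available (as it is in spark-shell):

import org.apache.spark.sql.Encoders

val pairEncoder = Encoders.product[(Integer, String)]
// pass the encoder explicitly instead of relying on implicits
val pairs = spark.createDataset(Seq((Integer.valueOf(1), "a"), (Integer.valueOf(2), "b")))(pairEncoder)
pairs.printSchema()  // prints the same _1/_2 tree shown above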

The preceding approach is cumbersome to use every time, so we can instead define a case class for our needs and use that. Here, we define a case class Record with two fields: an Integer and a String:

scala> case class Record(i: Integer, s: String)
defined class Record

Using Encoders, we can easily create a schema on top of the case class, which allows us to use the various APIs with ease:

scala> Encoders.product[Record].schema.printTreeString
root
|-- i: integer (nullable = true)
|-- s: string (nullable = true)
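
With spark.implicits._ in scope, the same case-class encoder is picked up implicitly, so we can create a typed Dataset directly. A minimal sketch, again assuming the spark-shell SparkSession named spark and the Record class defined above:

import spark.implicits._  // brings implicit encoders for case classes into scope

val records = spark.createDataset(Seq(Record(1, "a"), Record(2, "b")))
records.printSchema()  // prints the same i/s tree as Encoders.product[Record].schema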

All Spark SQL data types are located in the package org.apache.spark.sql.types. You can access them with the following import:

import org.apache.spark.sql.types._

You should use the DataTypes object in your code to create complex Spark SQL types such as arrays or maps, as follows:

scala> import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.DataTypes

scala> val arrayType = DataTypes.createArrayType(IntegerType)
arrayType: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,true)
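
DataTypes provides similar factory methods for maps and structs. A short sketch (the field name is illustrative):

import org.apache.spark.sql.types.{DataTypes, IntegerType, StringType}

val mapType = DataTypes.createMapType(StringType, IntegerType)        // valueContainsNull defaults to true
val nameField = DataTypes.createStructField("name", StringType, true) // a nullable field; name is illustrative
val structType = DataTypes.createStructType(Array(nameField))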

The following table lists the data types supported by the Spark SQL APIs:

Data type      Value type in Scala        API to access or create a data type
ByteType       Byte                       ByteType
ShortType      Short                      ShortType
IntegerType    Int                        IntegerType
LongType       Long                       LongType
FloatType      Float                      FloatType
DoubleType     Double                     DoubleType
DecimalType    java.math.BigDecimal       DecimalType
StringType     String                     StringType
BinaryType     Array[Byte]                BinaryType
BooleanType    Boolean                    BooleanType
TimestampType  java.sql.Timestamp         TimestampType
DateType       java.sql.Date              DateType
ArrayType      scala.collection.Seq       ArrayType(elementType, [containsNull])
MapType        scala.collection.Map       MapType(keyType, valueType, [valueContainsNull])
                                          Note: The default value of valueContainsNull is true.
StructType     org.apache.spark.sql.Row   StructType(fields)
                                          Note: fields is a Seq of StructFields. Also, two fields
                                          with the same name are not allowed.
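
To tie the table together, here is a minimal sketch of composing a StructType from StructFields; the column names and input path are hypothetical:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// such a schema can then be applied when reading data, for example:
// val people = spark.read.schema(schema).csv("/path/to/people.csv")  // hypothetical path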
