Using Avro with Spark

So far, we have looked at text-based files. We worked with plain text, JSON, and CSV. JSON and CSV are better than plain text because they carry some schema information.

In this section, we'll be looking at Avro, a binary format that embeds its schema alongside the data. The following topics will be covered:

  • Saving data in Avro format
  • Loading Avro data
  • Testing

Avro has a schema and data embedded within it. This is a binary format and is not human-readable. We will learn how to save data in Avro format, load it, and then test it.
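The examples assume a simple UserTransaction case class and a FileName constant pointing at the output directory. A minimal sketch of these supporting definitions follows; the path is illustrative and matches the transactions.avro folder we inspect later:

// supporting definitions assumed by the test below
case class UserTransaction(userId: String, amount: Int)

// directory the Avro output is written to; the value is illustrative
val FileName = "transactions.avro"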

First, we will create our user transaction:

 test("should save and load avro") {
//given
import spark.sqlContext.implicits._
val rdd = spark.sparkContext
.makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
.toDF()

We will then coalesce the DataFrame and write it out in Avro format:

  //when
  rdd.coalesce(2)
    .write
    .avro(FileName)

With CSV we specified the format explicitly, and when we wrote JSON, that too was a format string. With Avro, however, we call an avro method. This method is not a standard Spark method; it comes from a third-party library. To get Avro support, we need to open build.sbt and add the spark-avro dependency from com.databricks, as shown below.
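The exact dependency line depends on your Spark and Scala versions; a minimal sketch for build.sbt (the version number here is only an example):

// build.sbt -- Databricks spark-avro connector; pick the version matching your Spark release
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"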

We then need to import the proper method. We will import com.databricks.spark.avro._ to get the implicit conversion that extends Spark's DataFrameWriter:

import com.databricks.spark.avro._

We are actually using an avro method here; the implicit class wraps the DataFrameWriter class and writes our data in Avro format.
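Conceptually, the extension works roughly like this; the following is a simplified sketch of the idea, not the library's actual source:

// simplified sketch of how spark-avro adds an avro method to DataFrameWriter
import org.apache.spark.sql.DataFrameWriter

implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) {
  // delegate to the generic data source API with the Avro format name
  def avro(path: String): Unit =
    writer.format("com.databricks.spark.avro").save(path)
}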

Alternatively, we could take the coalesce code we used previously, call write, specify the format as the full com.databricks.spark.avro source name, and then save. The avro method is simply a convenience so that we do not have to write com.databricks.spark.avro as a whole string:

  //when
  rdd.coalesce(2)
    .write
    .format("com.databricks.spark.avro")
    .save(FileName)

In short, there is no need to specify the format; just apply the implicit avro method.

Let's comment out the cleanup code in afterEach, so that the Avro output is not removed, and check how it is saved on disk:

// override def afterEach() {
// val path = Path(FileName)
// path.deleteRecursively()
// }
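For reference, when the cleanup is enabled it deletes the output directory after each test; a minimal sketch, assuming Path comes from scala.reflect.io as in the earlier file-format tests:

import scala.reflect.io.Path

override def afterEach() {
  // delete the Avro output directory so each test run starts clean
  val path = Path(FileName)
  path.deleteRecursively()
}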

If we open the transactions.avro folder, we have two part files, part-r-00000 and part-r-00001, because we coalesced the data into two partitions.

The first part contains binary data: a number of binary records preceded by some human-readable data, which is our schema.

We have two fields: userId, whose type is string or null, and amount, which is an integer. Being a JVM primitive type, it cannot hold null values. The important thing to note is that, in production systems, we save really large datasets with many thousands of records, yet the schema appears only once, at the beginning of each file. If we check the second part as well, we will see exactly the same schema followed by the binary data.
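The embedded schema itself is plain JSON; it looks roughly like the following (the record name and exact layout are assumptions about what spark-avro typically generates for our two fields):

{"type": "record", "name": "topLevelRecord", "fields": [
  {"name": "userId", "type": ["string", "null"]},
  {"name": "amount", "type": "int"}]}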

The schema usually takes only a single line, or a few more for a complex schema, but it is still a very small amount of data compared to the records themselves.
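To complete the round trip promised by the test name (should save and load avro), we can read the data back through the same implicit import; a minimal sketch, assuming FileName still points at the directory we just wrote:

  //when loading back
  val loaded = spark.read.avro(FileName)

  //then
  loaded.show()
}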

We can see that in the resulting dataset, we have userId and amount:

+------+------+
|userId|amount|
+------+------+
| a| 100|
| b| 200|
+------+------+

As the preceding output shows, the schema embedded in the file was enough to reconstruct the data. Although Avro is a binary format, we can extract both the schema and the records from it.

In the next section, we will be looking at the columnar format—Parquet.
