Leveraging JSON as a data format

In this section, we will leverage JSON as a data format and save our data in JSON. The following topics will be covered:

  • Saving data in JSON format
  • Loading JSON data
  • Testing

JSON data is human-readable and carries more meaning than plain text because it includes some schema information, such as field names. We will learn how to save data in JSON format and then load it back.

We will first create an RDD of UserTransaction("a", 100) and UserTransaction("b", 200), and use .toDF() to convert it into a DataFrame:

val rdd = spark.sparkContext
  .makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
  .toDF()
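
Note that for .toDF() to work on this RDD, the Spark implicits need to be in scope and UserTransaction needs to be a case class. A minimal sketch of these prerequisites (the exact field types are an assumption, since they are not shown in this section) looks like this:

// Assumed prerequisites for calling .toDF() on an RDD of UserTransaction
// (a sketch; the field types are not given in this section)
case class UserTransaction(userId: String, amount: Int)
import spark.implicits._   // brings in the encoders that enable rdd.toDF()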

We will then call coalesce() and, this time, pass the value 2 so that we end up with two resulting files. We will then call the write.format method, for which we need to specify a format; here, we will use the json format:

rdd.coalesce(2).write.format("json").save(FileName)

If we use an unsupported format, we will get an exception. Let's test this by specifying the format as not:

rdd.coalesce(2).write.format("not").save(FileName)

We will get exceptions such as 'This format is not expected', 'Failed to find data source: not', and 'There is no such data source'.
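
If we want the test to assert this behaviour rather than just observe it, one option is ScalaTest's intercept. This is only a sketch: the concrete exception type differs between Spark versions, so we match on the message fragment quoted above.

// Assert that writing with an unknown format fails; we check only the
// "Failed to find data source" fragment because the wrapping exception
// type can differ between Spark versions.
val thrown = intercept[Exception] {
  rdd.coalesce(2).write.format("not").save(FileName)
}
assert(thrown.getMessage.contains("Failed to find data source"))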

In our original JSON code, we specify the json format and save the data to FileName. If we want to read the data back, we need to use the read method and pass the path to the folder:

val fromFile = spark.read.json(FileName)

On this occasion, let's comment out afterEach() to investigate the produced JSON:

// override def afterEach() {
//   val path = Path(FileName)
//   path.deleteRecursively()
// }

Let's start the test:

    fromFile.show()
    assert(fromFile.count() == 2)
  }
}
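
For reference, here is a minimal sketch of how the fragments in this section might fit together. The class name, the SparkSession setup, the FileName value, and the field types are assumptions rather than the book's exact code, and it assumes a ScalaTest version where org.scalatest.FunSuite is available.

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import scala.reflect.io.Path

case class UserTransaction(userId: String, amount: Int)

class SaveJsonSpec extends FunSuite with BeforeAndAfterEach {
  val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("save-json-test")
    .getOrCreate()
  val FileName = "transactions.json"

  import spark.implicits._

  test("should save and load JSON") {
    // given: an RDD of two transactions converted to a DataFrame
    val rdd = spark.sparkContext
      .makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
      .toDF()

    // when: we write two JSON part files and read the folder back
    rdd.coalesce(2).write.format("json").save(FileName)
    val fromFile = spark.read.json(FileName)

    // then: both records are present
    fromFile.show()
    assert(fromFile.count() == 2)
  }

  override def afterEach() {
    // remove the output folder so the test can run again
    val path = Path(FileName)
    path.deleteRecursively()
  }
}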

The output is as follows:

+------+------+
|amount|userId|
+------+------+
|   200|     b|
|   100|     a|
+------+------+

In the preceding code output, we can see that our test passed and that the DataFrame includes all the meaningful data.

From the output, we can see that the DataFrame has the schema we need: it has both the amount and userId columns, which is very useful.
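
If we want to confirm the inferred schema programmatically rather than by eyeballing the show() output, a small optional check (not part of the original test) could be:

// Optional check: print the inferred schema and verify the column names.
fromFile.printSchema()
assert(fromFile.columns.sorted.sameElements(Array("amount", "userId")))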

The transactions.json folder contains two part files, one ending in r-00000 and the other in r-00001, because we coalesced the data into two partitions. If we save the data in a production system with 100 partitions, we will end up with 100 part files and, furthermore, every part file will have an accompanying CRC checksum file.
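
To see these files on disk, a quick way (a sketch using plain java.io, assuming the data was written to the local filesystem) is to list the output folder:

// Lists what save() produced: the part files, their hidden .crc checksum
// companions, and typically a _SUCCESS marker file.
import java.io.File
new File(FileName).listFiles().map(_.getName).sorted.foreach(println)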

This is the first file:

{"userId":"a","amount":"100"}

Here, we have a JSON record that carries its field names, so we have a userId field and an amount field.

On the other hand, the second file contains the second record, again with userId and amount fields:

{"userId":"b","amount":"200"}

The advantage of this is that Spark is able to infer the schema from the data and load it into a DataFrame with proper column names and types. The disadvantage, however, is that every record carries some additional overhead: the field names are repeated as strings in every record, so if we have a file with millions of records and we are not compressing it, there will be substantial overhead, which is not ideal.
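
One way to reduce the on-disk overhead, at the cost of some extra CPU time, is to compress the output. A sketch using the JSON data source's compression option (gzip chosen here as an example) might look like this:

// Write the same DataFrame as gzip-compressed JSON to shrink the repeated
// field-name overhead (illustrative; other codecs are also supported).
rdd.coalesce(2)
  .write
  .option("compression", "gzip")
  .format("json")
  .save(FileName)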

JSON is human-readable but, on the other hand, it consumes a lot of resources: CPU for compressing, reading, and writing, and also disk and memory for the overhead. Apart from JSON, there are better-suited formats, which we will cover in the following sections.

In the next section, we will look at the tabular format, where we will cover the CSV file, which is often used to import data into Microsoft Excel or Google Sheets. This is also a very useful format for data scientists, but only when working with smaller datasets.
