Loading JSON into Spark

Spark can also access JSON data for manipulation. Here we have an example that:

  • Loads a JSON file into a Spark data frame
  • Examines the contents of the data frame and displays the apparent schema
  • Registers the data frame in the Spark session context for direct SQL access, as with the preceding data frames
  • Shows an example of accessing the data frame in the Spark context

The listing is as follows:

Our standard includes for Spark:

from pyspark import SparkContext
from pyspark.sql import SparkSession 
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)  

Read in the JSON and display what we found:

#using some data from file from https://gist.github.com/marktyers/678711152b8dd33f6346
df = spark.read.json("people.json")
df.show()  

I had a difficult time getting a standard JSON file to load into Spark. By default, Spark expects one complete JSON record per line of the file (the JSON Lines, or newline-delimited JSON, format), whereas most JSON files I have seen pretty-print each record across multiple lines with indentation.
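One way around this, sketched below with only the standard library, is to convert a pretty-printed JSON array into the one-record-per-line layout Spark expects. The sample records here are hypothetical, not the contents of the people.json file above:

```python
import json

# A pretty-printed JSON array (the common layout) -- hypothetical sample data
pretty = """[
  {"name": "Alice", "age": 30},
  {"name": "Bob"}
]
"""

# Convert to JSON Lines: one complete record per line, no enclosing array
records = json.loads(pretty)
json_lines = "\n".join(json.dumps(r) for r in records)
print(json_lines)
```

Writing json_lines out to a file gives something spark.read.json can consume directly. Since Spark 2.2 you can also keep the pretty-printed file and pass option("multiLine", True) to the JSON reader instead.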

Notice the use of null values where an attribute was not specified in an instance.

Display the interpreted schema for the data:

df.printSchema()  

The default for all columns is nullable. You can change a column's schema attributes (such as its name or type) by building a new data frame, but you cannot change a value in place: Spark data frames are immutable, so every transformation produces a new data frame.

Move the data frame into the context and access it from there:

# registerTempTable is deprecated in Spark 2.0+; use createOrReplaceTempView instead
df.registerTempTable("people")
spark.sql("select name from people").show()  

At this point, the people table works like any other temporary SQL table in Spark.
