Spark can also access JSON data for manipulation. Here we have an example that:
- Loads a JSON file into a Spark data frame
- Examines the contents of the data frame and displays the apparent schema
- As with the preceding data frames, registers the data frame in the Spark context for direct access by the Spark session
- Shows an example of accessing the data frame in the Spark context
The listing is as follows:
Our standard includes for Spark:
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
Read in the JSON and display what we found:
# using data from https://gist.github.com/marktyers/678711152b8dd33f6346
df = spark.read.json("people.json")
df.show()
I had a difficult time getting standard JSON to load into Spark. By default, Spark expects one complete JSON record per line of the file (the JSON Lines format), whereas most JSON I have seen pretty-prints each record across multiple lines with indentation.
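The difference can be sketched with the standard library alone; the records and file contents here are hypothetical:

```python
# Spark's default JSON reader wants JSON Lines: one complete record
# per line. Most hand-written JSON is a pretty-printed array instead.
import json

records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# What Spark's default reader expects -- one JSON object per line:
json_lines = "\n".join(json.dumps(r) for r in records)

# What most hand-written JSON looks like -- an indented array:
pretty = json.dumps(records, indent=2)

print(json_lines)
print(pretty)
```

If you do have a pretty-printed file, newer Spark versions (2.2+) can read it with `spark.read.json("people.json", multiLine=True)`.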
Display the interpreted schema for the data:
df.printSchema()
The default for all columns is nullable. You can change an attribute of a column (its name, for example), but you cannot change the values in a column in place: Spark data frames are immutable, so any transformation produces a new data frame.
Move the data frame into the context and access it from there:
df.registerTempTable("people")
spark.sql("select name from people").show()
At this point, the people table works like any other temporary SQL table in Spark.