Using Spark to analyze data

The first thing to do to access Spark is to create a SparkContext. The SparkContext initializes the Spark runtime and sets up any access to Hadoop that may be needed, if you are using Hadoop as well.
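
If you need to control the configuration yourself (for example, the application name or the master URL), you can build the context from a SparkConf. This is a minimal sketch; the application name and the local master setting are illustrative choices, not part of the example that follows:

from pyspark import SparkConf, SparkContext

# Illustrative settings: name the application and run locally on all cores
conf = SparkConf().setAppName("line-length-example").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)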

Older Spark SQL code used a SQLContext as its entry point, but that has been superseded (since Spark 2.0) by the more general SparkSession; for the RDD operations shown here, the SparkContext remains the entry point.
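
On Spark 2.0 or later, you would typically obtain a SparkSession first and take the SparkContext from it. A minimal sketch, with an illustrative application name:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the SparkContext is exposed as a property
spark = SparkSession.builder.appName("line-length-example").getOrCreate()
sc = spark.sparkContext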

As a simple example, we can read through a text file and total the lengths of its lines:

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Read the notebook file itself, one RDD element per line
lines = sc.textFile("B05238_04 Spark Total Line Lengths.ipynb")
# Map each line to its length, then reduce to the total
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
print(totalLength)

In this example:

  • We obtain a SparkContext
  • With the context, we read in a file (the Jupyter notebook file for this example)
  • We use Spark's map transformation to compute the length of each line
  • We use Spark's reduce action to add up the lengths of all the lines
  • We display our result
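
If you want to try the same map and reduce pattern without the notebook file on disk, you can build the RDD from an in-memory list instead. This is a small sketch; the sample strings are made up for illustration:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Made-up sample lines standing in for the contents of a file
sample = ["first line", "a somewhat longer second line", "third"]

lines = sc.parallelize(sample)                        # distribute the list as an RDD
lineLengths = lines.map(lambda s: len(s))             # length of each line
totalLength = lineLengths.reduce(lambda a, b: a + b)  # sum of all the lengths
print(totalLength)                                    # 44 for these sample strings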

Under Jupyter, running the notebook shows the code cell followed by the printed total line length.
