Using Spark to analyze data

The first thing to do to access Spark is to create a SparkContext. The SparkContext initializes the Spark runtime and sets up any access to Hadoop that may be needed, if you are using Hadoop as well.
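
If you need to control the configuration yourself (for example, the application name or the master URL), you can build the context from a SparkConf. This is a minimal sketch; the application name and the local master setting are illustrative choices, not part of the example that follows:

from pyspark import SparkConf, SparkContext

# Illustrative settings: name the application and run locally on all cores
conf = SparkConf().setAppName("line-length-example").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)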

Older Spark SQL code used a SQLContext as its entry point, but that has been superseded (since Spark 2.0) by the more general SparkSession; for the RDD operations shown here, the SparkContext remains the entry point.
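
On Spark 2.0 or later, you would typically obtain a SparkSession first and take the SparkContext from it. A minimal sketch, with an illustrative application name:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the SparkContext is exposed as a property
spark = SparkSession.builder.appName("line-length-example").getOrCreate()
sc = spark.sparkContext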

As a simple example, we can read through a text file and total the lengths of its lines:

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Read the notebook file itself, one RDD element per line
lines = sc.textFile("B05238_04 Spark Total Line Lengths.ipynb")
# Map each line to its length, then reduce to the total
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
print(totalLength)

In this example:

  • We obtain a SparkContext
  • With the context, we read in a file (the Jupyter notebook file for this example)
  • We use Spark's map transformation to compute the length of each line
  • We use Spark's reduce action to add up the lengths of all the lines
  • We display our result
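
If you want to try the same map and reduce pattern without the notebook file on disk, you can build the RDD from an in-memory list instead. This is a small sketch; the sample strings are made up for illustration:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Made-up sample lines standing in for the contents of a file
sample = ["first line", "a somewhat longer second line", "third"]

lines = sc.parallelize(sample)                        # distribute the list as an RDD
lineLengths = lines.map(lambda s: len(s))             # length of each line
totalLength = lineLengths.reduce(lambda a, b: a + b)  # sum of all the lengths
print(totalLength)                                    # 44 for these sample strings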

Under Jupyter, running the notebook shows the code cell followed by the printed total line length.
