Python testing of Spark is very similar in concept, but the testing libraries are a bit different. PySpark uses both doctest
and unittest
to test itself. The doctest
library makes it easy to create tests based on the expected output of code run in the Python interpreter. We can run the tests by running pyspark -m doctest [path to code]
. By taking the wordcount.py
example from Spark and factoring out countWords
, you can test the word count functionality using doctest
:
""" >>> from pyspark.context import SparkContext >>> sc = SparkContext('local', 'test') >>> b = sc.parallelize(["pandas are awesome", "and ninjas are also awesome"]) >>> countWords(b) [('also', 1), ('and', 1), ('are', 2), ('awesome', 2), ('ninjas', 1), ('pandas', 1)] """ import sys from operator import add from pyspark import SparkContext def countWords(lines): counts = lines.flatMap(lambda x: x.split(' ')) .map(lambda x: (x, 1)) .reduceByKey(add) return sorted(counts.collect()) if __name__ == "__main__": if len(sys.argv) < 3: print >> sys.stderr, "Usage: PythonWordCount<master> <file>" exit(-1) sc = SparkContext(sys.argv[1], "PythonWordCount") lines = sc.textFile(sys.argv[2], 1) output = countWords(lines) for (word, count) in output: print "%s : %i" % (word, count)
We can also test something similar to our Java and Scala programs like so:
""" >>> from pyspark.context import SparkContext >>> sc = SparkContext('local', 'test') >>> b = sc.parallelize(["1,2","1,3"]) >>> handleInput(b) [3, 4] """ import sys from operator import add from pyspark import SparkContext def handleInput(lines): data = lines.map(lambda x: sum(map(int, x.split(',')))) return sorted(data.collect()) if __name__ == "__main__": if len(sys.argv) < 3: print >> sys.stderr, "Usage: PythonLoadCsv<master> <file>" exit(-1) sc = SparkContext(sys.argv[1], "PythonLoadCsv") lines = sc.textFile(sys.argv[2], 1) output = handleInput(lines) for sum in output: print sum
18.226.4.191