Sorted word count

Using the same script with a minor modification, we can make one more call and sort the results. The script now looks as follows:

import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

# load in the file
text_file = sc.textFile("Spark Sort Words from File.ipynb")

# split the file into words, count each word, and sort the pairs by word
sorted_counts = (text_file.flatMap(lambda line: line.split(" "))
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b)
                 .sortByKey())

# print out the words found (in sorted order)
for x in sorted_counts.collect():
    print(x)

Here, we have added one more function call to the RDD pipeline, sortByKey(). So, after we have mapped and reduced the file down to a list of words and their occurrence counts, we can easily sort the results by word.

The resulting output is the full list of (word, count) pairs, printed in ascending order by word.
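sortByKey() orders the pairs alphabetically by the word itself. If you would rather rank words by how often they occur, the same pipeline can sort on the count instead. The following is a minimal sketch under that assumption; the variable name counts_by_frequency is illustrative, and it reuses the text_file RDD and SparkContext from the script above:

# sort the (word, count) pairs by count, most frequent first
counts_by_frequency = (text_file.flatMap(lambda line: line.split(" "))
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b)
                       .sortBy(lambda pair: pair[1], ascending=False))

# print the ten most frequent words
for word, count in counts_by_frequency.take(10):
    print(word, count)

sortBy() takes a key function, so the same approach works for any ordering you can express as a function of the (word, count) pair.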
