Using the same script with a minor modification, we can make one more call and sort the results. The script now looks as follows:
import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

# load in the file
text_file = sc.textFile("Spark Sort Words from File.ipynb")

# split the file into words, count occurrences, and sort by word
sorted_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortByKey()

# print out words found (in sorted order)
for x in sorted_counts.collect():
    print(x)
Here, we have appended one more transformation to the RDD pipeline, sortByKey(). So, after we have mapped and reduced to arrive at a set of (word, count) pairs, we can easily sort the results alphabetically by word.
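To see what each stage contributes, the same pipeline can be sketched in plain Python without a Spark context. This is only an illustration of the logic, not Spark's implementation; the sample lines here are hypothetical stand-ins for the file contents:

```python
from collections import Counter

# hypothetical sample lines standing in for the file contents
lines = ["spark makes counting easy", "counting words with spark"]

# flatMap: split each line into individual words
words = [w for line in lines for w in line.split(" ")]

# map + reduceByKey: count occurrences of each word
counts = Counter(words)

# sortByKey: order the (word, count) pairs alphabetically by word
sorted_counts = sorted(counts.items())

for word, count in sorted_counts:
    print(word, count)
```

Each comment marks the step that corresponds to one Spark transformation; the difference is that Spark performs these steps in parallel across partitions of the file.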
The resulting output looks like this: