Spark text file analysis

In this example, we will look through a news article and extract some basic information from it.

We will be using the following script against the 2600raid news article (from http://newsitem.com/):

import pyspark

# Initialize a SparkContext if one is not already running (as in a notebook)
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the article, merge each partition's lines into one string,
# and split on periods to approximate sentences
sentences = sc.textFile('2600raid.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(), "sentences")

# Split each sentence into words and emit every adjacent word pair with a count of 1
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x) - 1)])
print(bigrams.count(), "bigrams")

# Sum the counts per pair, swap to (count, pair), and sort descending by count
frequent_bigrams = bigrams.reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)
frequent_bigrams.take(10)

The code reads in the article and splits it into rough sentences by breaking the text on periods (the glom() call first merges each partition's lines into a single string, so sentences that span line breaks are not cut apart). From there, the code maps out the bigrams present; a bigram is a pair of words that appear next to each other. We then sum the occurrences of each bigram, sort the list by count, and take the ten most prevalent pairs.
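To make the counting step concrete, here is a minimal plain-Python sketch of the same bigram logic on a made-up snippet (the sample text is illustrative, not taken from the article):

from collections import Counter

# Hypothetical sample text standing in for the article contents
text = "The guards chased us through the mall. The mall closed early."

# Same approach as the Spark pipeline: split into sentences on periods,
# split each sentence into words, and count adjacent word pairs
counts = Counter()
for sentence in text.split("."):
    words = sentence.split()
    for i in range(len(words) - 1):
        counts[(words[i], words[i + 1])] += 1

# Analogous to frequent_bigrams.take(10) in the Spark version
print(counts.most_common(10))

Note that, like the Spark script, this makes no attempt to lowercase words or strip punctuation, so 'The' and 'the' count as different words.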

When we run this in a notebook, the output shows the sentence count, the bigram count, and the ten most frequent pairs.

I really had no idea what to expect from the output. It's curious that you can glean some insight into the article from the results alone: 'the' and 'mall' appear together 15 times, and 'the' and 'guards' appear together 11 times, so a raid must have occurred in a mall and involved the security guards in some manner!
