Spark text file analysis

In this example, we will look through a news article and extract some basic information from it.

We will be using the following script against the 2600raid news article (from http://newsitem.com/):

import pyspark

# Initialize a SparkContext if one is not already running (as in a notebook)
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the article, merge each partition's lines into one string,
# and split on periods to approximate sentences
sentences = sc.textFile('2600raid.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(), "sentences")

# Split each sentence into words and emit every adjacent word pair with a count of 1
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x) - 1)])
print(bigrams.count(), "bigrams")

# Sum the counts per pair, swap to (count, pair), and sort descending by count
frequent_bigrams = bigrams.reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)
frequent_bigrams.take(10)

The code reads in the article and splits it into rough sentences by breaking the text on periods (the glom() call first merges each partition's lines into a single string, so sentences that span line breaks are not cut apart). From there, the code maps out the bigrams present; a bigram is a pair of words that appear next to each other. We then sum the occurrences of each bigram, sort the list by count, and take the ten most prevalent pairs.
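To make the counting step concrete, here is a minimal plain-Python sketch of the same bigram logic on a made-up snippet (the sample text is illustrative, not taken from the article):

from collections import Counter

# Hypothetical sample text standing in for the article contents
text = "The guards chased us through the mall. The mall closed early."

# Same approach as the Spark pipeline: split into sentences on periods,
# split each sentence into words, and count adjacent word pairs
counts = Counter()
for sentence in text.split("."):
    words = sentence.split()
    for i in range(len(words) - 1):
        counts[(words[i], words[i + 1])] += 1

# Analogous to frequent_bigrams.take(10) in the Spark version
print(counts.most_common(10))

Note that, like the Spark script, this makes no attempt to lowercase words or strip punctuation, so 'The' and 'the' count as different words.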

When we run this in a notebook, the output shows the sentence count, the bigram count, and the ten most frequent pairs.

I really had no idea what to expect from the output. It's curious that you can glean some insight into the article from the results alone: 'the' and 'mall' appear together 15 times, and 'the' and 'guards' appear together 11 times, so a raid must have occurred in a mall and involved the security guards in some manner!
