We'll start off with our boilerplate Spark code that creates a local SparkConf and a SparkContext, from which we can then create our initial RDD.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("SparkTFIDF")
sc = SparkContext(conf = conf)
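As a quick aside, if you'd rather have Spark use every core on your machine instead of a single thread, the standard way to do that is to pass local[*] as the master; nothing else in this example changes:

# Optional: run on all available local cores instead of a single thread.
conf = SparkConf().setMaster("local[*]").setAppName("SparkTFIDF")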
Next, we're going to use our SparkContext to create an RDD from subset-small.tsv.
rawData = sc.textFile("e:/sundog-consult/Udemy/DataScience/subset-small.tsv")
This is a file containing tab-separated values, and it represents a small sample of Wikipedia articles. Again, you'll need to change the path in the preceding code to wherever you installed the course materials for this book.
That gives me back an RDD with one document on each line. The TSV file contains one entire Wikipedia article per line, and I know that each of those articles is split up into tab-separated fields containing various bits of metadata about it.
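If you want to sanity-check that before going further, calling first() on the raw RDD will show you one of those tab-separated lines; I'm truncating the output here, since a whole article would flood the console:

# Peek at the first raw line (truncated); the fields are separated by tabs.
print(rawData.first()[:200])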
The next thing I'm going to do is split those up:
fields = rawData.map(lambda x: x.split("\t"))
I'm going to split up each document on its tab delimiters into a Python list, and create a new fields RDD that, instead of the raw input data, now contains a Python list of the fields in that input data.
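To make that concrete, here's a tiny plain-Python illustration of what that split does to a single tab-separated line; the field layout in this sample string is made up purely for illustration:

# A made-up tab-separated line, just to show how split("\t") behaves.
sample = "12345\tSome Article\thttp://example.org\tThe article body goes here"
print(sample.split("\t"))  # ['12345', 'Some Article', 'http://example.org', 'The article body goes here']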
Next, I'm going to map that data: take in each list of fields, extract field number three, x[3], which I happen to know is the body of the article itself (the actual article text), and in turn split that based on spaces:
documents = fields.map(lambda x: x[3].split(" "))
What x[3] does is extract the body of the text from each Wikipedia article, which we then split up into a list of words. My new documents RDD has one entry for every document, and each entry contains a list of the words that appear in that document. We'll also want to know what to call these documents later on, when we're evaluating the results.
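If you want to see what one of those entries looks like, something like this will print the first ten words of the first document:

# Show the first ten words of the first document's word list.
print(documents.first()[:10])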
I'm also going to create a new RDD that stores the document names:
documentNames = fields.map(lambda x: x[1])
All that does is take that same fields RDD and use this map function to extract the document name, which I happen to know is in field number one.
So, I now have two RDDs: documents, which contains a list of the words that appear in each document, and documentNames, which contains the name of each document. I also know that they're in the same order, so I can pair them up later on to look up the name of a given document.
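As a minimal sketch of that lookup, since both RDDs were derived from the same fields RDD by map operations, their elements line up one-to-one, and Spark's standard zip() operation can pair them:

# Pair each document's name with its word list; zip() requires both RDDs
# to have the same number of partitions and elements per partition, which
# holds here because both came from the same parent RDD via map.
namesAndDocs = documentNames.zip(documents)
name, words = namesAndDocs.first()
print(name, words[:5])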