For the final Spark example, we will do some simple topic modelling using MLlib (the Spark machine learning library) on our corpus.
We will use nouns as the features for our documents. First, we will import the required classes:
from itertools import chain

from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
We will build the vocabulary from the noun word count RDD:
vocabulary = noun_word_count.map(lambda w: w[0]).collect()
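As a quick sanity check, you can inspect the size of the vocabulary and a few of its entries (the exact contents depend on your corpus; the count shown here matches the vocabSize reported later):

print(len(vocabulary))    # e.g. 2279
print(vocabulary[:5])     # a few sample nouns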
Next, we need to transform the chunks corpus into a list of nouns per document:
doc_nouns = (chunks
    # keep only the noun-phrase chunks in each document
    .map(lambda doc: [chunk for chunk in doc
                      if chunk.part_of_speech == 'NP'])
    .filter(lambda doc_chunks: len(doc_chunks) > 0)
    # flatten the chunks into a single list of words per document
    .map(lambda doc_chunks: list(chain.from_iterable(
        chunk.words for chunk in doc_chunks)))
    # keep only the noun-like words
    .map(lambda words: [word for word in words
                        if match_noun_like_pos(word.part_of_speech)])
    .filter(lambda words: len(words) > 0)
    # lower-case each noun
    .map(lambda words: [word.string.lower() for word in words]))
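You can peek at the first transformed document to verify that the pipeline is doing what we expect (the output depends on your corpus):

# Peek at the first document's extracted nouns (results vary by corpus)
print(doc_nouns.first()[:10])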
Next, we need to transform the doc_nouns RDD into a vector representation, where the length of each vector equals the size of the vocabulary and each index corresponds to the vocabulary item at that index.
So, if our vocabulary is [paris, tokyo, world] and the sentence is "Hello World! This is Paris Calling!", the sentence has the vector representation [1, 0, 1].
def get_vector_representation(nouns, vocab):
    # 1.0 at every vocabulary index whose word occurs in the document
    return Vectors.dense([
        1.0 if word in nouns else 0.0
        for word in vocab
    ])

doc_vecs = (doc_nouns
    .map(lambda nouns: get_vector_representation(set(nouns), vocabulary))
    # LDA.train expects [document id, vector] pairs
    .zipWithIndex()
    .map(lambda x: [x[1], x[0]]))
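To make the mapping concrete, here is the paris/tokyo/world example from above run through the helper as a standalone check (the inputs are hard-coded for illustration):

# Standalone check using the example vocabulary and sentence nouns
example_vocab = ['paris', 'tokyo', 'world']
example_nouns = {'hello', 'world', 'paris', 'calling'}
print(get_vector_representation(example_nouns, example_vocab))
# [1.0,0.0,1.0]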
Next, we train the LDA model:
ldaModel = LDA.train(doc_vecs, k=3)
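If you want to reuse the trained model later, it can be persisted and reloaded; a minimal sketch, assuming an active SparkContext named sc and using an illustrative path:

# Hypothetical path; adjust for your environment
ldaModel.save(sc, '/tmp/lda_model')
sameModel = LDAModel.load(sc, '/tmp/lda_model')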
Let's see the features for each topic:
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):") topics = ldaModel.topicsMatrix() for topic in range(3): print("Topic " + str(topic) + ":") topic_words = sorted(map( lambda d: (topics[d[0]][topic], d[1]), enumerate(vocabulary) ), reverse=True) for word in topic_words[:10]: print("{}: {}".format(word[1], word[0])) print '-----------'
That should give you the following output:
Learned topics (as distributions over vocab of 2279 words):
Topic 0:
state: 27.6350480347
city: 18.9516713343
mr.: 17.5439356649
president: 16.8568307883
year: 15.4074257761
committee: 14.0324129502
administration: 13.9553862346
bill: 12.960995307
election: 11.6073234867
house: 11.5578186886
-----------
Topic 1:
state: 26.203168362
administration: 18.269679186
year: 16.2404114273
president: 16.0424256301
city: 14.5047677994
bill: 13.728963992
committee: 13.6235523038
mr.: 13.6074814177
tax: 12.0525070432
states: 11.6004234735
-----------
Topic 2:
state: 27.1617836034
city: 17.5435608663
president: 16.1007435816
mr.: 15.8485829174
administration: 15.7749345795
states: 14.7379675751
year: 13.3521627966
house: 12.5135762168
election: 12.140906292
united: 11.8705878794
-----------
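As an aside, LDAModel also exposes describeTopics, which performs the per-topic sorting for you; a minimal sketch, assuming Spark 1.5 or later (the weights it returns are normalized per topic, so the numbers will differ from the raw topicsMatrix values above):

# describeTopics returns, per topic, the top term indices and their
# (normalized) weights, highest first
for i, (term_indices, weights) in enumerate(
        ldaModel.describeTopics(maxTermsPerTopic=10)):
    print("Topic " + str(i) + ":")
    for idx, weight in zip(term_indices, weights):
        print("{}: {}".format(vocabulary[idx], weight))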
That's it! We have created three topic clusters using LDA and the nouns extracted from our list of sentences, and the whole process was completely distributed and scalable!