For the final Spark example, we will do some simple topic modelling using MLlib (the Spark machine learning library) on our corpus.
We will use nouns as the features for our documents. First, we will import the required classes:
from itertools import chain

from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
We will build the vocabulary from the noun word count RDD:
vocabulary = noun_word_count.map(lambda w: w[0]).collect()
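As a quick sanity check, you can inspect the size of the vocabulary and a few of its entries (the exact contents depend on your corpus; the count shown here matches the vocabSize reported later):

print(len(vocabulary))    # e.g. 2279
print(vocabulary[:5])     # a few sample nouns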
Next, we need to transform the chunks corpus into a list of nouns per document:
doc_nouns = (chunks
    # keep only the noun-phrase chunks in each document
    .map(lambda doc: [chunk for chunk in doc
                      if chunk.part_of_speech == 'NP'])
    .filter(lambda doc_chunks: len(doc_chunks) > 0)
    # flatten the chunks into a single list of words per document
    .map(lambda doc_chunks: list(chain.from_iterable(
        chunk.words for chunk in doc_chunks)))
    # keep only the noun-like words
    .map(lambda words: [word for word in words
                        if match_noun_like_pos(word.part_of_speech)])
    .filter(lambda words: len(words) > 0)
    # lower-case each noun
    .map(lambda words: [word.string.lower() for word in words]))
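You can peek at the first transformed document to verify that the pipeline is doing what we expect (the output depends on your corpus):

# Peek at the first document's extracted nouns (results vary by corpus)
print(doc_nouns.first()[:10])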
Next, we need to transform the doc_nouns RDD into a vector representation, where the length of each vector equals the size of the vocabulary and each index corresponds to the vocabulary item at that index.
So, if our vocabulary is [paris, tokyo, world] and the sentence is "Hello World! This is Paris Calling!", the sentence has the vector representation [1, 0, 1].
def get_vector_representation(nouns, vocab):
    # 1.0 at every vocabulary index whose word occurs in the document
    return Vectors.dense([
        1.0 if word in nouns else 0.0
        for word in vocab
    ])

doc_vecs = (doc_nouns
    .map(lambda nouns: get_vector_representation(set(nouns), vocabulary))
    # LDA.train expects [document id, vector] pairs
    .zipWithIndex()
    .map(lambda x: [x[1], x[0]]))
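To make the mapping concrete, here is the paris/tokyo/world example from above run through the helper as a standalone check (the inputs are hard-coded for illustration):

# Standalone check using the example vocabulary and sentence nouns
example_vocab = ['paris', 'tokyo', 'world']
example_nouns = {'hello', 'world', 'paris', 'calling'}
print(get_vector_representation(example_nouns, example_vocab))
# [1.0,0.0,1.0]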
Next, we train the LDA model:
ldaModel = LDA.train(doc_vecs, k=3)
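If you want to reuse the trained model later, it can be persisted and reloaded; a minimal sketch, assuming an active SparkContext named sc and using an illustrative path:

# Hypothetical path; adjust for your environment
ldaModel.save(sc, '/tmp/lda_model')
sameModel = LDAModel.load(sc, '/tmp/lda_model')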
Let's see the features for each topic:
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):") topics = ldaModel.topicsMatrix() for topic in range(3): print("Topic " + str(topic) + ":") topic_words = sorted(map( lambda d: (topics[d[0]][topic], d[1]), enumerate(vocabulary) ), reverse=True) for word in topic_words[:10]: print("{}: {}".format(word[1], word[0])) print '-----------'
That should give you the following output:
Learned topics (as distributions over vocab of 2279 words):
Topic 0:
state: 27.6350480347
city: 18.9516713343
mr.: 17.5439356649
president: 16.8568307883
year: 15.4074257761
committee: 14.0324129502
administration: 13.9553862346
bill: 12.960995307
election: 11.6073234867
house: 11.5578186886
-----------
Topic 1:
state: 26.203168362
administration: 18.269679186
year: 16.2404114273
president: 16.0424256301
city: 14.5047677994
bill: 13.728963992
committee: 13.6235523038
mr.: 13.6074814177
tax: 12.0525070432
states: 11.6004234735
-----------
Topic 2:
state: 27.1617836034
city: 17.5435608663
president: 16.1007435816
mr.: 15.8485829174
administration: 15.7749345795
states: 14.7379675751
year: 13.3521627966
house: 12.5135762168
election: 12.140906292
united: 11.8705878794
-----------
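As an aside, LDAModel also exposes describeTopics, which performs the per-topic sorting for you; a minimal sketch, assuming Spark 1.5 or later (the weights it returns are normalized per topic, so the numbers will differ from the raw topicsMatrix values above):

# describeTopics returns, per topic, the top term indices and their
# (normalized) weights, highest first
for i, (term_indices, weights) in enumerate(
        ldaModel.describeTopics(maxTermsPerTopic=10)):
    print("Topic " + str(i) + ":")
    for idx, weight in zip(term_indices, weights):
        print("{}: {}".format(vocabulary[idx], weight))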
That's it! We have created three topic clusters using LDA and the nouns extracted from our list of sentences, and the whole process was completely distributed and scalable!