LDA

LDA stands for Latent Dirichlet Allocation, and it's one of the most widely used techniques for analyzing collections of textual documents.

Note

LDA is an acronym that is also used for another technique, Linear Discriminant Analysis, which is a supervised method for classification. Pay attention to the context in which the acronym is used, as there is no connection between the two algorithms.

A full mathematical explanation of LDA would require knowledge of probabilistic modeling, which is beyond the scope of this practical book. Here, instead, we will give you the most important intuitions behind the model and show how to apply it practically to a massive dataset.

First of all, LDA is used in a branch of data science named text mining, where the focus is on building learners that understand natural language from textual examples. Specifically, LDA belongs to the category of topic-modeling algorithms, as it tries to model the topics contained in a document. Ideally, LDA is able to understand whether a document is about finance, politics, or religion, for example. However, unlike a classifier, it is also able to quantify the presence of each topic in a document. For example, let's think about a Harry Potter novel by Rowling. A classifier would be able to assess its category (fantasy novel); LDA, instead, is able to tell how much comedy, drama, mystery, romance, and adventure is in there. Moreover, LDA doesn't require any labels; it's an unsupervised method that internally builds the output categories (or topics) and their composition (that is, the set of words that make up each topic).

During the processing, LDA builds both a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions. Although the complexity is high, the processing time needed to output stable results is not all that long, thanks to an iterative, Monte Carlo-like core routine.

The LDA model is easy to understand: each document is modeled as a distribution over topics, and each topic is modeled as a distribution over words. Both distributions are assumed to have a Dirichlet prior (with different parameters, since the number of words per topic is usually different from the number of topics per document). Thanks to Gibbs sampling, the distributions don't have to be sampled directly; an accurate approximation of them is obtained iteratively. Similar results can be obtained with the variational Bayes technique, where the approximation is derived with an Expectation-Maximization approach.
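To make this concrete, here is the generative process that LDA assumes, written in standard textbook notation (K topics, D documents, N_d words in document d; the Dirichlet hyperparameters roughly correspond to the alpha and eta arguments of Gensim's LdaModel):

\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), \quad k = 1,\dots,K & \text{(word distribution of topic } k\text{)}\\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), \quad d = 1,\dots,D & \text{(topic distribution of document } d\text{)}\\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d) & \text{(topic of the } n\text{-th word in document } d\text{)}\\
w_{d,n} &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}}) & \text{(the observed word, drawn from that topic)}
\end{align*}

Inference works backwards: given only the observed words, the algorithm estimates the hidden topic distributions.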

The resulting LDA model is generative (as happens with Hidden Markov Models, Naïve Bayes, and Restricted Boltzmann Machines), therefore each variable can be simulated and observed.

Let's now see how it works on a real-world dataset, the 20 Newsgroups dataset. It's composed of a collection of e-mails exchanged in 20 different newsgroups. Let's initially load it, removing the e-mail headers, footers, and quotes from replied e-mails:

In:from sklearn.datasets import fetch_20newsgroups
documents = fetch_20newsgroups(remove=('headers', 'footers', 
            'quotes'), random_state=101).data

Check the size of the dataset (that is, how many documents), and print one of them to see what one document is actually composed of:

In:len(documents)

Out:11314

In:document_num = 9960
print documents[document_num]

Out:Help!!!

I have an ADB graphicsd tablet which I want to connect to my
Quadra 950. Unfortunately, the 950 has only one ADB port and
it seems I would have to give up my mouse.

Please, can someone help me? I want to use the tablet as well as
the mouse (and the keyboard of course!!!).

Thanks in advance.

As you can see in this example, someone is asking for help to connect a graphics tablet to the single ADB port of his computer without giving up the mouse.

Now, we import the Python packages needed to run LDA. The Gensim package is one of the best ones and, as you'll see at the end of the section, it is also very scalable:

In:import numpy as np
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
# import nltk; nltk.download('wordnet')  # run once if the WordNet corpus is missing

np.random.seed(101)

As the first step, we should clean the text. A few steps are necessary, and they are typical of any NLP text-processing pipeline:

  1. Tokenization, where the text is split into sentences and the sentences are split into words. Finally, the words are lowercased and punctuation (and accents) is removed.
  2. Words composed of three characters or fewer are removed. (This step removes most of the acronyms, emoticons, and conjunctions.)
  3. Words appearing in the list of English stopwords are removed. The words in this list are very common and have no predictive power (the, an, so, then, have, and so on).
  4. Tokens are then lemmatized; inflected forms are mapped back to their dictionary form (for example, goes, went, and gone all become go).
  5. Finally, stemming removes the inflection, reducing the word to its root (for example, shoes becomes shoe).

In the following piece of code, we will do exactly this: we try to clean the text as much as possible and list the words that compose each document. At the end of the cell, we can see how this operation changes the document seen previously:

In:lm = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

def lem_stem(text):
    return stemmer.stem(lm.lemmatize(text, pos='v'))

def tokenize_lemmatize(text):
    return [lem_stem(token)
            for token in simple_preprocess(text)
            if token not in STOPWORDS and len(token) > 3]

print tokenize_lemmatize(documents[document_num])

Out:[u'help', u'graphicsd', u'tablet', u'want', u'connect', u'quadra', u'unfortun', u'port', u'mous', u'help', u'want', u'tablet', u'mous', u'keyboard', u'cours', u'thank', u'advanc']

Now, as the next step, let's apply the cleaning steps to all the documents. After this, we have to build a Gensim dictionary, which maps every word to an integer ID and keeps track of how frequently it appears in the training set. Thanks to the Gensim package, this operation is straightforward:

In:processed_docs = [tokenize_lemmatize(doc) for doc in documents]
word_count_dict = gensim.corpora.Dictionary(processed_docs)
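If you want to peek inside the dictionary, a couple of quick checks such as the following can help (the token 'tablet' is just an illustrative choice; it may or may not survive the filtering applied next):

print(len(word_count_dict))                    # number of unique tokens in the vocabulary
print(word_count_dict.num_docs)                # number of documents processed
print(word_count_dict.token2id.get('tablet'))  # integer id assigned to a token (None if absent)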

Now, as we want to build a generic and fast solution, let's remove all the very rare and very common words. In this case, we filter out the words that appear in fewer than 20 documents or in more than 20% of the documents:

In:word_count_dict.filter_extremes(no_below=20, no_above=0.2)

As the next step, with such a reduced set of words, we can now build the bag-of-words model of each document; that is, for each document, we create a list that reports which words appear in it and how many times:

In:bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) 
for pdoc in processed_docs]

As an example, let's have a peek at the bag-of-words model of the preceding document:

In:bow_doc1 = bag_of_words_corpus[document_num]

for i in range(len(bow_doc1)):
    print "Word {} (\"{}\") appears {} time[s]".format(
        bow_doc1[i][0],
        word_count_dict[bow_doc1[i][0]],
        bow_doc1[i][1])

Out:Word 178 ("want") appears 2 time[s]
Word 250 ("keyboard") appears 1 time[s]
Word 833 ("unfortun") appears 1 time[s]
Word 1037 ("port") appears 1 time[s]
Word 1142 ("help") appears 2 time[s]
Word 1543 ("quadra") appears 1 time[s]
Word 2006 ("advanc") appears 1 time[s]
Word 2124 ("cours") appears 1 time[s]
Word 2391 ("thank") appears 1 time[s]
Word 2898 ("mous") appears 2 time[s]
Word 3313 ("connect") appears 1 time[s]

Now, we have arrived at the core part of the algorithm: running LDA. Somewhat arbitrarily, let's ask for 12 topics (there are 20 different newsgroups, but some of them are similar):

In:lda_model = gensim.models.LdaMulticore(bag_of_words_corpus, 
            num_topics=12, id2word=word_count_dict, passes=50)

Note

If you get an error with this code, try the single-process version of the class, gensim.models.LdaModel, instead of gensim.models.LdaMulticore.
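Should LdaMulticore fail on your setup, the equivalent single-process call simply mirrors it with the same corpus and parameters:

lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=12,
                                   id2word=word_count_dict, passes=50)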

Let's now print the topic composition, that is, the words appearing in each topic and their relative weight:

In:for idx, topic in lda_model.print_topics(-1):
    print "Topic:{} Word composition:{}".format(idx, topic)
    print

Out:
Topic:0 Word composition:0.015*imag + 0.014*version + 0.013*avail + 0.013*includ + 0.013*softwar + 0.012*file + 0.011*graphic + 0.010*program + 0.010*data + 0.009*format

Topic:1 Word composition:0.040*window + 0.030*file + 0.018*program + 0.014*problem + 0.011*widget + 0.011*applic + 0.010*server + 0.010*entri + 0.009*display + 0.009*error

Topic:2 Word composition:0.011*peopl + 0.010*mean + 0.010*question + 0.009*believ + 0.009*exist + 0.008*encrypt + 0.008*point + 0.008*reason + 0.008*post + 0.007*thing

Topic:3 Word composition:0.010*caus + 0.009*good + 0.009*test + 0.009*bike + 0.008*problem + 0.008*effect + 0.008*differ + 0.008*engin + 0.007*time + 0.006*high

Topic:4 Word composition:0.018*state + 0.017*govern + 0.015*right + 0.010*weapon + 0.010*crime + 0.009*peopl + 0.009*protect + 0.008*legal + 0.008*control + 0.008*drug

Topic:5 Word composition:0.017*christian + 0.016*armenian + 0.013*jesus + 0.012*peopl + 0.008*say + 0.008*church + 0.007*bibl + 0.007*come + 0.006*live + 0.006*book

Topic:6 Word composition:0.018*go + 0.015*time + 0.013*say + 0.012*peopl + 0.012*come + 0.012*thing + 0.011*want + 0.010*good + 0.009*look + 0.009*tell

Topic:7 Word composition:0.012*presid + 0.009*state + 0.008*peopl + 0.008*work + 0.008*govern + 0.007*year + 0.007*israel + 0.007*say + 0.006*american + 0.006*isra

Topic:8 Word composition:0.022*thank + 0.020*card + 0.015*work + 0.013*need + 0.013*price + 0.012*driver + 0.010*sell + 0.010*help + 0.010*mail + 0.010*look

Topic:9 Word composition:0.019*space + 0.011*inform + 0.011*univers + 0.010*mail + 0.009*launch + 0.008*list + 0.008*post + 0.008*anonym + 0.008*research + 0.008*send

Topic:10 Word composition:0.044*game + 0.031*team + 0.027*play + 0.022*year + 0.020*player + 0.016*season + 0.015*hockey + 0.014*leagu + 0.011*score + 0.010*goal

Topic:11 Word composition:0.075*drive + 0.030*disk + 0.028*control + 0.028*scsi + 0.020*power + 0.020*hard + 0.018*wire + 0.015*cabl + 0.013*instal + 0.012*connect

Unfortunately, LDA doesn't provide a name for each topic; we have to do it manually ourselves, based on our interpretation of the results of the algorithm. After having carefully examined the word compositions, we can name the discovered topics as follows:

Topic    Name
0        Software
1        Applications
2        Reasoning
3        Transports
4        Government
5        Religion
6        People actions
7        Middle-East
8        PC Devices
9        Space
10       Games
11       Drives

Let's now try to understand what topics are represented in the preceding document and their weights:

In:for index, score in sorted(
        lda_model[bag_of_words_corpus[document_num]],
        key=lambda tup: -1*tup[1]):
    print "Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 10))

Out:Score: 0.938887758964     Topic: 0.022*thank + 0.020*card + 0.015*work + 0.013*need + 0.013*price + 0.012*driver + 0.010*sell + 0.010*help + 0.010*mail + 0.010*look

The highest score is associated with the topic we named PC Devices. Based on our previous knowledge of the collection of documents, it seems that the topic extraction has performed quite well.

Now, let's evaluate the model as a whole. The perplexity (or its logarithm) provides us with a metric to understand how well LDA has performed on the training dataset:

In:print "Log perplexity of the model is", lda_model.log_perplexity(bag_of_words_corpus)

Out:Log perplexity of the model is -7.2985188569

In this case, the log perplexity is about -7.3; since Gensim reports the per-word likelihood bound, the corresponding perplexity is 2^7.3 (roughly 157). The measure is connected to the (log) likelihood that the LDA model is able to generate the documents of the corpus, given their distribution of topics. The lower the perplexity, the better the model, because it basically means that the model can regenerate the text quite well.
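If you prefer to reason about the perplexity itself rather than its logarithm, you can convert the value by hand. This is a small sketch, assuming (as Gensim documents) that log_perplexity returns the per-word likelihood bound and that perplexity = 2**(-bound):

import numpy as np

bound = lda_model.log_perplexity(bag_of_words_corpus)  # per-word likelihood bound
perplexity = np.exp2(-bound)                           # perplexity = 2**(-bound)
print(perplexity)                                      # roughly 157 for a bound of -7.3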

Now, let's try to use the model on an unseen document. For simplicity, the document contains only the sentences, "Golf or tennis? Which is the best sport to play?":

In:unseen_document = "Golf or tennis? Which is the best sport to play?"

bow_vector = word_count_dict.doc2bow(
    tokenize_lemmatize(unseen_document))
for index, score in sorted(lda_model[bow_vector],
                           key=lambda tup: -1*tup[1]):
    print "Score: {}\t Topic: {}".format(score,
        lda_model.print_topic(index, 5))

Out:Score: 0.610691655136     Topic: 0.044*game + 0.031*team + 0.027*play + 0.022*year + 0.020*player

Score: 0.222640440339     Topic: 0.018*state + 0.017*govern + 0.015*right + 0.010*weapon + 0.010*crime

As expected, the topic with the highest score is the one we named "Games", followed by other topics with relatively smaller scores.

How does LDA scale with the size of the corpus? Fortunately, very well; the algorithm is iterative and allows online learning, similar to mini-batch learning. The key to the online process is the .update() method offered by LdaModel (and LdaMulticore).

We will run this test on a subset of the original corpus composed of its first 1,000 documents, and we will update our LDA model with batches of 50, 100, 200, and 500 documents. For each mini-batch update of the model, we will record the elapsed time, and finally we will plot the timings on a graph:

In:import time

small_corpus = bag_of_words_corpus[:1000]
batch_times = {}

for batch_size in [50, 100, 200, 500]:
    print "batch_size =", batch_size
    tik0 = time.time()
    lda_model = gensim.models.LdaModel(num_topics=12, 
                id2word=word_count_dict)
    batch_times[batch_size] = []

    for i in range(0, len(small_corpus), batch_size):
        lda_model.update(small_corpus[i:i+batch_size],
                         update_every=25,
                         passes=1+500/batch_size)
        batch_times[batch_size].append(time.time() - tik0)

Out:batch_size = 50
batch_size = 100
batch_size = 200
batch_size = 500

Note that we've set the update_every and passes parameters in the model update. This is necessary to make the model converge at each iteration rather than return a non-converged model. Also note that the 500 used to compute the number of passes has been chosen heuristically; if you set it lower, you'll get many warnings from Gensim about the non-convergence of the model.

Let's now plot the results:

In:import matplotlib.pyplot as plt

plt.plot(range(50, 1001, 50), batch_times[50], 'g',
         label='size 50')
plt.plot(range(100, 1001, 100), batch_times[100], 'b',
         label='size 100')
plt.plot(range(200, 1001, 200), batch_times[200], 'k',
         label='size 200')
plt.plot(range(500, 1001, 500), batch_times[500], 'r',
         label='size 500')

plt.xlabel("Training set size")
plt.ylabel("Training time")
plt.xlim([0, 1000])
plt.legend(loc=0)
plt.show()

Out: (a plot of cumulative training time versus number of documents processed, one line per batch size)

The bigger the batch, the faster the training. (Remember that big batches need fewer passes while updating the model.) On the other hand, the bigger the batch, the greater the amount of memory you need in order to store and process the corpus. Thanks to the mini-batch update method, LDA is able to scale to process corpora of millions of documents. In fact, the implementation provided by the Gensim package is able to process the whole of Wikipedia in a couple of hours on a domestic computer. If you're brave enough to try it yourself, here are the complete instructions to accomplish the task, provided by the author of the package:

https://radimrehurek.com/gensim/wiki.html

Scaling LDA – memory, CPUs, and machines

Gensim is very flexible and built to crunch big textual corpora; in fact, this library is able to scale without any modification or additional download:

  1. With the number of CPUs, allowing parallel processing on a single node (with the LdaMulticore class, as seen in the first example).
  2. With the number of observations, allowing online learning based on mini-batches. This can be achieved with the update method available in LdaModel and LdaMulticore (as shown in the previous example).
  3. Running it on a cluster, distributing the workload across the nodes of the cluster, thanks to the Python library Pyro4 and the models.lda_dispatcher (as a scheduler) and models.lda_worker (as a worker process) objects, both provided by Gensim (a rough sketch follows this list).
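As a rough sketch of the third option (not a tested recipe; the exact setup and flags are described in Gensim's distributed-computing documentation and may vary with your Gensim and Pyro4 versions), you start a Pyro4 name server, one or more lda_worker processes, and one lda_dispatcher on the machines of the cluster, and then build the model with distributed=True:

# Shell setup on the cluster (module names taken from Gensim's distributed LDA docs):
#   python -m Pyro4.naming -n 0.0.0.0 &        # name server, on one machine
#   python -m gensim.models.lda_worker &       # one or more workers, on each machine
#   python -m gensim.models.lda_dispatcher &   # dispatcher, on one machine
# Then, in Python, ask Gensim to spread the work across the registered workers:
lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=12,
                                   id2word=word_count_dict, passes=50,
                                   distributed=True)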

Beyond the classical LDA algorithm, Gensim also provides its hierarchical version, named the Hierarchical Dirichlet Process (HDP). Using this algorithm, topics follow a multilevel structure, enabling the user to understand complex corpora better (that is, corpora where some documents are generic and some are specific to a topic). This module is fairly new and, as of the end of 2015, not as scalable as classic LDA.
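If you want to experiment with it, the interface is close to the one of LdaModel; here is a minimal sketch (the exact signature of print_topics has changed between Gensim releases, so adapt it to your version):

# HDP infers the number of topics on its own; we just peek at a few of them
hdp_model = gensim.models.HdpModel(bag_of_words_corpus, id2word=word_count_dict)
for topic in hdp_model.print_topics()[:5]:
    print(topic)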
