LDA stands for Latent Dirichlet Allocation, and it's one of the widely-used techniques to analyze collections of textual documents.
A full mathematical explanation of LDA requires a background in probabilistic modeling, which is beyond the scope of this practical book. Here, instead, we will give you the most important intuitions behind the model and show how to apply it in practice to a massive dataset.
First of all, LDA is used in a branch of data science named text mining, where the focus is on building learners that understand natural language from, for instance, textual examples. Specifically, LDA belongs to the category of topic-modeling algorithms, as it tries to model the topics included in a document. Ideally, LDA is able to understand whether a document is about finance, politics, or religion, for example. However, unlike a classifier, it is also able to quantify the presence of each topic in a document. For example, let's think about a Harry Potter novel by Rowling. A classifier would be able to assess its category (fantasy novel); LDA, instead, is able to understand how much comedy, drama, mystery, romance, and adventure is in there. Moreover, LDA doesn't require any labels; it's an unsupervised method that internally builds both the output categories (the topics) and their composition (that is, the set of words making up each topic).
During the processing, LDA builds both a topics-per-document model and a words-per-topic model, modeled as Dirichlet distributions. Although the complexity is high, the processing time needed to output stable results is not all that long, thanks to an iterative Monte Carlo-like core function.
The LDA model is easy to understand: each document is modeled as a distribution over topics, and each topic is modeled as a distribution over words. Both kinds of distribution are assumed to have a Dirichlet prior (with different parameters, as the number of words per topic is usually different from the number of topics per document). Thanks to Gibbs sampling, the distributions don't have to be sampled directly; an accurate approximation of them is obtained iteratively instead. Similar results can be obtained with the variational Bayes technique, where the approximation is generated with an Expectation-Maximization approach.
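To make the iterative approximation more concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA, written with the standard library only. It is not the code Gensim runs (Gensim's LDA classes rely on a variational Bayes approach); the function name gibbs_lda, the toy corpus, and the values of alpha, beta, and iterations are all assumptions chosen for illustration.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=100, seed=101):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word IDs."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words of doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # words assigned to topic k overall
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        assignments.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            assignments[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized probability of each topic, given all other assignments.
                weights = [(doc_topic[d][t] + alpha)
                           * (topic_word[t][w] + beta)
                           / (topic_total[t] + vocab_size * beta)
                           for t in range(num_topics)]
                # Resample the topic and put the word back into the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimates of topics-per-document and words-per-topic.
    theta = [[(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
             for d, doc in enumerate(docs)]
    phi = [[(n + beta) / (topic_total[k] + vocab_size * beta) for n in topic_word[k]]
           for k in range(num_topics)]
    return theta, phi

# Toy corpus (an assumption): two documents use words 0-2, two use words 3-5.
toy_docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 3, 4, 5]]
theta, phi = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
for d, row in enumerate(theta):
    print("doc {}: {}".format(d, [round(p, 2) for p in row]))
```

Each row of theta is the estimated topic mixture of one document, and each row of phi is the estimated word distribution of one topic; both are obtained from simple counts that the sampler refines at every sweep.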
The resulting LDA model is generative (as happens with Hidden Markov Models, Naïve Bayes, and Restricted Boltzmann Machines), therefore each variable can be simulated and observed.
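Because the model is generative, we can simulate how it "writes" a document. The following is a small sketch under stated assumptions: the vocabulary, the number of topics, and the prior values are invented for illustration, and sample_dirichlet and generate_document are hypothetical helper names, not part of any library.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(topic_word, doc_alpha, length, rng):
    """Generate word IDs following the LDA generative story."""
    # 1. Draw the topic mixture of this document.
    theta = sample_dirichlet(doc_alpha, rng)
    word_ids = []
    for _ in range(length):
        # 2. Pick a topic for this word position.
        k = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from the chosen topic's word distribution.
        w = rng.choices(range(len(topic_word[k])), weights=topic_word[k])[0]
        word_ids.append(w)
    return theta, word_ids

rng = random.Random(101)
vocab = ["game", "team", "drive", "disk", "space", "launch"]  # illustrative vocabulary
num_topics = 3
# One word distribution per topic, each drawn from a Dirichlet prior.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, word_ids = generate_document(topic_word, [0.5] * num_topics, 8, rng)
print("topic mixture: {}".format([round(p, 2) for p in theta]))
print("document: {}".format(" ".join(vocab[w] for w in word_ids)))
```

Training LDA amounts to running this story backwards: given only the words, infer the topic mixtures and the word distributions that most plausibly produced them.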
Let's now see how it works on a real-world dataset: the 20 Newsgroups dataset. It's composed of a collection of e-mails exchanged in 20 newsgroups. Let's start by loading it, removing the e-mail headers, footers, and quotes from replied e-mails:
In:from sklearn.datasets import fetch_20newsgroups
documents = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'),
                               random_state=101).data
Check the size of the dataset (that is, how many documents), and print one of them to see what one document is actually composed of:
In:len(documents)
Out:11314
In:document_num = 9960
print(documents[document_num])
Out:Help!!! I have an ADB graphicsd tablet which I want to connect to my Quadra 950. Unfortunately, the 950 has only one ADB port and it seems I would have to give up my mouse. Please, can someone help me? I want to use the tablet as well as the mouse (and the keyboard of course!!!). Thanks in advance.
In this example, the poster is asking how to connect an ADB graphics tablet to a Quadra 950, which has only one ADB port, without giving up the mouse.
Now, we import the Python packages needed to run LDA. The Gensim package is one of the best ones and, as you'll see at the end of the section, it is also very scalable:
In:import numpy as np
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
np.random.seed(101)
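As the first step, we should clean the text. A few steps are necessary, as is typical of any NLP text processing: tokenizing and lowercasing the text, removing stop words and very short words (three characters or fewer), and reducing each remaining word to its root by lemmatizing and stemming it.
In the following piece of code, we will do exactly this: clean the text as much as possible and list the words composing each document. At the end of the cell, we see how this operation changes the document seen previously: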
In:lm = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

def lem_stem(text):
    # Lemmatize the token as a verb, then stem the result
    return stemmer.stem(lm.lemmatize(text, pos='v'))

def tokenize_lemmatize(text):
    # Keep tokens that are not stop words and are longer than 3 characters
    return [lem_stem(token)
            for token in gensim.utils.simple_preprocess(text)
            if token not in gensim.parsing.preprocessing.STOPWORDS
            and len(token) > 3]

print(tokenize_lemmatize(documents[document_num]))
Out:[u'help', u'graphicsd', u'tablet', u'want', u'connect', u'quadra', u'unfortun', u'port', u'mous', u'help', u'want', u'tablet', u'mous', u'keyboard', u'cours', u'thank', u'advanc']
As the next step, let's apply the cleaning steps to all the documents. After this, we have to build a dictionary that maps each word to an ID and records how often it appears in the training set. Thanks to the Gensim package, this operation is straightforward:
In:processed_docs = [tokenize_lemmatize(doc) for doc in documents]
word_count_dict = gensim.corpora.Dictionary(processed_docs)
Now, as we want to build a generic and fast solution, let's remove all the very rare and very common words. For example, we can filter out all the words that appear in fewer than 20 documents or in more than 20% of the documents:
In:word_count_dict.filter_extremes(no_below=20, no_above=0.2)
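The filtering rule can be sketched in plain Python as follows. This is only an approximation of what filter_extremes does, written under the assumption that both thresholds are applied to document frequencies (the number of documents containing a word) and that the fractional threshold is truncated to an integer; the function name and the toy documents are hypothetical, and the keep_n limit offered by Gensim is not modeled.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below, no_above):
    """Return the tokens found in at least no_below documents and in no more
    than a fraction no_above of all the documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(tokenized_docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [["drive", "disk", "help"],
            ["drive", "help"],
            ["help", "game"],
            ["help", "team", "game"]]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # prints ['drive', 'game']
```

Here "help" is dropped because it appears in every document, while "disk" and "team" are dropped because they appear in only one.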
As the next step, with such a reduced set of words, we now build the bag-of-words model for each document; that is, for each document, we create a list of (word ID, count) pairs reporting which words appear and how many times:
In:bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
As an example, let's have a peek at the bag-of-words model of the preceding document:
In:bow_doc1 = bag_of_words_corpus[document_num]
for word_id, count in bow_doc1:
    print('Word {} ("{}") appears {} time[s]'.format(
        word_id, word_count_dict[word_id], count))
Out:Word 178 ("want") appears 2 time[s]
Word 250 ("keyboard") appears 1 time[s]
Word 833 ("unfortun") appears 1 time[s]
Word 1037 ("port") appears 1 time[s]
Word 1142 ("help") appears 2 time[s]
Word 1543 ("quadra") appears 1 time[s]
Word 2006 ("advanc") appears 1 time[s]
Word 2124 ("cours") appears 1 time[s]
Word 2391 ("thank") appears 1 time[s]
Word 2898 ("mous") appears 2 time[s]
Word 3313 ("connect") appears 1 time[s]
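If the bag-of-words representation is still unclear, the following standard-library sketch mimics the idea on a toy vocabulary. The token_to_id mapping and the to_bag_of_words helper are hypothetical stand-ins for the Gensim dictionary and its doc2bow method; words missing from the vocabulary are simply ignored.

```python
from collections import Counter

def to_bag_of_words(tokens, token_to_id):
    """Return a sorted list of (word ID, count) pairs for the known tokens."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

# Toy vocabulary (an assumption); "connect" is deliberately left out.
token_to_id = {"want": 0, "keyboard": 1, "port": 2, "help": 3, "mous": 4}
tokens = ["help", "want", "connect", "port", "mous",
          "help", "want", "mous", "keyboard"]
bow = to_bag_of_words(tokens, token_to_id)
print(bow)  # prints [(0, 2), (1, 1), (2, 1), (3, 2), (4, 2)]
```

Note that the word order is lost: only the identity of each word and its count survive, which is exactly the information LDA works with.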
Now, we have arrived at the core part of the algorithm: running LDA. The number of topics is our decision; let's ask for 12 topics (there are 20 different newsgroups, but some are similar):
In:lda_model = gensim.models.LdaMulticore(bag_of_words_corpus, num_topics=12, id2word=word_count_dict, passes=50)
Let's now print the topic composition, that is, the words appearing in each topic and their relative weight:
In:for idx, topic in lda_model.print_topics(-1):
    print("Topic:{} Word composition:{}".format(idx, topic))
    print("")
Out:
Topic:0 Word composition:0.015*imag + 0.014*version + 0.013*avail + 0.013*includ + 0.013*softwar + 0.012*file + 0.011*graphic + 0.010*program + 0.010*data + 0.009*format
Topic:1 Word composition:0.040*window + 0.030*file + 0.018*program + 0.014*problem + 0.011*widget + 0.011*applic + 0.010*server + 0.010*entri + 0.009*display + 0.009*error
Topic:2 Word composition:0.011*peopl + 0.010*mean + 0.010*question + 0.009*believ + 0.009*exist + 0.008*encrypt + 0.008*point + 0.008*reason + 0.008*post + 0.007*thing
Topic:3 Word composition:0.010*caus + 0.009*good + 0.009*test + 0.009*bike + 0.008*problem + 0.008*effect + 0.008*differ + 0.008*engin + 0.007*time + 0.006*high
Topic:4 Word composition:0.018*state + 0.017*govern + 0.015*right + 0.010*weapon + 0.010*crime + 0.009*peopl + 0.009*protect + 0.008*legal + 0.008*control + 0.008*drug
Topic:5 Word composition:0.017*christian + 0.016*armenian + 0.013*jesus + 0.012*peopl + 0.008*say + 0.008*church + 0.007*bibl + 0.007*come + 0.006*live + 0.006*book
Topic:6 Word composition:0.018*go + 0.015*time + 0.013*say + 0.012*peopl + 0.012*come + 0.012*thing + 0.011*want + 0.010*good + 0.009*look + 0.009*tell
Topic:7 Word composition:0.012*presid + 0.009*state + 0.008*peopl + 0.008*work + 0.008*govern + 0.007*year + 0.007*israel + 0.007*say + 0.006*american + 0.006*isra
Topic:8 Word composition:0.022*thank + 0.020*card + 0.015*work + 0.013*need + 0.013*price + 0.012*driver + 0.010*sell + 0.010*help + 0.010*mail + 0.010*look
Topic:9 Word composition:0.019*space + 0.011*inform + 0.011*univers + 0.010*mail + 0.009*launch + 0.008*list + 0.008*post + 0.008*anonym + 0.008*research + 0.008*send
Topic:10 Word composition:0.044*game + 0.031*team + 0.027*play + 0.022*year + 0.020*player + 0.016*season + 0.015*hockey + 0.014*leagu + 0.011*score + 0.010*goal
Topic:11 Word composition:0.075*drive + 0.030*disk + 0.028*control + 0.028*scsi + 0.020*power + 0.020*hard + 0.018*wire + 0.015*cabl + 0.013*instal + 0.012*connect
Unfortunately, LDA doesn't provide a name for each topic; we have to assign the names manually ourselves, based on our interpretation of the results of the algorithm. After having carefully examined the composition, we can name the discovered topics as follows:
Topic | Name |
---|---|
0 | Software |
1 | Applications |
2 | Reasoning |
3 | Transports |
4 | Government |
5 | Religion |
6 | People actions |
7 | Middle-East |
8 | PC Devices |
9 | Space |
10 | Games |
11 | Drives |
Let's now try to understand what topics are represented in the preceding document and their weights:
In:for index, score in sorted(lda_model[bag_of_words_corpus[document_num]],
                           key=lambda tup: -1*tup[1]):
    print("Score: {} Topic: {}".format(score, lda_model.print_topic(index, 10)))
Out:Score: 0.938887758964 Topic: 0.022*thank + 0.020*card + 0.015*work + 0.013*need + 0.013*price + 0.012*driver + 0.010*sell + 0.010*help + 0.010*mail + 0.010*look
The highest score is associated with the topic PC Devices. Based on our previous knowledge of the collections of documents, it seems that the topic extraction has performed quite well.
Now, let's evaluate the model as a whole. The perplexity (or its logarithm) provides us with a metric to understand how well LDA has performed on the training dataset:
In:print("Log perplexity of the model is {}".format(
    lda_model.log_perplexity(bag_of_words_corpus)))
Out:Log perplexity of the model is -7.2985188569
In this case, the value returned is a per-word likelihood bound of -7.298; Gensim estimates the perplexity as 2 raised to minus this value, that is, 2^7.298 (roughly 157). It's connected to the (log) likelihood that the LDA model is able to generate the documents it is evaluated on (here, the training corpus), given the distribution of topics for those documents. The lower the perplexity, the better the model, because it basically means that the model can regenerate the text quite well.
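As a quick arithmetic check, the conversion from the logged value to a perplexity can be sketched as follows. It assumes that the value returned by log_perplexity is a per-word bound b and that the perplexity estimate is 2 raised to -b, which is how Gensim reports it in its log messages; the helper name is hypothetical.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into a perplexity."""
    return 2.0 ** (-per_word_bound)

bound = -7.2985188569  # value printed by the model above
perplexity = perplexity_from_bound(bound)
print("Perplexity estimate: {:.1f}".format(perplexity))
```

A bound closer to zero therefore means a lower perplexity: a bound of -7.0, for instance, corresponds to a perplexity of 128.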
Now, let's try to use the model on an unseen document. For simplicity, the document contains only the sentences "Golf or tennis? Which is the best sport to play?":
In:unseen_document = "Golf or tennis? Which is the best sport to play?"
bow_vector = word_count_dict.doc2bow(tokenize_lemmatize(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {} Topic: {}".format(score, lda_model.print_topic(index, 5)))
Out:Score: 0.610691655136 Topic: 0.044*game + 0.031*team + 0.027*play + 0.022*year + 0.020*player
Score: 0.222640440339 Topic: 0.018*state + 0.017*govern + 0.015*right + 0.010*weapon + 0.010*crime
As expected, the topic with the highest score is the one about "Games", followed by others with relatively smaller scores.
How does LDA scale with the size of the corpus? Fortunately, very well; the algorithm is iterative and allows online learning, similar to mini-batch learning. The key to the online process is the .update() method offered by LdaModel (or LdaMulticore).
We will run this test on a subset of the original corpus composed of the first 1,000 documents, and we will update our LDA model with batches of 50, 100, 200, and 500 documents. After each mini-batch update, we will record the elapsed time, and then plot the timings on a graph:
In:import time

small_corpus = bag_of_words_corpus[:1000]
batch_times = {}
for batch_size in [50, 100, 200, 500]:
    print("batch_size = {}".format(batch_size))
    tik0 = time.time()
    lda_model = gensim.models.LdaModel(num_topics=12, id2word=word_count_dict)
    batch_times[batch_size] = []
    for i in range(0, len(small_corpus), batch_size):
        # Integer division keeps the number of passes an integer
        lda_model.update(small_corpus[i:i+batch_size], update_every=25,
                         passes=1 + 500 // batch_size)
        batch_times[batch_size].append(time.time() - tik0)
Out:batch_size = 50
batch_size = 100
batch_size = 200
batch_size = 500
Note that we've set the update_every and passes parameters in the model update. This is necessary to make the model converge at each update rather than return a non-converged model. Note also that the value 500 in the passes formula has been chosen heuristically; if you set it lower, you'll get many warnings from Gensim about the non-convergence of the model.
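To see what the passes formula implies, the following sketch tabulates the number of passes for each batch size and the resulting number of document visits over the 1,000-document subset. It assumes that each call to update sweeps its batch passes times; the helper name passes_for is hypothetical.

```python
def passes_for(batch_size, budget=500):
    """Number of passes per update, as in the formula 1 + 500 / batch_size."""
    return 1 + budget // batch_size

corpus_size = 1000
for batch_size in [50, 100, 200, 500]:
    passes = passes_for(batch_size)
    num_batches = -(-corpus_size // batch_size)  # ceiling division
    # Every document belongs to one batch, and each batch is swept `passes` times.
    document_visits = passes * corpus_size
    print("batch_size={:>3}  batches={:>2}  passes={:>2}  document visits={}".format(
        batch_size, num_batches, passes, document_visits))
```

Small batches are swept many more times (11 passes for batches of 50, against 2 for batches of 500), so the total work grows as the batch size shrinks.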
Let's now plot the results:
In:import matplotlib.pyplot as plt

plt.plot(range(50, 1001, 50), batch_times[50], 'g', label='size 50')
plt.plot(range(100, 1001, 100), batch_times[100], 'b', label='size 100')
plt.plot(range(200, 1001, 200), batch_times[200], 'k', label='size 200')
plt.plot(range(500, 1001, 500), batch_times[500], 'r', label='size 500')
plt.xlabel("Training set size")
plt.ylabel("Training time")
plt.xlim([0, 1000])
plt.legend(loc=0)
plt.show()
Out:[plot of training time against training set size, one line per batch size]
The bigger the batch, the faster the training. (Remember that big batches need fewer passes while updating the model.) On the other hand, the bigger the batch, the greater the amount of memory you need in order to store and process the corpus. Thanks to the mini-batch update method, LDA is able to scale to corpora of millions of documents. In fact, the implementation provided by the Gensim package is able to scale and process the whole of Wikipedia in a couple of hours on a domestic computer. If you're brave enough to try it yourself, here are the complete instructions to accomplish the task, provided by the author of the package:
https://radimrehurek.com/gensim/wiki.html
Gensim is very flexible and built to crunch big textual corpora; in fact, this library is able to scale without any modification or additional download, in two ways: incrementally, with the update method available in LdaModel and LdaMulticore (as shown in the previous example), and in a distributed fashion, with the models.lda_dispatcher (as a scheduler) and models.lda_worker (as a worker process) objects, both provided by Gensim.
Beyond the classical LDA algorithm, Gensim also provides its hierarchical version, named Hierarchical Dirichlet Process (HDP). Using this algorithm, topics follow a multilevel structure, enabling the user to understand complex corpora better (that is, where some documents are generic and some are specific to a topic). This module is fairly new and, as of the end of 2015, not as scalable as the classic LDA.