Understanding topic modeling

A topic model is a statistical model of the topics in a collection of documents. The assumption is that if 10 percent of a document talks about the military and 40 percent of it talks about the economy (and 50 percent talks about other things), then the document should contain roughly four times as many words about the economy as about the military.

An early form of topic modeling was described by Christos Papadimitriou and others in their 1998 paper, Latent Semantic Indexing: A probabilistic analysis (http://www.cs.berkeley.edu/~christos/ir.ps). This was refined by Thomas Hofmann in 1999 with Probabilistic Latent Semantic Indexing (http://www.cs.brown.edu/~th/papers/Hofmann-SIGIR99.pdf).

In 2003, David Blei, Andrew Ng, and Michael I. Jordan published their paper, Latent Dirichlet Allocation (http://jmlr.csail.mit.edu/papers/v3/blei03a.html). Currently, this is the most common type of topic modeling. It's simple, easy to get started with, and widely available, and most work in the field since then has developed extensions to the original LDA method. This is the procedure that we'll learn about and use in this chapter.

In LDA, each document is modeled as a bag of words drawn from a mixture of topics, so each word in the document is the result of one of those topics. The model takes the following steps to create each document:

  1. Select a distribution over topics for the document.
  2. Select a distribution over words for each topic.
  3. For each word in the document, select a topic from the document's topic distribution, and then select a word from that topic's word distribution.
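The generative story in these steps can be sketched in a few lines of Python. This is a toy illustration, not the implementation we'll use in this chapter; the topic-word distributions and the Dirichlet parameters are inputs you would supply, and step 2 is represented here by the `topic_word_dists` argument:

```python
import random

def sample_dirichlet(alpha):
    # A Dirichlet draw: normalize independent Gamma(a, 1) samples.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topic_word_dists, alpha):
    # Step 1: draw this document's distribution over topics.
    topic_mixture = sample_dirichlet(alpha)
    topics = range(len(topic_word_dists))
    words = range(len(topic_word_dists[0]))
    doc = []
    for _ in range(n_words):
        # Step 3: pick a topic, then a word from that topic's
        # word distribution (the step 2 input).
        t = random.choices(topics, weights=topic_mixture)[0]
        w = random.choices(words, weights=topic_word_dists[t])[0]
        doc.append(w)
    return doc
```

Running `generate_document(50, topic_words, [0.5, 0.5])` with two topic-word distributions produces a 50-word "document" of word IDs, each traceable to one topic.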

The distributions over topics and words use a Dirichlet distribution as their prior probability, which represents the assumed uncertainty about how topics and words are distributed before considering any evidence or documents. As the model is trained on a set of input documents, these distributions come to reflect the data seen so far, and so the model is able to more accurately categorize future documents.
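To get a feel for what the Dirichlet prior controls, we can compare draws at two concentration settings (a standard-library sketch; the parameter values here are arbitrary). Small parameters produce sparse mixtures, where most of the probability mass lands on a few topics; large parameters produce nearly even mixtures:

```python
import random

def sample_dirichlet(alpha):
    # A Dirichlet draw: normalize independent Gamma(a, 1) samples.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, n_samples=2000):
    # Average, over many draws, of the largest topic's share.
    return sum(max(sample_dirichlet(alpha))
               for _ in range(n_samples)) / n_samples

random.seed(0)
sparse = mean_largest_share([0.1] * 5)   # small alpha: peaked mixtures
even = mean_largest_share([10.0] * 5)    # large alpha: near-uniform mixtures
```

Here `sparse` comes out far larger than `even`: with a small concentration parameter, a typical document leans heavily on one or two topics, which matches how real documents behave.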

A short example may be helpful. Initially, the distributions are picked randomly; then we train on a document. Say the document contains the words budget, spending, army, navy, plane, soldier, and dollars. The model knows from previous training that budget, spending, and dollars all relate to a topic on finance; that army, navy, plane, and soldier relate to a topic on the military; and that plane also relates to one on travel. This may suggest that the document is 35 percent about finance, 50 percent about the military, and 10 percent about travel, with the small remainder spread over other topics. The military would be the dominant topic, but other topics would be represented as well.
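The word-to-topic associations in this example might be tabulated as follows. The weights are invented for illustration (and won't reproduce the exact 35/50/10 split); in a trained model they would be learned probabilities, not hand-picked numbers:

```python
# Hypothetical word-topic weights, made up for this example.
word_topics = {
    "budget":   {"finance": 1.0},
    "spending": {"finance": 1.0},
    "dollars":  {"finance": 1.0},
    "army":     {"military": 1.0},
    "navy":     {"military": 1.0},
    "soldier":  {"military": 1.0},
    "plane":    {"military": 0.6, "travel": 0.4},  # an ambiguous word
}

def topic_proportions(doc):
    # Sum each word's topic weights, then normalize to proportions.
    scores = {}
    for word in doc:
        for topic, weight in word_topics.get(word, {}).items():
            scores[topic] = scores.get(topic, 0.0) + weight
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}

doc = ["budget", "spending", "army", "navy", "plane", "soldier", "dollars"]
proportions = topic_proportions(doc)
```

With these weights, the military comes out as the dominant topic, finance a close second, and travel a sliver, mirroring the mixture described above.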

If LDA is in its training phase, then the presence of those words slightly strengthens the associations between the words listed, between those words and the other words in the document, and between those words and the topics that tie them together.
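One common way to make this training loop concrete is collapsed Gibbs sampling, a widely used alternative to the variational method in the original 2003 paper. The sketch below (stdlib only; the hyperparameter values are arbitrary defaults) repeatedly resamples each word's topic assignment; "strengthening an association" amounts to updating the count tables that link documents, words, and topics:

```python
import random

def gibbs_lda(docs, n_topics, n_iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA: a minimal sketch."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    word_id = {w: i for i, w in enumerate(vocab)}
    # z[d][i] is the topic currently assigned to word i of document d.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    # Count tables: document-topic, topic-word, and total words per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * v for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][word_id[w]] += 1
            topic_total[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wid = z[d][i], word_id[w]
                # Remove this word's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][wid] -= 1
                topic_total[t] -= 1
                # Resample its topic from the conditional distribution:
                # how much the document likes each topic, times how much
                # each topic likes this word.
                weights = [
                    (doc_topic[d][k] + alpha) *
                    (topic_word[k][wid] + beta) / (topic_total[k] + v * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][wid] += 1
                topic_total[t] += 1
    return z, doc_topic, topic_word, vocab
```

Each pass nudges words that co-occur toward sharing topics, which is exactly the strengthening described above.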

One twist to this is that the topics aren't named. In the previous example, I said that there were topics about finance, the military, and travel. However, LDA would see those as topics 1, 2, and 3. The labels are interpretations I would give based on the terms in those topics and the documents that scored high in them. One of the tasks when using LDA is investigating and interpreting the topics. We'll see several examples of this at the end of the chapter when we explore the results of our analysis.
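In practice, interpreting a topic usually starts from its highest-probability words. Given a topic-word weight table (the numbers here are invented for illustration), pulling the top terms per topic is straightforward:

```python
def top_words(topic_word_weights, n=3):
    # For each topic, sort its word weights and keep the n largest.
    return {
        topic: [w for w, _ in sorted(weights.items(),
                                     key=lambda kv: kv[1],
                                     reverse=True)[:n]]
        for topic, weights in topic_word_weights.items()
    }

# Hypothetical trained weights for two unnamed topics. A label such as
# "finance" or "military" would be our interpretation, not the model's.
weights = {
    "topic 1": {"budget": 0.30, "spending": 0.25,
                "dollars": 0.20, "plane": 0.05},
    "topic 2": {"army": 0.28, "navy": 0.24,
                "soldier": 0.22, "plane": 0.10},
}
tops = top_words(weights)
```

Seeing budget, spending, and dollars at the head of topic 1, we might label it "finance"; the model itself only ever calls it topic 1.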
