Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is the prototypical method for performing topic modeling. Rather unfortunately, the acronym LDA is also used for another method in machine learning, Linear Discriminant Analysis. That method is completely different from Latent Dirichlet Allocation and is commonly used for dimensionality reduction and classification. Needless to say, we will use LDA to refer to Latent Dirichlet Allocation throughout this book.

Although LDA involves a substantial amount of mathematics, it is worth exploring some of its technical details in order to understand how the model works and the assumptions that it uses. First and foremost, we should learn about the Dirichlet distribution, which lends its name to LDA.

Note

An excellent reference for a fuller treatment of Topic Models with LDA is the chapter Topic Models in the book Text Mining: Classification, Clustering, and Applications, edited by A. Srivastava and M. Sahami and published by Chapman & Hall, 2009. It can be found online at http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf.

The Dirichlet distribution

Suppose we have a classification problem with K classes and the probability of each class is fixed. Given a vector of length K containing counts of the occurrence of each class, we can estimate the probabilities of each class by just dividing each entry in the vector by the sum of all the counts.
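To make this concrete, here is a minimal sketch in Python (NumPy is an assumed dependency here, not something used elsewhere in this section) that simulates N draws among K = 3 classes with fixed probabilities and then recovers those probabilities from the observed counts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Fixed probabilities for K = 3 classes; they sum to one.
p = np.array([0.5, 0.3, 0.2])

# Simulate N = 1000 independent draws and count how often each class occurs.
N = 1000
counts = rng.multinomial(N, p)

# Estimate the class probabilities by dividing each count by the total.
print(counts, counts / counts.sum())
```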

Now suppose we would like to predict the number of times each class will occur over a series of N trials. If we have two classes, we can model this with a binomial distribution, as we would normally do in a coin flip experiment. For K classes, the binomial distribution generalizes to the multinomial distribution, where the probability of each class, pi, is fixed and the probabilities pi sum to one. Now, suppose that we wanted to model the random selection of a particular multinomial distribution with K categories. The Dirichlet distribution achieves just that. Here is its form:

$$
p(\mathbf{x} \mid \boldsymbol{\alpha}) \;=\; \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \; \prod_{k=1}^{K} x_k^{\alpha_k - 1}
$$

This equation seems complex, but if we break it down into its constituent parts and label the symbols used, we can get a better understanding. To begin with, the term x on the left-hand side is a vector with K components, xk, representing a particular multinomial distribution. The vector α is also a K-component vector, containing the K parameters, αk, of the Dirichlet distribution. Thus, we are computing the probability of selecting a particular multinomial distribution given a particular parameter combination. Notice that we provide the Dirichlet distribution with a parameter vector whose length is the same as the number of classes of the multinomial distribution that it will return.

The fraction before the large product on the right-hand side of the equation is a normalizing constant, which depends only on the values of the Dirichlet parameters and is expressed in terms of the gamma function. For completeness, the gamma function, a generalization of the factorial function, is given by the following:

$$
\Gamma(z) \;=\; \int_{0}^{\infty} t^{z-1} e^{-t} \, dt
$$

Lastly, in the final product, we see that every parameter, αk, is paired with the corresponding component of the multinomial distribution, xk, in forming the terms of the product. The important point to remember about this distribution is that by modifying the αk parameters, we are modifying the probabilities of the different multinomial distributions that we can draw.
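As a quick check of the formula, the sketch below computes this density directly from the definition using math.gamma and also draws a few multinomial distributions with NumPy's built-in Dirichlet sampler; the parameter values are arbitrary choices for illustration:

```python
import math
import numpy as np

def dirichlet_pdf(x, alpha):
    """Dirichlet density of the multinomial distribution x given parameters alpha."""
    norm = math.gamma(sum(alpha)) / math.prod(math.gamma(a) for a in alpha)
    return norm * math.prod(xk ** (ak - 1) for xk, ak in zip(x, alpha))

alpha = [2.0, 3.0, 5.0]   # K = 3 Dirichlet parameters
x = [0.2, 0.3, 0.5]       # a candidate multinomial distribution (sums to one)
print(dirichlet_pdf(x, alpha))

# Drawing multinomial distributions from the same Dirichlet distribution:
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=3)
print(samples, samples.sum(axis=1))   # each row sums to one
```

Each sampled row is itself a probability vector over the K classes, which is exactly the kind of object the generative process described later in this chapter will need.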

We are especially interested in the total sum of the αk parameters as well as their relative proportions. A large total for the αk parameters tends to produce smoother samples that mix many topics and that more closely follow the relative proportions of the αk parameters themselves.

A special case of the Dirichlet distribution is the symmetric Dirichlet distribution, in which all the αk parameters have an identical value. When the αk parameters are identical and large in value, we are likely to sample a multinomial distribution that is close to uniform. Thus, the symmetric Dirichlet distribution is used when we have no information about a preference for a particular topic distribution and we consider all topics to be equally likely.

Similarly, suppose we had a skewed vector of αk parameters with large absolute values. For example, we might have a vector in which one of the αk parameters was much higher than the others, indicating a preference for selecting one of the topics. If we used this as the input to the Dirichlet distribution, we would likely sample a multinomial distribution in which the aforementioned topic was very probable.

By contrast, if the αk parameters add up to a small number, this usually results in a peaky distribution, one in which only one or two of the topics are likely and the rest are unlikely. Consequently, if we want to model the process of drawing a multinomial with only a few topics selected, we would use low values for the αk parameters, whereas if we want a good mix, we would use larger values. The following two plots will help visualize this behavior. The first plot shows samples from a symmetric Dirichlet distribution:

[Figure: Four sampled multinomial distributions over five topics for each of three symmetric Dirichlet settings, with the αk parameters set to 0.1, 1, and 10 in the first, second, and third columns respectively]

In this plot, each column contains four random samples of a multinomial distribution over five topics, generated using a symmetric Dirichlet distribution. In the first column, all the αk parameters are set to 0.1. Note that the distributions are very peaky and, because all the αk parameters are equal, there is no preference for which topic will tend to be chosen as the highest peak.

In the middle column, the αk parameters are set to 1 and, as the sum of the parameters is now larger, we see a greater mix of topics, even if the distributions are still skewed. When we set the αk parameters to 10 in the third column, we see that the samples are now much closer to a uniform distribution.
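A plot along these lines can be reproduced with a few lines of Python; the following is a minimal sketch, with NumPy and Matplotlib as assumed dependencies:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
K, n_samples = 5, 4   # five topics, four samples per setting

fig, axes = plt.subplots(n_samples, 3, figsize=(9, 6), sharey=True)
for col, a in enumerate([0.1, 1.0, 10.0]):   # the three symmetric settings
    samples = rng.dirichlet([a] * K, size=n_samples)
    for row in range(n_samples):
        axes[row, col].bar(range(1, K + 1), samples[row])
        if row == 0:
            axes[row, col].set_title(f"alpha = {a}")
plt.tight_layout()
plt.show()
```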

In many scenarios, we use the Dirichlet distribution as a prior distribution; that is, a distribution that describes our prior beliefs about the multinomial distribution that we are trying to sample. When the sum of the αk parameters is high, we tend to think of our prior as having very strong beliefs. In the next plot, we will skew the distribution of our αk parameters to favor the first topic:

[Figure: Samples from skewed Dirichlet distributions over five topics favoring the first topic, with the average αk value increasing from left to right across the three columns]

In the first column, the average value of the αk parameters is 0.1, but we adjusted their distribution so that α1, which corresponds to the first topic, now has a value four times as high as the others. We see that this has increased the probability of the first topic featuring prominently in the sampled multinomial distribution, but it is not guaranteed to be the distribution's mode.

In the middle column, where the average of the αk parameters is now 1 but with the same skew, topic 1 is the mode of the distribution in all the samples. Additionally, there is still a high variance in how the other topics will be selected. In the third column, we have a smoother distribution that simultaneously mixes all five topics but enforces the preference for the first topic.
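The skewed setting can be simulated in the same way; in this short sketch, α1 is four times larger than the other parameters and the vector is rescaled so that its average matches each column of the plot (the third column's average of 10 is assumed by analogy with the earlier plot):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5

for avg in [0.1, 1.0, 10.0]:                   # target average alpha per column
    alpha = np.array([4.0] + [1.0] * (K - 1))  # first topic favored four to one
    alpha = alpha / alpha.mean() * avg         # rescale so the mean equals avg
    samples = rng.dirichlet(alpha, size=4)
    print(f"average alpha = {avg}:")
    print(samples.round(3))
```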

Now that we have an idea of how this distribution works, we will see in the next section how it is used in building topic models with LDA.

The generative process

We delved into the Dirichlet distribution in significant detail because it is at the heart of how topic modeling with LDA operates. Armed with this understanding, we'll now describe the generative process behind LDA.

The generative process is aptly named, as it describes how the LDA model assumes that documents, topics, and words are generated in our data. This process is essentially an illustration of the model's assumptions. The optimization procedures used to fit an LDA model to data essentially estimate the parameters of this generative process. We'll now see how this process works (a short code sketch follows the list):

  1. For each of our K topics, draw a multinomial distribution, φk, over the words in our vocabulary using a Dirichlet distribution parameterized by a vector α. This vector has length V, the size of our vocabulary. Even though we sample from the same Dirichlet distribution each time, we've seen that the sampled multinomial distributions will likely differ from each other.
  2. For every document, d, that we want to generate:
    • Determine the mix of topics for this document by drawing a multinomial distribution, θd, from a Dirichlet distribution, this time parameterized by a vector β of length K, the number of topics. Each document will thus have a different mix of topics.
    • For every word, w, in the document that we want to generate:
      • Use the multinomial topic distribution for this document, θd, to draw a topic with which this word will be associated.
      • Use that particular topic's distribution, φk, to pick the actual word.
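The following is a minimal simulation of this generative process in Python, with NumPy as an assumed dependency; the toy vocabulary, the number of topics, the document lengths, and the Dirichlet parameter values are arbitrary illustrative choices rather than anything prescribed by LDA:

```python
import numpy as np

rng = np.random.default_rng(7)

vocab = ["rain", "sun", "flight", "hotel", "gene", "cell"]   # toy vocabulary
V, K = len(vocab), 3          # vocabulary size and number of topics
n_docs, doc_len = 4, 10       # how many documents to generate and their length

alpha = np.full(V, 0.5)       # Dirichlet parameters for the topic-word distributions
beta = np.full(K, 0.3)        # Dirichlet parameters for the document-topic distributions

# Step 1: draw a word distribution phi_k over the vocabulary for each topic.
phi = rng.dirichlet(alpha, size=K)            # shape (K, V)

documents = []
for d in range(n_docs):
    # Step 2a: draw this document's mix of topics, theta_d.
    theta_d = rng.dirichlet(beta)             # shape (K,)
    words = []
    for _ in range(doc_len):
        # Step 2b: draw a topic for this word, then draw the word from that topic.
        z = rng.choice(K, p=theta_d)
        w = rng.choice(V, p=phi[z])
        words.append(vocab[w])
    documents.append(words)

for d, words in enumerate(documents):
    print(f"doc {d}:", " ".join(words))
```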

Note that we use two differently parameterized Dirichlet distributions in our generative process, one for drawing a multinomial distribution of topics and another for drawing a multinomial distribution of words. Although the model is simple, it does capture certain intuitions about documents and topics. In particular, it captures the notion that documents about different topics will, in general, contain different words in different proportions. Additionally, a particular word can be associated with more than one topic, but for some topics it will appear with higher frequency than for others. Documents may have a central topic, but they may discuss other topics as well, and therefore we can think of a document as dealing with a mixture of topics. A topic that is more important in a document will be so because a greater percentage of the words in the document deal with that topic.

Dirichlet distributions can be smooth or skewed, and the mixture of components can be controlled via the αk parameters. Consequently, by tuning the Dirichlet distributions appropriately, this process can produce documents with a single theme as well as documents covering many topics.

At the same time, it is important to bear in mind the limitations of the model through some of the simplifying assumptions that it makes. The model completely ignores the word order inside a document, and the generative process is memoryless, in that when it generates the nth word of a document, it does not take into account the existing n-1 words that were previously drawn for that document.

Furthermore, LDA does not attempt to model any relationships between the topics that are drawn for a document, so it cannot capture the fact that some topics, such as weather and travel or biology and chemistry, are more likely to co-occur. This is a significant limitation of the LDA model, for which solutions have been proposed. For example, one variant of LDA, known as the Correlated Topic Model (CTM), follows the same generative process as LDA but uses a different distribution that also allows us to model the correlations between topics. In our experimental section, we will also see an implementation of the CTM model.

Note

The correlated topic model was presented in A Correlated Topic Model of Science by D. M. Blei and J. D. Lafferty, published in the Annals of Applied Statistics in 2007.

Fitting an LDA model

Fitting an LDA model to a corpus of documents essentially involves computationally estimating the per-topic word distributions, φk, and the per-document topic distributions, θd, that would most likely have generated the data under the LDA generative process. These variables are hidden, or latent, which explains why the method is known as Latent Dirichlet Allocation.

A number of optimization procedures have been proposed to solve this problem, but the mathematical details are beyond the scope of this book. We will mention two of these that we will encounter in the next section. The first method is known as Variational Expectation Maximization (VEM) and is a variant of the well-known Expectation Maximization (EM) algorithm. The second is known as Gibbs sampling, which is a method based on Markov Chain Monte Carlo (MCMC).

Note

For a tutorial on the EM algorithm and VEM, we recommend The Variational Approximation for Bayesian Inference by Dimitris G. Tzikas and others in the November 2008 issue of the IEEE Signal Processing Magazine. For Gibbs sampling, there is a 1992 article in The American Statistician by George Casella and Edward I. George, titled Explaining the Gibbs Sampler. Both are readable, but quite technical. A more thorough tutorial on Gibbs sampling is Gibbs Sampling for the Uninitiated by Philip Resnik and Eric Hardisty. This last reference is less mathematically demanding and can be found online at http://www.cs.umd.edu/~hardisty/papers/gsfu.pdf
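Purely as a point of reference, here is a hedged sketch of fitting an LDA model in Python using scikit-learn, whose LatentDirichletAllocation estimator is fitted with a variational Bayes procedure; the toy corpus and parameter values are arbitrary, and this is not necessarily the toolkit used in the experimental section of this book:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus; real applications would use many more documents.
docs = [
    "rain clouds storm weather forecast",
    "flight hotel travel holiday beach",
    "gene cell protein biology experiment",
    "storm travel flight delay weather",
]

# Build the document-term matrix of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit LDA with K = 3 topics using scikit-learn's variational Bayes solver.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)         # per-document topic proportions

# Show the top words of each estimated topic.
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]
    print(f"topic {k}:", ", ".join(vocab[i] for i in top))
```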
