Chapter 10. Topic Modeling

Topic modeling is a relatively recent and exciting area that originated in the fields of natural language processing and information retrieval, but it has since found applications in a number of other domains as well. Many problems in classification, such as sentiment analysis, involve assigning a single class to a particular observation. In topic modeling, the key idea is that we can assign a mixture of different classes to an observation. As the field is inspired by information retrieval, we often think of our observations as documents and our output classes as topics. In many applications this is actually the case, so we will focus on the domain of text documents and their topics, which is a very natural way to learn about this important model. In particular, we'll focus on a technique known as Latent Dirichlet Allocation (LDA), the most widely used method for topic modeling.

An overview of topic modeling

In Chapter 8, Probabilistic Graphical Models, we saw how we can use a bag of words as the features of a Naïve Bayes model in order to perform sentiment analysis. There, the specific predictive task involved determining whether a particular movie review was expressing a positive sentiment or a negative sentiment. We explicitly assumed that each movie review expressed only one possible sentiment. Each of the words used as features (such as bad, good, fun, and so on) had a different likelihood of appearing in a review under each sentiment.

To compute the model's decision, we compare the likelihood of all the words in a particular review having been generated under one class against the likelihood of their having been generated under the other class. We adjust these likelihoods using the prior probability of each class, so that when one class is more popular in the training data, we expect to find it more frequently represented in unseen data as well. There was no opportunity for a movie review to be partially positive, with some of the words coming from the positive class, and partially negative, with the rest of the words coming from the negative class.
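As a reminder of how this decision rule works, here is a minimal sketch in Python; the word likelihoods and class priors are invented values purely for illustration, and in practice they would be estimated from the training data:

import math

# Hypothetical per-class word likelihoods; in a real model these are
# estimated from word frequencies in the training data
likelihoods = {
    "positive": {"good": 0.05, "fun": 0.04, "bad": 0.01},
    "negative": {"good": 0.01, "fun": 0.01, "bad": 0.06},
}
priors = {"positive": 0.6, "negative": 0.4}  # class frequencies in training

def classify(words):
    # Score each class: log prior plus the sum of log word likelihoods
    scores = {
        cls: math.log(priors[cls])
        + sum(math.log(likelihoods[cls].get(w, 1e-6)) for w in words)
        for cls in priors
    }
    return max(scores, key=scores.get)

print(classify(["good", "fun", "fun"]))  # -> positive

Note that every word in the review is attributed to the same single class; the winning class must account for the entire document.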

The core premise behind topic models is that in our problem, we have a set of features and a set of hidden or latent variables that generate these features. Crucially, each observation in our data contains features that have been generated from a mixture of a subset of these hidden variables. For example, an essay, website, or news article might have a central topic or theme such as politics, but might also include one or more elements of other themes as well, such as human rights, history, or economics.
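We can make this generative picture concrete with a short simulation: each document draws its own topic mixture, and every word is then produced by first picking a topic from that mixture. The two topic-word distributions below are invented for illustration:

import numpy as np

rng = np.random.default_rng(42)
vocab = ["vote", "tax", "war", "treaty", "rights", "trade"]

# Hypothetical word distributions for two topics: politics and economics
topics = np.array([
    [0.4, 0.1, 0.2, 0.2, 0.05, 0.05],    # politics
    [0.05, 0.4, 0.05, 0.05, 0.05, 0.4],  # economics
])

# Each document gets its own topic mixture from a Dirichlet prior
theta = rng.dirichlet(alpha=[0.5, 0.5])

# Generate a document: pick a latent topic per word, then a word from it
doc = []
for _ in range(10):
    z = rng.choice(len(topics), p=theta)        # latent topic assignment
    doc.append(str(rng.choice(vocab, p=topics[z])))  # observed word

print("mixture:", theta.round(2), "document:", doc)

Unlike the Naïve Bayes setting, different words in the same document can come from different topics.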

In the image domain, we might be interested in identifying a particular object in a scene from a set of visual features such as shadows and surfaces. These, in turn, might be the product of a mixture of different objects. Our task in topic modeling is to observe the words inside a document, or the pixels and visual features of an image, and from these determine the underlying mix of topics or objects, respectively.

Topic modeling on text data can be used in a number of different ways. One possible application is to group together similar documents, either based on their most predominant topic or based on their topical mix. Thus, it can be viewed as a form of clustering. By studying the topic composition, the most frequent words, and the relative sizes of the clusters we obtain, we can summarize the information in a particular collection of documents.

We can use the most frequent words and topics of a cluster to describe it directly; this, in turn, might be useful for automatically generating tags, for example, to improve the search capabilities of an information retrieval service for our documents. Yet another example might be to automatically recommend Twitter hashtags once we have built a topic model for a database of tweets.
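The following sketch shows this summarization idea using Python's scikit-learn library as one possible implementation; the four-document corpus is invented, and the top words of each fitted topic serve as a description or as candidate tags:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A tiny invented corpus; a real application would use many more documents
docs = [
    "the election and the vote dominated the news",
    "the team won the game with a late goal",
    "voters went to the polls for the election",
    "the goal keeper saved the game for the team",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# The highest-weighted words in each topic describe that topic
words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")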

When we describe documents such as websites using a bag of words approach, each document is essentially a vector indexed by the words in our dictionary. The elements of the vector are either counts of the various words or binary variables capturing whether a word was present in the document. Either way, this representation is a convenient way to encode text in a numerical format, but the result is a sparse vector in a high-dimensional space, as the word dictionary is typically large. Under a topic model, each document is instead represented by a mixture of topics. As the number of topics tends to be much smaller than the dictionary size, topic modeling can also function as a form of dimensionality reduction.
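To illustrate the reduction in dimensionality (again using scikit-learn as one possible implementation, on a tiny invented corpus), we can compare the shape of the document-term matrix with the shape of the fitted topic representation:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks fell as markets reacted to the tax news",
    "the senate passed the budget after a long vote",
    "investors bought stocks after the trade deal",
]

X = CountVectorizer().fit_transform(docs)   # sparse: one column per word
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)                    # dense: one column per topic

# The topic representation has far fewer dimensions than the dictionary
print(X.shape, "->", theta.shape)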

Finally, topic modeling can also be viewed as a predictive task for classification. If we have a collection of documents, each labeled with a predominant theme, we can perform topic modeling on this collection. If the predominant topic clustering we obtain from this method coincides with our labeled categories, we can use the model to predict a topical mixture for an unknown document and classify it according to its most predominant topic. We'll see an example of this later on in this chapter. We will now introduce the most well-known technique for performing topic modeling, Latent Dirichlet Allocation.
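As a rough sketch of this idea, assuming that after fitting we have matched each topic to a category label by inspecting its top words (the corpus and the topic-to-label mapping below are hypothetical, and the mapping may differ from run to run):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train = [
    "the vote passed the senate",
    "the striker scored a goal",
    "senators debated the vote",
    "the team scored in the game",
]

vec = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vec.fit_transform(train))

# Suppose inspecting the top words let us label the fitted topics;
# this assignment is hypothetical and must be verified on each run
topic_labels = {0: "politics", 1: "sports"}

new_doc = ["the senate will vote on the bill"]
mixture = lda.transform(vec.transform(new_doc))[0]
print(topic_labels[mixture.argmax()], mixture.round(2))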
