Topic modeling in text is loosely related to the summarization techniques we explored in Chapter 7, Automatic Text Summarization. However, topic modeling rests on a more complex mathematical foundation and produces a different type of result. The goal of text summarization is to produce a reduced version of a text that still expresses its common themes or concepts, whereas the goal of topic modeling is to expose the underlying concepts themselves.
To extend our Chapter 7, Automatic Text Summarization metaphor, in which text summarization was compared to building a scale model of a house, topic modeling is like trying to describe the purpose of a set of houses based on multiple sample dwellings. For example, the topic model of one neighborhood of houses might be busy family, storage space, and low maintenance and another neighborhood could have houses described with the words social, entertaining, luxury, and showplace. These two models clearly represent two different types of houses, designed and built with two different purposes.
How does topic modeling work? Is it more sophisticated than simply counting how many times each word occurs? In this chapter, we will learn:
Just like the keyword-based text summarization techniques we looked at in Chapter 7, Automatic Text Summarization, topic modeling takes into account which words are used in a text. However, topic modeling focuses on themes and concepts rather than solely on summarizing text. Topic models can be used for summarization, but they can also serve many other goals:
One critical component of the type of topic modeling we will investigate in this chapter is that the analyst does not need to know the topics or keywords in advance. Instead, the model is created in an unsupervised way. In unsupervised topic modeling, the computer builds the list of topics, using probabilities to determine what the topics should be and which documents and words reference them.
Researchers for the social media site Facebook published an article in 2013 about how that company uses one type of topic modeling to understand the topics users post about, and more importantly, to understand how audiences respond to those postings. The article is available for download on the Facebook Research blog here: https://research.facebook.com/publications/gender-topic-and-audience-response-an-analysis-of-user-generated-content-on-facebook/. In the paper, the authors explain their purpose:
"... We examine whether male and female [social network service] users talk about different topics, and how their audience of friends and followers respond."
They provide a list of 25 topics that they discovered are common in Facebook postings, and they list the corresponding keywords that support those topics. For example, they list the topic Sleep and the associated keywords last night, wake up, bed, nap, asleep. The topic Food includes keywords such as lunch, coffee, chicken, ice cream.
Again, the critical component of unsupervised topic modeling is that we do not need to construct a list of keywords and topics in advance. The Facebook researchers did not need to know in advance that they had a number of posts about Sleep using the keywords bed, nap, and asleep. Rather, the topic lists are generated and grouped by the topic modeling program, at which point the human analysts can suggest the umbrella terms, such as Sleep and Food, which encapsulate the ideas in each topic list. In the next section, we will take a closer look at how one of these unsupervised topic modeling programs works.
You might be curious to know what the Facebook researchers found out about the differences in topics and responses between male and female users of their service. At the end of their paper they explain:
"Using topic modeling, we find that women are more likely to broadcast personal issues, while men are more likely to post philosophical topics. Although men get fewer comments than women, masculine topics receive more comments."
For more about the Facebook study, you can download the original paper and read related work in the machine learning area of their research blog, available at: https://research.facebook.com/publications/machinelearningarea/.