Chapter 8. Topic Modeling in Text

Topic modeling in text is loosely related to the summarization techniques we explored in Chapter 7, Automatic Text Summarization. However, topic modeling rests on a more complex mathematical foundation and produces a different type of result. The goal of text summarization is to produce a reduced version of a text that still expresses its common themes or concepts, whereas the goal of topic modeling is to expose the underlying concepts themselves.

To extend the metaphor from Chapter 7, Automatic Text Summarization, in which text summarization was compared to building a scale model of a house, topic modeling is like trying to describe the purpose of a set of houses based on multiple sample dwellings. For example, the topic model of one neighborhood of houses might be busy family, storage space, and low maintenance, while another neighborhood could have houses described with the words social, entertaining, luxury, and showplace. These two models clearly represent two different types of houses, designed and built for two different purposes.

How does topic modeling work? Is it more sophisticated than simply counting how many times each word occurs? In this chapter, we will learn:

  • What is topic modeling? What are some of the common techniques we can use to accomplish this task?
  • What are the currently available libraries and tools for applying topic modeling in Python, and how do they work?
  • How can we compare the effectiveness of a topic modeling approach, in terms of the results it generates?
  • How do you apply topic modeling to a real-world problem?

What is topic modeling?

Like the keyword-based text summarization techniques we looked at in Chapter 7, Automatic Text Summarization, topic modeling takes into account which words are used in a text. However, the focus of topic modeling is on themes and concepts, not solely on summarizing text. Topic models can be used for summarization, but they can also be used for many other goals:

  • Topic models can assist with organization of documents, for example, to group news articles together into a cohesive section
  • Topic models can help us make recommendations about what to read next by finding materials that have a topic list in common
  • Topic models can improve search results by revealing documents that may use a mix of different keywords but are about the same idea
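To make the recommendation idea concrete, here is a minimal sketch that ranks documents by how similar their topic mixes are. It assumes a topic model has already assigned each document a proportion for each topic. The article names, the three topic proportions, and the helper functions cosine_similarity and recommend are all hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two topic-proportion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical topic proportions per document, in the order
# (sleep, food, travel). A real topic model would produce these.
topic_mix = {
    "article_a": [0.80, 0.15, 0.05],
    "article_b": [0.75, 0.05, 0.20],
    "article_c": [0.05, 0.90, 0.05],
    "article_d": [0.10, 0.10, 0.80],
}


def recommend(current, mixes, top_n=2):
    """Return the names of the top_n documents closest in topic mix to current."""
    scores = [
        (cosine_similarity(mixes[current], vec), name)
        for name, vec in mixes.items()
        if name != current
    ]
    return [name for _, name in sorted(scores, reverse=True)[:top_n]]


suggestions = recommend("article_a", topic_mix)
print(suggestions)
```

Because the comparison is made on topic proportions rather than on raw words, two documents can be matched even when they share few keywords, which is also what makes topic models useful for search.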

One critical component of the type of topic modeling we will investigate in this chapter is that the analyst does not need to know what the topics or keywords are in advance. Instead, the model is created in an unsupervised way. In unsupervised topic modeling, the list of topics is built by the computer, which uses probabilities to determine what the topics should be and which documents and words reference those topics.
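As a rough illustration of how a computer can build topics from probabilities alone, here is a minimal sketch of one widely used unsupervised method, latent Dirichlet allocation (LDA), fitted with collapsed Gibbs sampling. The choice of LDA, the function name fit_lda, the toy posts, and the parameter values (num_topics, alpha, beta, iterations) are all assumptions made for illustration. This section does not name a specific algorithm, and a real project would normally rely on a library rather than a hand-written sampler.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts. No topic names or keywords are supplied.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by assigning every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Repeatedly resample each word's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # P(topic) is proportional to (how much this document uses
                # the topic) * (how much the topic uses this word).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Invented toy posts, already tokenized into lowercase words.
posts = [
    "bed nap asleep nap bed tired".split(),
    "asleep bed tired nap night".split(),
    "lunch coffee chicken coffee lunch".split(),
    "chicken lunch coffee dessert".split(),
    "nap bed coffee lunch".split(),
]

doc_topic, topic_word = fit_lda(posts, num_topics=2)

for k, counts in enumerate(topic_word):
    top_words = [w for w, c in counts.most_common(4) if c > 0]
    print(f"Topic {k}: {top_words}")

for post, counts in zip(posts, doc_topic):
    total = sum(counts)
    print([round(c / total, 2) for c in counts], " ".join(post))
```

The sampler only ever sees word counts. The topics come out as numbered word lists, and it is up to a human reader to decide what, if anything, each list is about.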

Researchers for the social media site Facebook published an article in 2013 about how that company uses one type of topic modeling to understand the topics users post about, and more importantly, to understand how audiences respond to those postings. The article is available for download on the Facebook Research blog here: https://research.facebook.com/publications/gender-topic-and-audience-response-an-analysis-of-user-generated-content-on-facebook/. In the paper, the authors explain their purpose:

"... We examine whether male and female [social network service] users talk about different topics, and how their audience of friends and followers respond."

They provide a list of 25 topics that they discovered are common in Facebook postings, and they list the corresponding keywords that support those topics. For example, they list the topic Sleep and the associated keywords last night, wake up, bed, nap, asleep. The topic Food includes keywords such as lunch, coffee, chicken, ice cream.

Again, the critical component of unsupervised topic modeling is that we do not need to construct a list of keywords and topics in advance. The Facebook researchers did not need to know in advance that they had a number of posts about Sleep using the keywords bed, nap, and asleep. Rather, the topic lists are generated and grouped by the topic modeling program, at which point the human analysts can suggest the umbrella terms, such as Sleep and Food, which encapsulate the ideas in each topic list. In the next section, we will take a closer look at how one of these unsupervised topic modeling programs works.

Note

You might be curious to know what the Facebook researchers found out about the differences in topics and responses between male and female users of their service. At the end of their paper they explain:

"Using topic modeling, we find that women are more likely to broadcast personal issues, while men are more likely to post philosophical topics. Although men get fewer comments than women, masculine topics receive more comments."

For more about the Facebook study, you can download the original paper and read related work in the machine learning area of their research blog, available at: https://research.facebook.com/publications/machinelearningarea/.
