Chapter 3. Topic Modeling – Changing Concerns in the State of the Union Addresses

A huge source of data right now is the unstructured, natural-language text that is everywhere on the Internet. Think of all the news articles, blog posts, Twitter posts, and YouTube comments, as well as the thousands of other ways that people create and share textual content online. What they're saying may be important to you, and tracking the subjects they talk about is a useful way to stay aware of trends and conversations.

One tool for exploring what a group of text documents discusses is topic modeling. This technique identifies the "topics" discussed in a collection of documents, although as we'll see, "topic" is defined a little differently here than it is in informal conversation. The strength of these models is that they don't assume that each document talks about only one thing. Instead, they model documents as collections of topics. This allows for a richer conception of what a document is, as well as more complex patterns between documents.
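To make the idea of "documents as collections of topics" concrete, here is a minimal sketch. It treats a topic as a weighting over words and a document as a weighting over topics. The topic names, words, and numbers are invented for illustration, and `word-probability` is a hypothetical helper, not part of any library or of the project built later in this chapter.

```clojure
;; Hypothetical data: each topic maps words to weights.
;; These topics and numbers are made up for this sketch.
(def topics
  {:economy {"tax" 0.4, "jobs" 0.35, "trade" 0.25}
   :defense {"army" 0.5, "navy" 0.3, "treaty" 0.2}})

;; A document is a mixture of topics rather than a single topic.
(def document-mixture
  {:economy 0.7, :defense 0.3})

(defn word-probability
  "Returns the weight of `word` in a document with the given topic
  `mixture`: the sum over topics of (topic weight in the document)
  times (word weight in the topic). Words a topic never uses count
  as 0.0 for that topic."
  [topics mixture word]
  (reduce + (for [[topic weight] mixture]
              (* weight (get-in topics [topic word] 0.0)))))

;; Example call: how strongly does this document favor the word "tax"?
(word-probability topics document-mixture "tax")
```

Because the document draws on both topics, words from either one can appear in it, each weighted by how much of the document that topic accounts for.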

In this chapter, we will cover the following topics:

  • Understanding data in the State of the Union addresses
  • Understanding topic modeling
  • Preparing for visualizations
  • Setting up the project
  • Getting the data
  • Visualizing data with D3 and ClojureScript
  • Exploring the topics

Understanding data in the State of the Union addresses

In this chapter, we'll apply topic modeling to the State of the Union (SOTU) addresses presented by the presidents of the United States of America. Each January or February, the President addresses the US Senate and the House of Representatives, either in person or in writing, to report on how the country is doing and to outline an agenda for the coming year. The speeches can be fairly short, but the written reports can be much longer. George Washington's first State of the Union address from 1790 had fewer than 500 words. Barack Obama's latest SOTU (at the time of this writing in 2013) had over 3,000 words. Jimmy Carter had the longest SOTU address, which he delivered in writing in 1981. It is almost 14,000 words long.

The length of the SOTU address increased gradually and peaked around 1910. This is because, from Thomas Jefferson's 1801 address through William H. Taft's 1912 address, the SOTU was a written report delivered to Congress rather than a speech. The following graph represents the increase in the word counts of SOTU addresses:

[Graph: word counts of SOTU addresses over time]

Of course, as the situation has changed both domestically and internationally, so have the topics that the President discusses in the SOTU addresses. You wouldn't expect John Adams' 1800 address to talk about the same things as Bill Clinton's 2000 address. This immediately raises the question: what topics have the Presidents talked about in their SOTU addresses and how have those topics changed over time?

This isn't a new question, even for topic modeling. Xuerui Wang and Andrew McCallum covered it as one of several examples in their 2006 paper, Topics over time: A non-Markov continuous-time model of topical trends (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.152.2460). In this paper, they present a way of analyzing a series of time-stamped documents to better understand how topics interact over time. This remains an area of considerable research, and there are a number of other extensions to topic modeling that take time into account.

In this chapter, we're only going to cover the most widely used topic modeling algorithm today: Latent Dirichlet Allocation (LDA). With an understanding of this procedure and the thinking behind it, you can understand Wang and McCallum's Topics over Time algorithm without too much difficulty.
