Performing topic modeling with MALLET

Previously in this chapter, we looked at a number of ways to programmatically see what's present in documents: we identified people, places, dates, and other entities, and we broke texts up into sentences.

Another, more sophisticated way to discover what's in a document is to use topic modeling. Topic modeling attempts to identify a set of topics that are contained in the document collection. Each topic is a cluster of words that tend to be used together throughout the corpus, and each document is composed of several topics to varying extents. We'll take a look at this in more detail in the explanation for this recipe.

To perform topic modeling, we'll use MALLET (http://mallet.cs.umass.edu/). This is a library and utility that implements topic modeling in addition to several other document classification algorithms.

Getting ready

For this recipe, we'll need these lines in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cc.mallet/mallet "2.0.7"]])

Our imports and requirements for this are pretty extensive too, as shown here:

(require '[clojure.java.io :as io])
(import [cc.mallet.types InstanceList]
        [cc.mallet.pipe
         Input2CharSequence TokenSequenceLowercase
         CharSequence2TokenSequence SerialPipes
         TokenSequenceRemoveStopwords
         TokenSequence2FeatureSequence]
        [cc.mallet.pipe.iterator FileListIterator]
        [cc.mallet.topics ParallelTopicModel]
        [java.io FileFilter])

Again, we'll use the State of the Union addresses that we've already seen several times in this chapter. You can download these from http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the data from this file into the sotu directory.

How to do it…

We'll need to work the documents through several phases to perform topic modeling, as follows:

  1. Before we can process any documents, we'll need to create a processing pipeline. This defines how the documents should be read, tokenized, normalized, and so on:
    (defn make-pipe-list []
      (InstanceList.
        (SerialPipes.
          [;; Read each file into a string.
           (Input2CharSequence. "UTF-8")
           ;; Tokenize on runs of letters, allowing internal punctuation.
           (CharSequence2TokenSequence.
             #"\p{L}[\p{L}\p{P}]+\p{L}")
           ;; Lowercase the tokens, remove English stop words, and
           ;; convert the tokens to feature indices.
           (TokenSequenceLowercase.)
           (TokenSequenceRemoveStopwords. false false)
           (TokenSequence2FeatureSequence.)])))
  2. Now, we'll create a function that takes the processing pipeline and a directory of data files and runs the files through the pipeline. This populates the InstanceList, which is a collection of documents along with their metadata:
    (defn add-directory-files [instance-list corpus-dir]
      (.addThruPipe
        instance-list
        (FileListIterator.
          (.listFiles (io/file corpus-dir))
          ;; Accept every file in the directory.
          (reify FileFilter
            (accept [this pathname] true))
          ;; Pull each document's name out of its file name.
          #"/([^/]*)\.txt$"
          true)))
  3. The last function takes the InstanceList and some other parameters and trains a topic model, which it returns:
    (defn train-model
      ([instances] (train-model 100 4 50 instances))
      ([num-topics num-threads num-iterations instances]
       ;; 1.0 and 0.01 are the alpha-sum and beta hyperparameters.
       (doto (ParallelTopicModel. num-topics 1.0 0.01)
         (.addInstances instances)
         (.setNumThreads num-threads)
         (.setNumIterations num-iterations)
         (.estimate))))

Now, we can take these three functions and use them to train a topic model. While training, it will output some information about the process, and finally, it will list the top terms for each topic:

user=> (def pipe-list (make-pipe-list))
user=> (add-directory-files pipe-list "sotu/")
user=> (def tm (train-model 10 4 50 pipe-list))
…
INFO:
0       0.1     government federal year national congress war
1       0.1     world nation great power nations people
2       0.1     world security years programs congress program
3       0.1     law business men work people good
4       0.1     america people americans american work year
5       0.1     states government congress public people united
6       0.1     states public made commerce present session
7       0.1     government year department made service legislation
8       0.1     united states congress act government war
9       0.1     war peace nation great men people
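
MALLET can also give us the top words programmatically, rather than our having to scrape them from the log output. Here's a minimal sketch (top-words is our own helper, not part of MALLET); it reads the model's per-topic sorted word weights and looks each word ID up in the instance list's alphabet:

    (defn top-words
      "Return the n most heavily weighted words for each topic."
      [model instances n]
      (let [alphabet (.getDataAlphabet instances)]
        (for [sorted-words (.getSortedWords model)]
          (->> sorted-words
               (take n)
               (map #(.lookupObject alphabet (.getID %)))))))

We can call it on the model and instance list from the session above:

user=> (top-words tm pipe-list 6)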

How it works…

It's difficult to explain succinctly and clearly how topic modeling works. Conceptually, it assigns the words from the documents to buckets (topics) in such a way that randomly drawing words from the buckets would most probably recreate the documents.
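
To make that generative story concrete, here's a toy sketch. This is not MALLET's actual algorithm, and the topics, words, and probabilities are invented. Each document has a distribution over topics, each topic has a distribution over words, and we produce a "document" by repeatedly sampling from both:

    (defn sample-categorical
      "Draw one key from a map of key -> probability."
      [dist]
      (let [r (rand)]
        (loop [[[k p] & more] (seq dist)
               cum 0.0]
          (if (or (nil? more) (< r (+ cum p)))
            k
            (recur more (+ cum p))))))

    (defn sample-document
      "Pick a topic for each word, then pick a word from that topic."
      [doc-topics topic-words n]
      (repeatedly
        n
        #(sample-categorical (topic-words (sample-categorical doc-topics)))))

user=> (sample-document {:war 0.7, :economy 0.3}
                        {:war {"peace" 0.5, "nation" 0.5},
                         :economy {"business" 0.6, "work" 0.4}}
                        6)

Topic modeling runs this process backwards: given only the documents, it infers topic-word and document-topic distributions that make the observed documents probable.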

Interpreting the topics is always interesting. Generally, it involves looking at the top words for each topic and cross-referencing them with the documents that score most highly for that topic.

For example, take the fourth topic (topic 3 in the preceding listing), whose top words are law, business, men, and work. The top-scoring document for this topic was the 1908 State of the Union address, with a distribution of 0.378. It was given by Theodore Roosevelt, and in it he talked a lot about labor issues and legislation to rein in corrupt corporations. All of the topic's top words do appear frequently in the address, but what the topic is really about isn't evident without actually reading the document itself.
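
To find the top-scoring document for a topic yourself, you can ask the model for each document's topic distribution. Here's a minimal sketch (top-doc-for-topic is our own helper; it assumes the tm and pipe-list bindings from the session above):

    (defn top-doc-for-topic
      "Return the name of the document that scores highest for a topic."
      [model instances topic]
      (let [i (apply max-key
                     #(aget (.getTopicProbabilities model %) topic)
                     (range (.size instances)))]
        (.getName (.get instances i))))

user=> (top-doc-for-topic tm pipe-list 3)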

See also…

There are a number of good papers and tutorials on topic modeling. Shawn Graham, Scott Weingart, and Ian Milligan have written a helpful tutorial, available at http://programminghistorian.org/lessons/topic-modeling-and-mallet

For a more rigorous explanation, check out Mark Steyvers's introduction Probabilistic Topic Models, which you can see at http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf

For some information on how to evaluate the topics that you get, see http://homepages.inf.ed.ac.uk/imurray2/pub/09etm
