Topic modeling descriptions

Another way to gain a better understanding of the descriptions is to use topic modeling. We learned about this text mining and machine learning algorithm in Chapter 3, Topic Modeling – Changing Concerns in the State of the Union Addresses. In this case, we'll see if we can use it to create topics over these descriptions and to pull out the differences, trends, and patterns from this set of texts.

First, we'll create a new namespace to handle our topic modeling. We'll use the src/ufo_data/tm.clj file. The following is the namespace declaration for it:

(ns ufo-data.tm
  (:require [clojure.java.io :as io]
            [clojure.string :as str]
            [clojure.pprint :as pp])
  (:import [cc.mallet.types InstanceList]
           [cc.mallet.pipe
            Input2CharSequence TokenSequenceLowercase
            CharSequence2TokenSequence SerialPipes
            TokenSequenceRemoveStopwords
            TokenSequence2FeatureSequence]
           [cc.mallet.pipe.iterator ArrayIterator]
           [cc.mallet.topics ParallelTopicModel]
           [java.io FileFilter]
           [java.util Formatter Locale]))

The process for generating the topic model is very similar to the process that we used in Chapter 3, Topic Modeling – Changing Concerns in the State of the Union Addresses. The first change that we need to make is that we'll load the instances from the in-memory data that we read earlier in this chapter. We'll create a function that pushes an input collection into an array and uses ArrayIterator to then feed that array into the processing pipeline. The function to train the model is the same as it was in that chapter.
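The loading step described above might look something like the following. This is a minimal sketch rather than the chapter's actual code: the names make-instance-list and load-descriptions are hypothetical, the tokenizing regular expression and the stop-word flags are assumptions, and the snippet needs MALLET on the classpath.

```clojure
(import '[cc.mallet.types InstanceList]
        '[cc.mallet.pipe
          CharSequence2TokenSequence TokenSequenceLowercase
          TokenSequenceRemoveStopwords TokenSequence2FeatureSequence
          SerialPipes]
        '[cc.mallet.pipe.iterator ArrayIterator])

;; Hypothetical helper: an empty InstanceList backed by a processing
;; pipeline. The token regex and stop-word flags are assumptions.
(defn make-instance-list []
  (InstanceList.
    (SerialPipes.
      [(CharSequence2TokenSequence. #"\p{L}[\p{L}\p{P}]+\p{L}")
       (TokenSequenceLowercase.)
       (TokenSequenceRemoveStopwords. false false)
       (TokenSequence2FeatureSequence.)])))

;; Hypothetical helper: copy the in-memory descriptions into an array
;; and feed that array through the pipeline with ArrayIterator.
(defn load-descriptions [descriptions]
  (let [instances (make-instance-list)
        data-array (to-array descriptions)]
    (.addThruPipe instances (ArrayIterator. data-array))
    instances))
```

Each string in the collection becomes one instance, so the resulting InstanceList holds one entry per description.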

In this chapter, we'll look at more functions that help us introspect on the trained model, the instances, and the probabilities and keywords that are important to each topic. The first function returns the words that apply to a topic and their weights. The model gives us each topic's words as sorted IDs with weights, and the instance list's data alphabet maps those IDs back to the words themselves:

(defn get-topic-words
  "Returns a lazy sequence of [word weight] pairs for topic topic-n,
  in the order that the model's sorted word list provides."
  [model instances topic-n]
  (let [topic-words (.getSortedWords model)
        data-alpha (.getDataAlphabet instances)]
    (map #(vector (.lookupObject data-alpha (.getID %))
                  (.getWeight %))
         (iterator-seq (.iterator (.get topic-words topic-n))))))
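From the REPL, the pairs returned by get-topic-words can be trimmed down to a short keyword list. The following sketch is an illustration only: top-words is a hypothetical helper, and sample-pairs is made-up stand-in data in the same [word weight] shape, so the snippet runs without a trained model.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: keep the n heaviest words from a sequence of
;; [word weight] pairs and join them into one space-separated string.
(defn top-words [n word-weights]
  (->> word-weights
       (sort-by second >)   ; heaviest words first
       (take n)
       (map first)
       (str/join " ")))

;; Made-up stand-in for the output of get-topic-words.
(def sample-pairs
  [["sky" 310.0] ["light" 420.0] ["bright" 150.0] ["red" 90.0]])

(top-words 3 sample-pairs)
;; => "light sky bright"
```

With a trained model, the same helper could be called as (top-words 10 (get-topic-words model instances 0)).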

The other reporting function that we'll use ranks the instances by their probabilities for each topic. We can use this to look at the documents that are most likely to apply to any particular topic:

(defn rank-instances
  "Returns [probability document] pairs for topic topic-id, sorted
  from most to least probable."
  [model data topic-id]
  (let [get-p (fn [n]
                [(aget (.getTopicProbabilities model n) topic-id)
                 (nth data n)])]
    (->> data count range (map get-p) (sort-by first) reverse)))
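To see the ranking logic in isolation, here is a small sketch that works on plain Clojure data instead of a trained model. The function rank-by-topic and the sample documents and probabilities are all hypothetical; each inner vector stands in for the array that the model would return for one document.

```clojure
;; Hypothetical variant of rank-instances: doc-topic-probs holds one
;; vector of topic probabilities per document, in the same order as docs.
(defn rank-by-topic [doc-topic-probs docs topic-id]
  (->> (map (fn [probs doc] [(nth probs topic-id) doc])
            doc-topic-probs docs)
       (sort-by first >)))   ; most probable documents first

;; Made-up documents and per-document topic probabilities.
(def sample-docs
  ["blue light over road" "report filed with nuforc" "lights moving north"])
(def sample-probs
  [[0.7 0.3] [0.1 0.9] [0.5 0.5]])

(rank-by-topic sample-probs sample-docs 1)
;; => ([0.9 "report filed with nuforc"]
;;     [0.5 "lights moving north"]
;;     [0.3 "blue light over road"])
```

Taking the first few pairs of the result gives the documents that best represent a topic, which is how the descriptions below were checked against each topic's top words.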

We can use these functions—as well as a few others based on these—from the REPL to explore our data.

Generally, when deciding how many topics to use, we'll want some kind of objective metric to find a good partition of the documents. However, for exploring in an off-the-cuff way, we'll use something more subjective. After playing around with the number of topics, I chose a topic count of twelve. Since all of these descriptions are really about just one thing, UFO sightings, I didn't expect there to be too many meaningful topics, even at a fairly detailed, narrow level. At twelve topics, there still seemed to be some vague, less helpful topics, but the more interesting topics that I'd seen before were still there. When I tried fewer topics, some of those interesting topics disappeared.

So to get started, let's see the topics and the top 10 words for each. Remember that the topic descriptions here aren't generated by the computer. I came up with them after looking at the top words and the top few descriptions for those topics. Some of these are not obvious, given the small sample of terms included here. However, looking further into the topic terms and the documents themselves suggested these categorizations. In some cases, I've included notes in parentheses:

  • Remembering childhood experiences: back time house craft car looked years remember road home
  • Lots of NUFORC notes, thanks to other organizations or local chapters: report witness nuforc note ufo sighting pd date reported object
  • Bright, silent objects in the sky: light sky bright lights white star object red moving looked
  • Visual descriptions: lights sky light time night red minutes objects back bright (this one doesn't have a clear topic as it's commonly defined)
  • White, red, and reddish-orange lights: light sky lights looked bright moving object back red white
  • Very fast, bright objects in the sky, compared to airplanes and meteors: lights sky object aircraft light west north appeared flying south
  • NUFORC notes. "Witness elects to remain totally anonymous": nuforc note pd witness date sky light anonymous remain approximate
  • Vague: ufo camera air object picture time pictures photo photos day (again, the subject of this topic isn't clear)
  • Objects in the sky, no lights, or not mentioned: object driving road car lights shaped craft looked side feet
  • Abductions, visitations, fear. Close encounters of the fourth kind: time night back looked light house thing window lights sound
  • Sightings. Moving in different directions: lights object craft light flying white north south east moving
  • Technical descriptions: object sky light moving objects appeared bright time high north

Several of these topics, for instance the third, fifth, sixth, and ninth bullets, seem to be pretty generic descriptions of sightings. They describe lots of moving lights in the sky.

Other topics are more interesting. Topic one contained a number of descriptions written by people looking back at their childhood or college years. For instance, in the following paragraph, someone describes having a close encounter when they were about six years old. There are a number of spelling mistakes, and part of the reason I've kept it in is to illustrate just how messy this data can be:

Blus light, isolated road, possible missing timeI was six years old at the time, and even now, if I concentrate, I can recall what happened. My mother, her best friend, and myself were driving on a section of road called "Grange Road." Today, there are a lot of houses, but at the time, it was all farmland with maybe one or two houses. It was just after midnight, and I remember waking up. I was alseep in the back seat, and I woke up feeling very frightened. I sat up, and my mother and her friend were obviously worried. The car we were in was cutting in-and-out, and finally died. As soon as the car stopped, we all saw a blue light directly ahead, maybe about 20 feet off of the ground, and about a football field legnth away. It glided towards us, made no noise, and as it got to within 15 feet, it stopped in midair, hoovering. My mom grabbed me from the backseat and held on, and her friend was crying. I was crying, too, because whatever it was, it was making us all upset. After about five minutes, I don't recall what happened, because for whatever reason, I fell alseep. Weird, I know, but I swear it happened. I woke up sometime later, and we three were sitting there, shocked, and the light was gone. My mom and her friend - to this day - swear they had missing time, about 10 minutes worth. I hope this helps...((NUFORC Note: Witness indicates that date of sighting is approximate. PD))

Some topics are puzzling; number eight is one example. The top 10 documents for it had nothing obvious that made them a coherent subject. Some subtler pattern of vocabulary may have been getting picked up, but it wasn't readily apparent.
