Hoaxes

One of the most interesting finds in this analysis was topic seven. This topic focused on annotations added to descriptions whose witnesses wished to remain anonymous. But its most likely document was the following:

Round, lighted object over Shelby, NC, hovered then zoomed away. It was my birthday party and me and my friends were walking around the block about 21:30. I just happened to look up and I saw a circular object with white and bright blue lights all over the bottom of it. It hovered in place for about 8 seconds then shot off faster than anything I have ever seen.((NUFORC Note: Witness elects to remain totally anonymous; provides no contact information. Possible hoax?? PD))((NUFORC Note: Source of report indicates that the date of the sighting is approximate. PD))

What caught my attention was the note "Possible hoax??" Several other descriptions in this topic had similar notes, often including the word hoax.

Finding this raised an interesting possibility: could we train a classifier to recognize possible hoaxes? My initial reaction was to be skeptical. But I still thought it would be an interesting experiment.

Eventually, we'll want to load this data and process it with MALLET (http://mallet.cs.umass.edu/). MALLET works a little more easily with data that's kept in a particular directory format. The template for this is base-directory/tag/data-file.txt. In fact, we'll include a directory above these, and for base-directory, we'll define a directory for training data and one for test data.
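Concretely, by the end of preprocessing, the layout will look something like the following (using bayes-data as the base directory, as we will later; the numbered filenames come from the processing steps below):

    bayes-data/
        train/
            hoax/
                0.txt
                ...
            non-hoax/
                ...
        test/
            hoax/
                ...
            non-hoax/
                ...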

The training group is used to train the classifier, and the test group is used to evaluate the classifier after it's been trained in order to determine how successful it is. Having two different groups for these tasks helps to find whether the classifier is over-fitting, that is, whether it has learned the training group so well that it performs poorly on new data.

Preparing the data

So before we get started, we'll preprocess the data to put it into the directory structure just described. All the code for this will go into the src/ufo_data/model.clj file. The namespace declaration for this is as follows:

(ns ufo-data.model
  (:require [clojure.java.io :as io]
            [clojure.core.reducers :as r]
            [clojure.string :as str]
            [clojure.data.json :as json]
            [clj-time.format :as tf]
            [ufo-data.text :as t]
            [ufo-data.util :refer :all]
            [me.raynes.fs :as fs])
  (:import [java.lang StringBuffer]))

Now, to process this dataset into a form that MALLET can deal with easily, we're going to put it through the following steps:

  1. Read the data into a sequence of data records.
  2. Split out the NUFORC comments.
  3. Categorize the documents based on the comments.
  4. Partition them into directories based on the categories.
  5. Divide them into training and test sets.

Let's see how we'll put these together.

Reading the data into a sequence of data records

The data in the downloaded file has a number of problems with values that can't be escaped properly. I've cleaned this up and made a new data file, available at http://www.ericrochester.com/clj-data-master/data/ufo.json. I've saved this into my data directory and bound that path to the name *data-file*. You can find this and a few other definitions in the code download for this chapter.

But primarily, I'd like to focus on the data record for a minute. This just contains the fields from the JSON objects being read in. The following definition will serve as documentation of our data and make working with the rows a little easier:

(defrecord UfoSighting
  [sighted-at reported-at location shape duration description
   year month season])

The data as we read it in from the JSON file won't be quite right, however. We'll still need to convert date strings into data objects. We'll do that with read-date, which parses a single date string, and with coerce-fields, which coordinates the calling of read-date on the appropriate fields in UfoSighting, as shown in the following code:

(def date-formatter (tf/formatter "yyyyMMdd"))
(defn read-date [date-str]
  (try
    (tf/parse date-formatter date-str)
    (catch Exception ex
      nil)))
(defn coerce-fields [ufo]
  (assoc ufo
         :sighted-at (read-date (:sighted-at ufo))
         :reported-at (read-date (:reported-at ufo))))

Now we can use these functions to read and parse each line of the input data file. As shown in the following code, each line is a separate JSON object:

(defn read-data
  ([] (read-data *data-file*))
  ([filename]
   (with-open [f (io/reader filename)]
     (->> f
       line-seq
       vec
       (r/map #(json/read-str % :key-fn keyword))
       (r/map map->UfoSighting)
       (r/map coerce-fields)
       (into [])))))

Now we can use these on the REPL to load the data file. As shown in the following code, in this session, model is bound to ufo-data.model:

user=> (def data (model/read-data))
user=> (count data)
61067
user=> (first data)
{:sighted-at nil,
 :reported-at nil,
 :location " Iowa City, IA",
 :shape "",
 :duration "",
 :description
 "Man repts. witnessing \"flash, followed by a classic UFO, w/ a tailfin at back.\" Red color on top half of tailfin. Became triangular.",
 :year nil,
 :month nil,
 :season nil,
 :reported_at "19951009",
 :sighted_at "19951009"}

Looks good. We're ready to start processing the descriptions further.

Splitting the NUFORC comments

Many of the descriptions contain comments by NUFORC (http://www.nuforc.org/). These contain editorial remarks – some of them about the authenticity of the report. The following is a sample description with NUFORC commentary:

Telephoned Report:Husband and wife were awakened by a very bright light outside their house in Rio Vista area of McCall. It was so bright, it was "like being inside a football stadium." No sound. Ground was covered with snow at the time. It lasted for 10 seconds.((NUFORC Note: We spoke with the husband and wife, and found them to be quite credible and convincing in their description of what they allegedly had seen. Both have responsible jobs. PD))

This is a standard format for these comments: They're enclosed in double parentheses and begin with "NUFORC." We can leverage this information, and a regular expression, to pull all the notes out of the document.
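Before building the full function, we can check that pattern on the REPL. Clojure's built-in re-seq is enough to pull the notes out, although it won't give us the stripped description (the sample text here is made up for illustration):

```clojure
;; Find all NUFORC-style comments: text enclosed in double parentheses.
(def sample
  "Bright light hovered.((NUFORC Note: Possible hoax?? PD)) Then it shot off.")

;; re-seq returns every non-overlapping match of the pattern.
(re-seq #"\(\(.*?\)\)" sample)
;; → ("((NUFORC Note: Possible hoax?? PD))")
```

To also remove the comments from the description, we'll need the lower-level Matcher API.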

To do this, we'll go a little deeper into the Java regular expression API than Clojure's utility functions wrap. Let's see what we need to do, and then we can take the code apart after the following listing:

(defn split-nuforc [text]
  (let [m (.matcher #"\(\(.*?\)\)" text), sb (StringBuffer.)]
    (loop [accum []]
      (if (.find m)
        (let [nuforc (.substring text (.start m) (.end m))]
          (.appendReplacement m sb "")
          (recur (conj accum nuforc)))
        (do
          (.appendTail m sb)
          [(str sb) (str/join " " accum)])))))

So first we create a regular expression that picks out text enclosed in double parentheses. We also create java.lang.StringBuffer. We'll use this to accumulate the description of the UFO sighting, with the NUFORC comments stripped out.

The body of the function is a loop that has a single parameter, a vector named accum. This will accumulate the NUFORC comments.

Inside the loop, every time the regular expression finds a match, we extract the NUFORC comment out of the original string and replace the match with an empty string in StringBuffer. Finally, when there are no more matches on the regular expression, we append the rest of the string onto StringBuffer, and we can retrieve its contents and the comments, joined together.

Let's see what happens when we strip the NUFORC comments from the description quoted earlier:

user=> (def split-descr (model/split-nuforc description))
user=> (first split-descr)
"Telephoned Report:Husband and wife were awakened by a very bright light outside their house in Rio Vista area of McCall.  It was so bright, it was \"like being inside a football stadium.\"  No sound.  Ground was covered with snow at the time.  It lasted for 10 seconds."
user=> (second split-descr)
"((NUFORC Note:  We spoke with the husband and wife, and found them to be quite credible and convincing in their description of what they allegedly had seen.   Both have responsible jobs.  PD))"

So we can see that the first item in the pair returned by split-nuforc contains the description by itself, and the second item is the comments.

Now we can use the comments to categorize the descriptions in the first part. And we'll use that to figure out where to save the cleaned-up descriptions.

Categorizing the documents based on the comments

Categorizing the documents is relatively easy. We'll use a tokenize function, which can be found in the code download for this chapter, in the namespace ufo-data.text (which is aliased to t in the code). We can convert the words in the comment to a set of tokens and then look for the word "hoax". If found, we'll categorize it as follows:

(defn get-category [tokens]
  (if (contains? (set tokens) "hoax")
    :hoax
    :non-hoax))

When called with the tokens of a comment, it returns the category of the description as follows:

user=> (model/get-category
         (map t/normalize (t/tokenize (second split-descr))))
:non-hoax

Of course, this is very rough, but it should be all right for this experiment.

Partitioning the documents into directories based on the categories

Now that they're in categories, we can use those categories to save the descriptions into files. Each description will be in its own file.

Initially, we'll put all of the files into one pair of directories. In the next step, we'll divide them further into test and training sets.

The first function for this section will take a base directory, a number, and the document pair, as returned by ufo-data.model/split-nuforc. From there, it will save the text to a file and return the file's category and filename, as shown in the following code:

(defn save-document [basedir n doc]
  (let [[text category] doc
        filename (str basedir "/" (name category) "/" n ".txt")]
    (spit filename text)
    {:category category, :filename filename}))

The next function, make-dirtree-sighting, will do a lot of the work. It will take an instance of UfoSighting and will split out the NUFORC commentary, tokenize both parts, get the category, and use it to save the filename, as shown in the following code:

(defn make-dirtree-sighting
  ([basedir]
   (fn [sighting n]
     (make-dirtree-sighting basedir sighting n)))
  ([basedir sighting n]
   (->> sighting
     :description
     split-nuforc
     (on-both #(map t/normalize (t/tokenize %)))
     (on-second get-category)
     (on-first #(str/join " " %))
     (save-document basedir n))))

This will handle saving each file individually into one pair of directories: one for hoaxes and one for non-hoaxes. We'll want to process all of the UFO sightings, however, and we'll want to divide the two sets of documents into a test set and a training set. We'll do all of this in the next section.
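The on-both, on-first, and on-second helpers live in ufo-data.util, which is part of the code download. Assuming they simply apply a function to one or both halves of a two-element pair, minimal definitions would look something like this (these are sketches, not necessarily the download's exact code):

```clojure
;; Hypothetical minimal versions of the ufo-data.util pair helpers.
(defn on-both   [f [a b]] [(f a) (f b)])
(defn on-first  [f [a b]] [(f a) b])
(defn on-second [f [a b]] [a (f b)])
```

With helpers like these, the pair returned by split-nuforc flows through the ->> pipeline: both halves are tokenized, the comment half is reduced to a category, and the description half is rejoined into a string before being saved.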

Dividing them into training and test sets

Now, we can divide the data that we have into a training set and a test set. We'll need the following two utility functions to do this:

  1. We'll need to create subdirectories for the categories several times. Let's put that into the following function:
    (defn mk-cat-dirs [base]
      (doseq [cat ["hoax" "non-hoax"]]
        (fs/mkdirs (fs/file base cat))))
  2. We'll also need to divide a collection into two groups by ratio, as shown in the following code. That is, one subgroup will be 80 percent of the original and the other subgroup will be 20 percent of the original.
    (defn into-sets [ratio coll]
      (split-at (int (* (count coll) ratio)) coll))

Now, the function to move a collection of files into a stage's subdirectory (testing or training) will be mv-stage. The collection of files is generated by save-document, so it's a collection of maps, each containing the category and filename of the file, as shown in the following code:

(defn mv-stage [basedir stage coll]
  (let [stage-dir (fs/file basedir stage)]
    (doseq [{:keys [category filename]} coll]
      (fs/copy filename
               (fs/file stage-dir (name category)
                        (fs/base-name filename))))))

To control this whole process, we'll use make-dirtree. This will take a collection of instances of UfoSighting and process them into separate text files. All of the files will be in the basedir directory, and then they'll be divided into a training set and a test set. These will be put into sibling directories under basedir as shown in the following code:

(defn make-dirtree [basedir training-ratio sightings]
  (doseq [dir [basedir (fs/file basedir "train")
               (fs/file basedir "test")]]
    (mk-cat-dirs dir))
  (let [outputs (map (make-dirtree-sighting basedir)
                     sightings (range))
        {:keys [hoax non-hoax]} (group-by :category
                                          (shuffle outputs))
        [hoax-train hoax-test] (into-sets training-ratio hoax)
        [nhoax-train nhoax-test] (into-sets
                                   training-ratio non-hoax)]
    (mv-stage basedir "train" (concat hoax-train nhoax-train))
    (mv-stage basedir "test" (concat hoax-test nhoax-test))))

Now, let's use this to divide our sightings data into groups and save them into the bayes-data directory as follows:

user=> (model/make-dirtree "bayes-data" 0.8 data)

We have the data now, and it's in a shape that MALLET can use. Let's look at how we're going to leverage that library for Naïve Bayesian classification.

Classifying the data

Bayesian inference can seem off-putting at first, but at its most basic level, it's how we tend to deal with the world. We start out with an idea of how likely something is, and then we update that expectation as we receive more information. In this case, depending on our background, training, history, and tendencies, we may think that all UFO reports are hoaxes or that most of them are. We may think that few UFO reports are hoaxes, or we may be completely undecided and assume that about half of them are hoaxes and half are true. But as we hear reports that we know the truth of, we change our opinions and expectations of the other reports. We may notice patterns, too. Hoaxes may talk about green men, while true reports may talk about grays. So you may also further refine your intuition based on that. Now, when you see a report that talks about little green men, you're more likely to think it's a hoax than when you see a report that talks about little gray men.

You may also notice that triangular UFOs are considered hoaxes, while circular UFOs are not. Now, when you read another document, this observation then further influences your beliefs about whether that document is a hoax or not.

In Bayesian terms, our original expectation of whether a document is a hoax is called the prior probability, and its notation is P(H), where H is the hypothesis that the document is a hoax. The updated expectation after seeing the color of the aliens in the description, C, is called the conditional probability, and its notation is P(C|H), which is read as the probability of C given H. In this case, it's the probability distribution over the aliens' color, given that the document is a hoax.

Bayes' theorem is a way of swapping the conditions for a set of conditional probabilities. That is, we can now find P(H|C), or the probability distribution over the document's being a hoax, given that the alien is green or gray.

The formula to do this is pretty simple. To compute the probability that the document is a hoax, given the aliens' color, consider the following conditions:

  • The probability of the aliens' color, given that the document is a hoax or not
  • The probability that the document is a hoax
  • The probability of the aliens' color.
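Combining these three quantities gives Bayes' theorem. With H the hypothesis that the document is a hoax and C the aliens' color:

    P(H|C) = P(C|H) P(H) / P(C)

That is, the posterior probability that a document is a hoax, given the aliens' color, is the likelihood of that color among hoaxes, weighted by how common hoaxes are overall, and normalized by how common that color is across all documents.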

For Naïve Bayesian classification, we make an important assumption: we assume that the features in a document are independent. This means that the probability that aliens are green or gray in a document is independent of whether the UFO is a disk or a triangle.

In spite of this assumption, Naïve Bayesian classifiers often work well in the real world. We can train them easily and quickly, and they classify new data quickly and often perform well enough to be useful.
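To make this concrete, here is a toy, hand-rolled version of the computation with made-up counts. This is purely illustrative and is not how MALLET implements it:

```clojure
;; Made-up counts of alien colors in hoax and non-hoax reports.
(def color-counts {:hoax     {"green" 8, "gray" 2}
                   :non-hoax {"green" 3, "gray" 27}})

;; Made-up prior probabilities for each category.
(def priors {:hoax 0.25, :non-hoax 0.75})

(defn likelihood
  "P(color | category), estimated from the counts."
  [category color]
  (let [counts (color-counts category)]
    (/ (counts color) (reduce + (vals counts)))))

(defn posterior
  "P(category | color) for each category, via Bayes' theorem."
  [color]
  (let [score (fn [cat] (* (priors cat) (likelihood cat color)))
        total (+ (score :hoax) (score :non-hoax))]
    {:hoax     (double (/ (score :hoax) total))
     :non-hoax (double (/ (score :non-hoax) total))}))

(posterior "green")
;; → {:hoax ~0.727, :non-hoax ~0.273} — "green" shifts us toward hoax,
;; even though the prior favored non-hoax
```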

So with that understanding, let's look at how MALLET handles Naïve Bayesian classification.

Coding the classifier interface

Before we begin the next part of this chapter, it's probably a good time to start a new namespace for the following code to live in. Let's put it into the src/ufo_data/bayes.clj file. The ns declaration is as follows:

(ns ufo-data.bayes
  (:require [clojure.java.io :as io])
  (:import [cc.mallet.types InstanceList]
           [cc.mallet.pipe Input2CharSequence
            TokenSequenceLowercase
            TokenSequenceRemoveStoplist
            CharSequence2TokenSequence SerialPipes
            SaveDataInSource Target2Label
            TokenSequence2FeatureSequence
            FeatureSequence2AugmentableFeatureVector]
           [cc.mallet.pipe.iterator FileIterator]
           [cc.mallet.classify NaiveBayesTrainer]
           [java.io ObjectInputStream ObjectOutputStream]))

With the preceding code in place, let's see what we need to do.

Setting up the Pipe and InstanceList

MALLET processes all input through Pipe. Pipes represent a series of transformations over the text. When you're working with a classifier, the data that's used for training, testing, and later for classifying new documents, all need to be put through the same pipe of processes. Also, all of them must use the same set of features and labels. MALLET calls these alphabets.

Each data document, at whatever stage of processing, is stored in an Instance object, and corpora of these are kept in InstanceList. Pipe objects are associated with InstanceList objects. This makes sure that all Instance objects in a collection are processed consistently.

In order to keep things straight, we'll define make-pipe-list. This will create the Pipe object as shown in the following code:

(defn make-pipe-list []
  (SerialPipes.
    [(Target2Label.)
     (SaveDataInSource.)
     (Input2CharSequence. "UTF-8")
     (CharSequence2TokenSequence. #"\p{L}[\p{L}\p{P}]+\p{L}")
     (TokenSequenceLowercase.)
     (TokenSequenceRemoveStoplist.)
     (TokenSequence2FeatureSequence.)
     (FeatureSequence2AugmentableFeatureVector. false)]))

This processing pipeline performs the following steps:

  1. Target2Label takes the category from the directory path and assigns it to the Instance object's label. Labels are the categories or classes used for classification.
  2. SaveDataInSource takes the path name, which is currently in the data property, and puts it into the Instance object's source property.
  3. Input2CharSequence reads the data from the filename and replaces it with the file's contents.
  4. CharSequence2TokenSequence tokenizes the file's contents.
  5. TokenSequenceLowercase converts all uppercase characters in the tokens to lowercase.
  6. TokenSequenceRemoveStoplist removes common English words so that the classifier can focus on content words.
  7. TokenSequence2FeatureSequence converts the sequence of tokens into a sequence of features. Each unique word is assigned a unique integer identifier.
  8. FeatureSequence2AugmentableFeatureVector converts the sequence of tokens into a vector. The token's feature identifier is that token's index in the feature vector.

MALLET's classifier expects feature vectors as input, so this is the appropriate pipeline to use.

Now we need to take an input directory, generate Instance objects from it, and associate their processing with a pipeline. In the following code, we'll use the add-input-directory function to do all of that:

(defn add-input-directory [dir-name pipe]
  (doto (InstanceList. pipe)
    (.addThruPipe
      (FileIterator. (io/file dir-name)
                     #".*/([^/]*?)/\d+\.txt$"))))

The regular expression in the last line takes the name of the file's directory and uses that as the Instance object's classification. We can use these two functions to handle the loading and processing of the inputs.
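Since the capture group supplies the classification, we can sanity-check what it extracts from a sample path with Clojure's re-find (the path here is hypothetical, following the layout we created earlier):

```clojure
;; The first group captures the file's immediate parent directory —
;; the category name.
(re-find #".*/([^/]*?)/\d+\.txt$" "bayes-data/train/hoax/42.txt")
;; → ["bayes-data/train/hoax/42.txt" "hoax"]
```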

Training

Training is pretty simple. We create an instance of NaiveBayesTrainer. Its train method returns an instance of NaiveBayes, which is the classifier. We'll wrap this in the following function to make it slightly easier to use:

(defn train [instance-list]
  (.train (NaiveBayesTrainer.) instance-list))

Wrapping it in this way provides a Clojure-native way of dealing with this library. It also keeps users of our module from needing to import NaiveBayesTrainer and the other classes from MALLET directly.

Classifying

Just like training, classifying is also easy. The classifier returned by the train function just defers to the classify method as follows:

(defn classify [bayes instance-list]
  (.classify bayes instance-list))

The preceding code will return an instance of cc.mallet.classify.Classification. This holds not only the best label and the probability associated with it, but also the probabilities of the other labels, along with the classifier and the document instance involved.

Validating

We can now train a classifier and run it on new documents. We'd like to be able to test it as well, by comparing our expectations from preclassified documents with how the classifier actually performs.

At the lowest level, we'll want to compare the expected classification with the actual classification and keep a count of each pairing of these values. We can do that with validate1. This gets the expected and actual labels, and it creates a vector pair of them. The confusion-matrix function then gets the frequency of those pairs as follows:

(defn validate1 [bayes instance]
  (let [c (.classify bayes instance)
        expected (.. c getInstance getTarget toString)
        actual (.. c getLabeling getBestLabel toString)]
    [expected actual]))
(defn confusion-matrix [classifier instances labels]
  (frequencies (map #(validate1 classifier %) instances)))

A confusion matrix is a table of the counts for each pairing of expected and actual labels: correctly classified instances (where expected and actual match), false positives (instances given a label they shouldn't have), and false negatives (instances denied a label they should have). This provides an easy-to-comprehend overview of a classifier's performance.
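Since confusion-matrix returns a plain map from [expected actual] pairs to counts, it's easy to summarize. Here's a small helper (not part of the chapter's code) that computes raw accuracy from such a map, shown with made-up counts:

```clojure
;; Overall accuracy from a confusion map of [expected actual] -> count.
(defn accuracy [c-matrix]
  (let [total   (reduce + (vals c-matrix))
        correct (reduce + (for [[[expected actual] n] c-matrix
                                :when (= expected actual)]
                            n))]
    (double (/ correct total))))

;; Made-up counts for illustration:
(accuracy {["hoax" "hoax"] 1, ["hoax" "non-hoax"] 3,
           ["non-hoax" "non-hoax"] 16})
;; → 0.85
```

Keep in mind that when one class dominates, raw accuracy like this can look impressive even if the classifier is useless for the minority class.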

Tying it all together

In the following code, we'll create a bayes function that creates, trains, and tests a classifier on a directory of data. It will return a hash map containing the confusion matrix along with the classifier and the Pipe object. Having the Pipe object available later will be necessary to run the classifier on more data in the future.

(defn bayes [training-dir testing-dir]
  (let [pipe (make-pipe-list)
        training (add-input-directory training-dir pipe)
        testing (add-input-directory testing-dir pipe)
        classifier (train training)
        labels (iterator-seq
                 (.iterator (.getLabelAlphabet classifier)))
        c-matrix (confusion-matrix classifier testing labels)]
    {:bayes classifier
     :pipe pipe
     :confusion c-matrix}))

Now that we have all the pieces in place, let's see how to run the classifier.

Running the classifier and examining the results

For this section, I've loaded the ufo-data.bayes namespace into the REPL and aliased it with the name bayes.

We can pass to the bayes function the test and training directories that we created from the sightings as shown in the following code:

user=> (def bayes-out
         (bayes/bayes "bayes-data/train" "bayes-data/test"))
user=> (:confusion bayes-out)
{["hoax" "non-hoax"] 83, ["non-hoax" "non-hoax"] 12102,
["non-hoax" "hoax"] 29}

Let's put this into a more traditional form for this information. The expected values have their labels across the top of the table, and the actual values have theirs down the side. Look at the following table:

                  Expected
                  Hoax    Non-hoax
    Actual
      Hoax           0          29
      Non-hoax      83       12102

Well, that seems pretty useless. Evidently, my previous skepticism was warranted. The classifier identified no hoaxes correctly: all 83 actual hoaxes were classified as non-hoaxes, and it incorrectly flagged 29 non-hoaxes as hoaxes (false positives).

But that's not all that we can learn about this. Instances of NaiveBayes also include a way to print out the top-weighted words for each category. Let's see what the top 10 words for each classification are:

user=> (.printWords (:bayes bayes-out) 10)

Feature probabilities hoax
apos 0.002311333180377461
lights 0.0022688454380911096
light 0.00217537240506114
object 0.0020988944689457082
sky 0.002081899372031169
quot 0.0015295587223086145
looked 0.0014360856892786434
craft 0.0011556665901887302
red 0.0011301739448169206
back 0.0010961837509878402

Feature probabilities non-hoax
object 0.016553223428401043
light 0.016198059821948316
apos 0.015460989114397925
lights 0.014296272431730976
sky 0.014028337606877127
quot 0.010350232305991571
bright 0.007963812802535785
time 0.007237239541481537
moving 0.007063281856688359
looked 0.007037538118852588

So the terms are in slightly different order, but the vocabulary describing hoaxes and non-hoaxes is almost identical. Both mention object, light, lights, sky, and looked. So, based on the features we've selected here (single-word tokens), it's not surprising that we didn't get good results.

However, the primary thing that we can learn is that hoaxes are considered to be extremely rare, and the decision that a sighting is a hoax or not is often based on external data. Consider the sighting quoted earlier. To support the judgment that the sighting is not a hoax, the commenter mentions that the witnesses have responsible jobs, even though that's not mentioned in the description itself.
