Extracting the data

Before we go further, let's look at the following Leiningen 2 (http://leiningen.org/) project.clj file that we'll use for this chapter:

(defproject ufo-data "0.1.0-SNAPSHOT"
  :plugins [[lein-cljsbuild "0.3.2"]]
  :profiles {:dev {:plugins [[com.cemerick/austin "0.1.0"]]}}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.json "0.2.2"]
                 [org.clojure/data.csv "0.1.2"]
                 [clj-time "0.5.1"]
                 [incanter "1.5.2"]
                 [cc.mallet/mallet "2.0.7"]
                 [me.raynes/fs "1.4.4"]]
  :cljsbuild
    {:builds [{:source-paths ["src-cljs"],
               :compiler {:pretty-printer true,
                          :output-to "www/js/main.js",
                          :optimizations :whitespace}}]})

As the dependencies in the preceding code show, over the course of this chapter we'll parse dates and times with the clj-time library (https://github.com/clj-time/clj-time), a rich, robust date and time library. We'll also use ClojureScript (https://github.com/clojure/clojurescript) for the visualizations.

Our first step in working with this data is to load it from the data file. To facilitate this, we'll read it into a record type that we'll define just to store the UFO sightings. We'll work with the model.clj file placed at src/ufo_data/. The following is a namespace declaration with the imports and requirements that we'll use in this module:

(ns ufo-data.model
  (:require [clojure.java.io :as io]
            [clojure.core.reducers :as r]
            [clojure.string :as str]
            [clojure.data.csv :as csv]
            [clojure.data.json :as json]
            [clj-time.format :as tf]
            [ufo-data.text :as t]
            [ufo-data.util :refer :all]
            [me.raynes.fs :as fs])
  (:import [java.lang StringBuffer]))

Now we'll define the record. It simply lists the same fields that we walked through earlier. We also include a few new fields, which we'll use to store the year, month, and season parsed from the reported_at field, as follows:

(defrecord UfoSighting
  [sighted-at reported-at location shape duration description
   year month season])
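
As a brief aside, defrecord generates two constructors for us: a positional one (->Name) and a map-based one (map->Name). Fields are then read with keywords, just as with a map. The following is a minimal sketch that uses a hypothetical two-field record, Sighting, which is not part of the chapter's code:

```clojure
;; Hypothetical record, used only for illustration.
(defrecord Sighting [location shape])

;; Positional constructor: arguments follow the declared field order.
(def s1 (->Sighting "Iowa City, IA" "disk"))

;; Map constructor: any field left out of the map is nil.
(def s2 (map->Sighting {:location "Milwaukee, WI"}))

;; Fields are read with keywords, as with an ordinary map.
(def s1-shape (:shape s1))
(def s2-shape (:shape s2))
```

The positional constructor is the one that the row-parsing code relies on, which is why the number of values passed to it has to match the number of fields exactly.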

Now, when we take a row from the TSV file, we'll need to parse it into one of these structures. Because each line of input has only six fields, we'll pad it out to nine fields. First, though, we'll verify that there are exactly six input fields. If there are more, we'll join the extra fields into the sixth one; if there are fewer, we'll pad the row with nil values, as shown in the following code:

(defn ->ufo [row]
  (let [row (cond
              (> (count row) 6)
                   (concat (take 5 row)
                      [(str/join \tab (drop 5 row))])
              (< (count row) 6)
                   (concat row (repeat (- 6 (count row)) nil))
              :else row)]
    (apply ->UfoSighting (concat row [nil nil nil]))))
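
To see the padding and joining logic in isolation, the following sketch pulls it out into a standalone helper. The name normalize-row and its n parameter are hypothetical and are introduced only for this illustration:

```clojure
(require '[clojure.string :as str])

(defn normalize-row
  "Hypothetical helper: forces row to have exactly n fields."
  [n row]
  (let [c (count row)]
    (cond
      ;; Too many fields: keep the first n-1 and join the rest with tabs.
      (> c n) (concat (take (dec n) row)
                      [(str/join \tab (drop (dec n) row))])
      ;; Too few fields: pad the end of the row with nil.
      (< c n) (concat row (repeat (- n c) nil))
      :else row)))

(def too-long (normalize-row 6 ["a" "b" "c" "d" "e" "f" "g"]))
(def too-short (normalize-row 6 ["a" "b" "c" "d"]))
```

Here, too-long ends with a single field that contains "f", a tab, and "g", while too-short ends with two nil values. Either way, the result has six fields.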

Some of the fields (the most important ones, actually) are dates, and we'll want to parse them into valid date objects. To do this, we'll use the excellent clj-time library (https://github.com/clj-time/clj-time). This provides a more "Clojuresque" interface for the Joda-Time library (http://joda-time.sourceforge.net/). The code that does this takes a custom date format and attempts to parse each date. If parsing fails, we just fall back to nil. Look at the following code:

(def date-formatter (tf/formatter "yyyyMMdd"))
(defn read-date [date-str]
  (try
    (tf/parse date-formatter date-str)
    (catch Exception ex
      nil)))
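
The same parse-or-nil pattern can be tried out at the REPL without any extra dependencies. The following sketch swaps in Java's built-in java.time classes in place of clj-time, so it is an illustration of the fallback behavior and not the chapter's implementation. The name parse-basic-date is hypothetical, and the sketch assumes Java 8 or later:

```clojure
(import '[java.time LocalDate]
        '[java.time.format DateTimeFormatter])

(defn parse-basic-date
  "Hypothetical stand-in for read-date: parses a yyyyMMdd string,
  returning nil for any input that cannot be parsed."
  [^String s]
  (try
    ;; BASIC_ISO_DATE is the JDK's predefined yyyyMMdd formatter.
    (LocalDate/parse s DateTimeFormatter/BASIC_ISO_DATE)
    (catch Exception _
      nil)))

(def good-date (parse-basic-date "19951009"))
(def bad-date (parse-basic-date "1995-10-09"))
(def missing-date (parse-basic-date nil))
```

Only good-date holds a date. The other two inputs raise an exception inside the try form, so they come back as nil, which lets later steps filter out bad rows without stopping the whole load.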

We use the following function to coerce the raw string date fields into the more useful date objects that Joda-Time provides:

(defn coerce-fields [ufo]
  (assoc ufo
         :sighted-at (read-date (:sighted-at ufo))
         :reported-at (read-date (:reported-at ufo))))
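
It is worth noting that assoc on a record returns a new record of the same type and leaves the original untouched. The following sketch shows this with a hypothetical two-field record, Report, and with Long/parseLong standing in for the date parsing:

```clojure
;; Hypothetical record, used only for illustration.
(defrecord Report [sighted-at reported-at])

(defn coerce-sketch
  "Replaces both string fields with parsed values in a single assoc call."
  [report]
  (assoc report
         :sighted-at (Long/parseLong (:sighted-at report))
         :reported-at (Long/parseLong (:reported-at report))))

(def raw (->Report "19951009" "19951011"))
(def cleaned (coerce-sketch raw))
```

After this runs, cleaned is still a Report and holds numbers, while raw still holds the original strings.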

That's all that we need to load the data. Now we can write the function that will actually take care of reading the data from the file on disk into a sequence of records, as follows:

(defn read-data
  [filename]
  (with-open [f (io/reader filename)]
    (->> (csv/read-csv f :separator \tab)
      vec
      (r/map ->ufo)
      (r/map coerce-fields)
      (into []))))
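
The calls to vec and into matter here: they force all of the rows to be read while the file is still open, before with-open closes the reader. The following dependency-free sketch shows the same shape. It reads from an in-memory string and splits each line on tabs with clojure.string in place of clojure.data.csv, and the name read-tsv-sketch is hypothetical:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str]
         '[clojure.core.reducers :as r])
(import '[java.io StringReader])

(defn read-tsv-sketch
  "Reads tab-separated lines from source into a vector of vectors."
  [source]
  (with-open [rdr (io/reader source)]
    (->> (line-seq rdr)
         vec                                 ; realize every line now
         (r/map #(str/split % #"\t" -1))     ; -1 keeps trailing empty fields
         (into []))))                        ; run the reducer eagerly

(def sample-rows
  (read-tsv-sketch (StringReader. "a\tb\tc\nd\te\tf\n")))
```

If the pipeline returned a lazy sequence instead, the reader would already be closed by the time a caller tried to consume it.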

Now that we can read in the data, we can start picking it apart and learn about the data that we have.
