Getting prepared with data

As usual, now we need to clean up the data and put it into a shape that we can work with. The news article dataset in particular will require some attention, so let's tackle it first.

Working with news articles

The OANC is published in an XML format that includes a lot of information and annotations about the data. Specifically, the markup identifies the following:

  • Sections and chapters
  • Sentences
  • Words, with part-of-speech tags and lemmas
  • Noun chunks
  • Verb chunks
  • Named entities

However, we want the option to use raw text later when the system is actually being used. Because of that, we will ignore the annotations and just extract the raw tokens. In fact, all we're really interested in is each document's text—either as a raw string or a feature vector—and the date it was published. Let's create a record type for this.

We'll put this into the types.clj file in src/financial/. Put this simple namespace header into the file:

(ns financial.types)

This data record will be similarly simple. It can be defined as follows:

(defrecord NewsArticle [title pub-date text])
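As a quick sketch of working with this record, we can build one at the REPL and pull fields out with keywords (the values here are made up for illustration):

```clojure
;; Assuming the defrecord above; the field values are illustrative.
(defrecord NewsArticle [title pub-date text])

(def sample (->NewsArticle "Article247_99" nil "Once upon a time ..."))

(:title sample)        ; => "Article247_99"
(count (:text sample)) ; => 20
```

Records support keyword access just like maps, which is all we'll need downstream.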

So let's see what the XML looks like and what we need to do to get it to work with the data structures we just defined.

The Slate data is in the OANC-GrAF/data/written_1/journal/slate/ directory. The data files are spread through 55 subdirectories as follows:

$ ls d/OANC-GrAF/data/written_1/journal/slate/
. .. 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30 31 32
33 34 35 36 37 38 39 4 40 41 42 43 44 45 46 47 48 49 5 50 51 52 53 54 55 6 7 8
9

Digging in deeper, each document is represented by a number of files. From the slate directory, we can see the following details:

$ ls 1/Article247_99*
1/Article247_99-hepple.xml  1/Article247_99-s.xml       
1/Article247_99.txt
1/Article247_99-logical.xml 1/Article247_99-vp.xml
1/Article247_99-np.xml      1/Article247_99.anc

So the files with the xml extension are the various annotation files. The ANC file contains metadata about the document; we'll need it for the publication date and other information. Most importantly, there's also a .txt file containing the raw text of the document. That will make working with this dataset much easier!

But let's take a minute to write some functions that will help us work with each document's text and its metadata as an entity. These will represent the knowledge we've just gained about the directory and file structure of the OANC corpus.

We'll call this file src/financial/oanc.clj, and its namespace header should look as follows:

(ns financial.oanc
  (:require [clojure.data.xml :as xml]
            [clojure.java.io :as io]
            [clojure.string :as str]
            [me.raynes.fs :as fs]
            [clj-time.core :as clj-time]
            [clj-time.format :as time-format])
  (:use [financial types utils]))

If we examine the directory structure that the OANC uses, we can see that it's divided into a clear hierarchy. Let's trace that structure in the slate directory that we discussed earlier, OANC-GrAF/data/written_1/journal/slate/. In this example, written_1 represents a category, journal is a genre, and slate is a source. We can leverage this information as we walk the directory structure to get to the data files.

Our first bit of code contains four functions. Let's list them first, and then we can talk about them:

(defn list-category-genres [category-dir]
  (map #(hash-map :genre % :dirname (io/file category-dir %))
       (fs/list-dir category-dir)))
(defn list-genres [oanc-dir]
  (mapcat list-category-genres (ls (io/file oanc-dir "data"))))
(defn find-genre-dir [genre oanc-dir]
  (->> oanc-dir
    list-genres
    (filter #(= (:genre %) genre))
    first
    :dirname))
(defn find-source-data [genre source oanc-dir]
  (-> (find-genre-dir genre oanc-dir)
    (io/file source)
    (fs/find-files #".*\.anc")))

The functions used in the preceding code are described as follows:

  • The first of these functions, list-category-genres, takes a category directory (OANC-GrAF/data/written_1/) and returns the genres that it contains. This could be journal, as in our example here, or fiction, letters, or a number of other options. Each item returned is a hash map of the full directory and the name of the genre.
  • The second function is list-genres. It lists all of the genres within the OANC data directory.
  • The third function is find-genre-dir. It looks for one particular genre and returns the full directory for it.
  • Finally, we have find-source-data. This takes a genre and source and lists all of the files with an anc extension.

Using these functions, we can iterate over the documents for a source. We can see how to do that in the next function, find-slate-files, which returns a sequence of maps pointing to each document's metadata ANC file and to its raw text file, as shown in the following code:

(defn find-slate-files [oanc-dir]
  (map #(hash-map :anc % :txt (chext % ".txt"))
       (find-source-data "journal" "slate" oanc-dir)))

Now we can get at the metadata in the ANC file. We'll use the clojure.data.xml library to parse the file, and we'll define a couple of utility functions to make descending into the file easier. Look at the following code:

(defn find-all [xml tag-name]
  (lazy-seq
    (if (= (:tag xml) tag-name)
      (cons xml (mapcat #(find-all % tag-name) (:content xml)))
      (mapcat #(find-all % tag-name) (:content xml)))))
(defn content-str [xml]
  (apply str (filter string? (:content xml))))

The first utility function, find-all, lazily walks the XML document and returns all elements with a given tag name. The second function, content-str, returns all the text children of a tag.
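Since clojure.data.xml represents elements as maps with :tag and :content keys, we can exercise these two functions on a hand-built element tree, with no file I/O at all (this element is a made-up miniature of an ANC header):

```clojure
(defn find-all [xml tag-name]
  (lazy-seq
    (if (= (:tag xml) tag-name)
      (cons xml (mapcat #(find-all % tag-name) (:content xml)))
      (mapcat #(find-all % tag-name) (:content xml)))))

(defn content-str [xml]
  (apply str (filter string? (:content xml))))

;; A hand-built, clojure.data.xml-style element tree:
(def anc-xml
  {:tag :anc
   :content [{:tag :title   :content ["Article247_99"]}
             {:tag :pubDate :content ["2/13/97 4:30:00 PM"]}]})

(content-str (first (find-all anc-xml :pubDate)))
;; => "2/13/97 4:30:00 PM"
```

String children have no :tag or :content keys, so the keyword lookups simply return nil and the recursion bottoms out cleanly on them.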

Also, we'll need to parse the date from the pubDate elements. Some of these have a value attribute, but it isn't used consistently. Instead, we'll parse the elements' content directly using the clj-time library (https://github.com/clj-time/clj-time), which is a wrapper over the Joda-Time library for Java (http://joda-time.sourceforge.net/). We'll only need a few of its functions.

Before we do, though, we'll need to define a date format string. The dates inside the pubDate elements look like 2/13/97 4:30:00 PM. The formatting string, then, should look as follows:

(def date-time-format
     (time-format/formatter "M/d/yyyy h:mm:ss a"))

We can use this formatter to pull data out of a pubDate element and parse it into an org.joda.time.DateTime object as follows:

(defn parse-pub-date [pub-date-el]
  (time-format/parse date-time-format (content-str pub-date-el)))

Unfortunately, the two-digit years in some of these dates are parsed literally, leaving them off by 1,900 or 2,000 years. We can normalize the dates and correct these errors fairly quickly, as shown in the following code:

(defn norm-date [date]
  (cond
    (= (clj-time/year date) 0)
      (clj-time/plus date (clj-time/years 2000))
    (< (clj-time/year date) 100)
      (clj-time/plus date (clj-time/years 1900))
    :else date))
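The pivot logic here is independent of clj-time. As a side sketch (not part of the book's code; the function name norm-year is mine), the same normalization can be expressed with plain java.time interop from the standard library:

```clojure
(import 'java.time.LocalDate)

;; Same normalization rules as norm-date, but over java.time dates:
;; year 0 gets +2000, years 1-99 get +1900, everything else passes through.
(defn norm-year [^java.time.LocalDate date]
  (let [y (.getYear date)]
    (cond
      (zero? y) (.plusYears date 2000)
      (< y 100) (.plusYears date 1900)
      :else     date)))

(.getYear (norm-year (LocalDate/of 0 2 13)))    ; => 2000
(.getYear (norm-year (LocalDate/of 97 2 13)))   ; => 1997
(.getYear (norm-year (LocalDate/of 1999 3 9)))  ; => 1999
```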

With all of these parts in place, we can write a function that takes the XML from an ANC file and returns date and time for the publication date as follows:

(defn find-pub-date [anc-xml]
  (-> anc-xml
    (find-all :pubDate)
    first
    parse-pub-date
    norm-date))

The other piece of data that we'll load from the ANC metadata XML is the title. We get that from the title element, of course, as follows:

(defn find-title [anc-xml]
  (content-str (first (find-all anc-xml :title))))

Now, loading a NewsArticle object is straightforward. In fact, it's so simple that we'll also include a version of this that reads in the text from a plain file. Look at the following code:

(defn load-article [data-info]
  (let [{:keys [anc txt]} data-info
        anc-xml (xml/parse (io/reader anc))]
    (->NewsArticle (find-title anc-xml)
                   (find-pub-date anc-xml)
                   (slurp txt))))
(defn load-text-file [date filename]
  (->NewsArticle filename date (slurp filename)))

And using these functions to load all of the Slate articles just involves repeating the earlier steps, as shown in the following commands:

user=> (def articles (doall (map oanc/load-article
                                 (oanc/find-slate-files
                                   (io/file "d/OANC-GrAF")))))
user=> (count articles)
4531
user=> (let [a (first articles)]
         [(:title a) (:pub-date a) (count (:text a))])
["Article247_4" #<DateTime 1999-03-09T07:47:21.000Z> 3662]

The last command in the preceding code just prints the title, publication date, and the length of the text in the document.

With these functions in place, we now have access to the article dataset.

Working with stock data

Loading the news articles was complicated. Fortunately, the stock price data is in comma-separated values (CSV) format. Although not the richest data format, it is very popular, and clojure.data.csv (https://github.com/clojure/data.csv/) is an excellent library for loading it.

Still, since CSV is such a flat format, we'll want to convert the data into something richer, so we'll create a record type and some wrapper functions to make the data easier to work with as we read it in.

The data in this will closely follow the columns in the CSV file that we downloaded from Google Finance earlier. Open src/financial/types.clj again and add the following line to represent the data type for the stock data:

(defrecord StockData [date open high low close volume])

For the rest of the code in this section, we'll use a new namespace. Open the src/financial/csv_data.clj file and add the following namespace declaration:

(ns financial.csv-data
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [clj-time.core :as clj-time]
            [clj-time.format :as time-format])
  (:use [financial types utils]))

Just like the Slate news article data, this data also has a field with a date, which we'll need to parse. Unlike the Slate data, this value is formatted differently. Glancing at the first few lines of the file gives us all the information that we need, as follows:

Date,Open,High,Low,Close,Volume
29-Dec-00,33.47,33.56,33.09,33.50,857800
28-Dec-00,33.62,33.62,32.94,33.47,961200
27-Dec-00,33.50,33.97,33.19,33.56,992400
26-Dec-00,32.88,33.69,32.88,33.62,660600

To parse dates in this format (29-Dec-00), we can use the following format specification:

(def date-format (time-format/formatter "d-MMM-YY"))
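As a quick sanity check on this format, the same sample date parses with the standard library's java.time (a side sketch, not part of the book's code; note that java.time, unlike the Joda-Time pattern above, spells a two-digit year with lowercase yy, pivoting at 2000):

```clojure
(import '(java.time LocalDate)
        '(java.time.format DateTimeFormatter)
        '(java.util Locale))

;; The java.time equivalent of the pattern, using an English locale so
;; that "Dec" parses regardless of the JVM's default locale:
(def fmt (DateTimeFormatter/ofPattern "d-MMM-yy" Locale/ENGLISH))

(str (LocalDate/parse "29-Dec-00" fmt))  ; => "2000-12-29"
```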

Now, we build on this and a few other functions (which you can find in the code download, in the file src/financial/utils.clj) to create a StockData instance from a row of data, as shown in the following code:

(defn row->StockData [row]
  (let [[date open high low close vol] row]
    (->StockData (time-format/parse date-format date)
                 (->double open)
                 (->double high)
                 (->double low)
                 (->double close)
                 (->long vol))))

This is all straightforward. Basically, every value in the row must be converted to a native Clojure/Java type, and then all of those values are used to create the StockData instance.
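The conversion helpers ->double and ->long live in financial.utils, which we haven't listed here. Plausible minimal versions (an assumption on my part; see the code download for the real definitions) are just thin wrappers over the Java parsers:

```clojure
;; Hypothetical sketches of the helpers used in row->StockData; the
;; shipped versions in src/financial/utils.clj may differ.
(defn ->double [s] (Double/parseDouble s))
(defn ->long   [s] (Long/parseLong s))

(->double "33.47")  ; => 33.47
(->long "857800")   ; => 857800
```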

To read in an entire file, we just do this for every row returned by the CSV library as follows:

(defn read-stock-prices [filename]
  (with-open [f (io/reader filename)]
    (doall (map row->StockData (drop 1 (csv/read-csv f))))))

The only wrinkle is that we have to drop the first row, since it's the header.

And now, to load the data, we just call the following function (we've aliased the financial.csv-data namespace to csvd):

user=> (def sp (csvd/read-stock-prices "d/d-1995-2001.csv"))
user=> (first sp)
#financial.types.StockData{:date #<DateTime 2000-12-29T00:00:00.000Z>,
   :open 33.47, :high 33.56, :low 33.09, :close 33.5, :volume 857800}
user=> (count sp)
1263

Everything appears to be working correctly. Let's turn our attention back to the news article dataset and begin analyzing it.
