As usual, now we need to clean up the data and put it into a shape that we can work with. The news article dataset particularly will require some attention, so let's turn our attention to it first.
The OANC is published in an XML format that includes a lot of information and annotations about the data. Judging from the annotation files we'll see shortly, these mark off sentence boundaries, noun and verb phrases, and part-of-speech tags, among other things.
However, we want the option to use raw text later when the system is actually being used. Because of that, we will ignore the annotations and just extract the raw tokens. In fact, all we're really interested in is each document's text—either as a raw string or a feature vector—and the date it was published. Let's create a record type for this.
We'll put this into the types.clj file in the src/financial/ directory. Put this simple namespace header into the file:
(ns financial.types)
This data record will be similarly simple. It can be defined as follows:
(defrecord NewsArticle [title pub-date text])
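Before moving on, it may help to see the record in use. Here is a quick, hypothetical sketch (the values are invented for illustration; in practice, pub-date would be an org.joda.time.DateTime):

```clojure
(ns financial.types)

(defrecord NewsArticle [title pub-date text])

;; Construct an instance with made-up values.
(def example (->NewsArticle "Example title" nil "Some article text."))

;; Record fields are accessed like plain map keys.
(:title example)   ; => "Example title"
(:text example)    ; => "Some article text."
```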
So let's see what the XML looks like and what we need to do to get it to work with the data structures we just defined.
The Slate data is in the OANC-GrAF/data/written_1/journal/slate/
directory. The data files are spread through 55 subdirectories as follows:
$ ls d/OANC-GrAF/data/written_1/journal/slate/
.   ..  1   10  11  12  13  14  15  16  17  18  19  2   20
21  22  23  24  25  26  27  28  29  3   30  31  32  33  34
35  36  37  38  39  4   40  41  42  43  44  45  46  47  48
49  5   50  51  52  53  54  55  6   7   8   9
Digging in deeper, each document is represented by a number of files. From the slate
directory, we can see the following details:
$ ls 1/Article247_99*
1/Article247_99-hepple.xml   1/Article247_99-s.xml   1/Article247_99.txt
1/Article247_99-logical.xml  1/Article247_99-vp.xml
1/Article247_99-np.xml       1/Article247_99.anc
So we can see that the annotation files are the ones with the .xml extension. The .anc file contains metadata about the document, and we'll need to access it for the date and other information. Most importantly, there's also a .txt file containing the raw text of the document. That will make working with this dataset much easier!
But let's take a minute to write some functions that will help us work with each document's text and its metadata as an entity. These will represent the knowledge we've just gained about the directory and file structure of the OANC corpus.
We'll call this file src/financial/oanc.clj
, and its namespace header should look as follows:
(ns financial.oanc
  (:require [clojure.data.xml :as xml]
            [clojure.java.io :as io]
            [clojure.string :as str]
            [me.raynes.fs :as fs]
            [clj-time.core :as clj-time]
            [clj-time.format :as time-format])
  (:use [financial types utils]))
If we examine the directory structure that the OANC uses, we can see that it's divided into a clear hierarchy. Let's trace that structure in the slate
directory that we discussed earlier, OANC-GrAF/data/written_1/journal/slate/
. In this example, written_1
represents a category, journal
is a genre, and slate
is a source. We can leverage this information as we walk the directory structure to get to the data files.
Our first bit of code contains four functions. Let's list them first, and then we can talk about them:
(defn list-category-genres [category-dir]
  (map #(hash-map :genre % :dirname (io/file category-dir %))
       (fs/list-dir category-dir)))

(defn list-genres [oanc-dir]
  (mapcat list-category-genres (ls (io/file oanc-dir "data"))))

(defn find-genre-dir [genre oanc-dir]
  (->> oanc-dir
       list-genres
       (filter #(= (:genre %) genre))
       first
       :dirname))

(defn find-source-data [genre source oanc-dir]
  (-> (find-genre-dir genre oanc-dir)
      (io/file source)
      (fs/find-files #".*\.anc")))
The functions used in the preceding code are described as follows:

- list-category-genres takes a category directory (OANC-GrAF/data/written_1/) and returns the genres that it contains. These could be journal, as in our example here, or fiction, letters, or a number of other options. Each item returned is a hash map of the full directory and the name of the genre.
- list-genres lists all of the genres within the OANC data directory.
- find-genre-dir looks for one particular genre and returns the full directory for it.
- find-source-data takes a genre and a source and lists all of the files with an .anc extension.

Using these functions, we can iterate over the documents for a source. We can see how to do that in the next function, find-slate-files, which returns a sequence of maps pointing to each document's metadata ANC file and to its raw text file, as shown in the following code:
(defn find-slate-files [oanc-dir]
  (map #(hash-map :anc % :txt (chext % ".txt"))
       (find-source-data "journal" "slate" oanc-dir)))
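Note that chext here, like ls earlier, comes from the financial.utils namespace, whose source isn't shown in this section. A minimal sketch of what such helpers might look like (these definitions are my assumptions, not the actual code from the code download):

```clojure
(require '[clojure.java.io :as io])

(defn ls
  "Assumed helper: list a directory's contents as java.io.File objects."
  [dir]
  (seq (.listFiles (io/file dir))))

(defn chext
  "Assumed helper: replace a file's extension, e.g. (chext f \".txt\")."
  [file new-ext]
  (let [path (str file)
        dot  (.lastIndexOf path ".")]
    (io/file (str (subs path 0 dot) new-ext))))
```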
Now we can get at the metadata in the ANC file. We'll use the clojure.data.xml
library to parse the file, and we'll define a couple of utility functions to make descending into the file easier. Look at the following code:
(defn find-all [xml tag-name]
  (lazy-seq
    (if (= (:tag xml) tag-name)
      (cons xml (mapcat #(find-all % tag-name) (:content xml)))
      (mapcat #(find-all % tag-name) (:content xml)))))

(defn content-str [xml]
  (apply str (filter string? (:content xml))))
The first utility function, find-all
, lazily walks the XML document and returns all elements with a given tag name. The second function, content-str
, returns all the text children of a tag.
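As a quick sanity check, here's a hypothetical snippet that runs these two utilities over a toy XML document (the document string is invented for illustration):

```clojure
(require '[clojure.data.xml :as xml])

;; find-all and content-str as defined in the text.
(defn find-all [xml tag-name]
  (lazy-seq
    (if (= (:tag xml) tag-name)
      (cons xml (mapcat #(find-all % tag-name) (:content xml)))
      (mapcat #(find-all % tag-name) (:content xml)))))

(defn content-str [xml]
  (apply str (filter string? (:content xml))))

;; Parse a toy document and pull the text out of one of its tags.
(def sample (xml/parse-str "<doc><meta><title>Hello</title></meta></doc>"))

(content-str (first (find-all sample :title)))
;; => "Hello"
```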
Also, we'll need to parse the date from the pubDate
elements. Some of these have a value
attribute, but this isn't consistent. Instead, we'll parse the elements' content directly using the clj-time
library (https://github.com/clj-time/clj-time), which is a wrapper over the Joda time library for Java (http://joda-time.sourceforge.net/). We'll only need a few of its functions.
Before we do, though, we'll need to define a date format string. The dates inside the pubDate elements look like 2/13/97 4:30:00 PM. The formatting string, then, should look as follows:
(def date-time-format (time-format/formatter "M/d/yyyy h:mm:ss a"))
We can use this formatter to pull data out of a pubDate
element and parse it into an org.joda.time.DateTime
object as follows:
(defn parse-pub-date [pub-date-el]
  (time-format/parse date-time-format (content-str pub-date-el)))
Unfortunately, some of these dates are about 2000 years off. We can normalize the dates and correct these errors fairly quickly, as shown in the following code:
(defn norm-date [date]
  (cond
    (= (clj-time/year date) 0) (clj-time/plus date (clj-time/years 2000))
    (< (clj-time/year date) 100) (clj-time/plus date (clj-time/years 1900))
    :else date))
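To see how the normalization behaves, here is a hypothetical probe (the sample date is made up; the clj-time alias matches the namespace header earlier):

```clojure
(require '[clj-time.core :as clj-time])

;; norm-date as defined in the text.
(defn norm-date [date]
  (cond
    (= (clj-time/year date) 0) (clj-time/plus date (clj-time/years 2000))
    (< (clj-time/year date) 100) (clj-time/plus date (clj-time/years 1900))
    :else date))

;; A two-digit year such as 97 gets shifted into the 1900s.
(clj-time/year (norm-date (clj-time/date-time 97 2 13)))
;; => 1997
```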
With all of these parts in place, we can write a function that takes the XML from an ANC file and returns date and time for the publication date as follows:
(defn find-pub-date [anc-xml]
  (-> anc-xml
      (find-all :pubDate)
      first
      parse-pub-date
      norm-date))
The other piece of data that we'll load from the ANC metadata XML is the title. We get that from the title
element, of course, as follows:
(defn find-title [anc-xml]
  (content-str (first (find-all anc-xml :title))))
Now, loading a NewsArticle
object is straightforward. In fact, it's so simple that we'll also include a version of this that reads in the text from a plain file. Look at the following code:
(defn load-article [data-info]
  (let [{:keys [anc txt]} data-info
        anc-xml (xml/parse (io/reader anc))]
    (->NewsArticle (find-title anc-xml)
                   (find-pub-date anc-xml)
                   (slurp txt))))

(defn load-text-file [date filename]
  (->NewsArticle filename date (slurp filename)))
And using these functions to load all of the Slate articles just involves repeating the earlier steps, as shown in the following commands:
user=> (def articles
         (doall (map oanc/load-article
                     (oanc/find-slate-files (io/file "d/OANC-GrAF")))))
user=> (count articles)
4531
user=> (let [a (first articles)]
         [(:title a) (:pub-date a) (count (:text a))])
["Article247_4" #<DateTime 1999-03-09T07:47:21.000Z> 3662]
The last command in the preceding code just prints the title, publication date, and the length of the text in the document.
With these functions in place, we now have access to the article dataset.
Loading the news articles was complicated. Fortunately, the stock price data is in
comma-separated values (CSV) format. Although not the richest data format, it is very popular, and clojure.data.csv
(https://github.com/clojure/data.csv/) is an excellent library for loading it.
As I just mentioned, though, CSV isn't the richest data format. We will want to convert this data into a richer format, so we'll still create a record type and some wrapper functions to make it easier to work with the data as we read it in.
The data in this will closely follow the columns in the CSV file that we downloaded from Google Finance earlier. Open src/financial/types.clj
again and add the following line to represent the data type for the stock data:
(defrecord StockData [date open high low close volume])
For the rest of the code in this section, we'll use a new namespace. Open the src/financial/csv_data.clj file and add the following namespace declaration:
(ns financial.csv-data
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [clj-time.core :as clj-time]
            [clj-time.format :as time-format])
  (:use [financial types utils]))
Just like the Slate news article data, this data also has a date field, which we'll need to parse. Unlike the Slate data, though, it uses a different date format. Glancing at the first few lines of the file gives us all the information that we need, as follows:
Date,Open,High,Low,Close,Volume
29-Dec-00,33.47,33.56,33.09,33.50,857800
28-Dec-00,33.62,33.62,32.94,33.47,961200
27-Dec-00,33.50,33.97,33.19,33.56,992400
26-Dec-00,32.88,33.69,32.88,33.62,660600
To parse dates in this format (29-Dec-00), we can use the following format specification:
(def date-format (time-format/formatter "d-MMM-YY"))
Now, we build on this and a few other functions (which you can find in the code download in the file src/financial/utils.clj) to create a StockData instance from a row of data, as shown in the following code:
(defn row->StockData [row]
  (let [[date open high low close vol] row]
    (->StockData (time-format/parse date-format date)
                 (->double open)
                 (->double high)
                 (->double low)
                 (->double close)
                 (->long vol))))
This is all straightforward. Basically, every value in the row must be converted to a native Clojure/Java type, and then all of those values are used to create the StockData
instance.
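The ->double and ->long conversion helpers come from financial.utils, whose code isn't shown in this section. A plausible minimal sketch, assuming they simply wrap the Java number parsers (these definitions are my assumptions, not the actual code download):

```clojure
(defn ->double
  "Assumed helper from financial.utils: parse a string as a Double."
  [s]
  (Double/parseDouble s))

(defn ->long
  "Assumed helper from financial.utils: parse a string as a Long."
  [s]
  (Long/parseLong s))

(->double "33.47")  ; => 33.47
(->long "857800")   ; => 857800
```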
To read in an entire file, we just do this for every row returned by the CSV library as follows:
(defn read-stock-prices [filename]
  (with-open [f (io/reader filename)]
    (doall
      (map row->StockData (drop 1 (csv/read-csv f))))))
The only wrinkle is that we have to drop the first row, since it's the header.
And now, to load the data, we just call the following function (we've aliased the financial.csv-data
namespace to csvd
):
user=> (def sp (csvd/read-stock-prices "d/d-1995-2001.csv"))
user=> (first sp)
#financial.types.StockData{:date #<DateTime 2000-12-29T00:00:00.000Z>,
                           :open 33.47, :high 33.56, :low 33.09,
                           :close 33.5, :volume 857800}
user=> (count sp)
1263
Everything appears to be working correctly. Let's turn our attention back to the news article dataset and begin analyzing it.