Setting up the basics

Before we really dig into the project and the data, we need to prepare. We'll set up the code and the library, and then we'll download the data.

Setting up the library

First, we'll need to initialize the project. We can do this using Leiningen 2 (http://leiningen.org/) and Stuart Sierra's reloaded template for it (https://github.com/stuartsierra/reloaded). This will set up both the project and its development environment.

To do this, just execute the following command at the prompt (I've named the project financial in this case):

lein new reloaded financial

Now, we can specify the libraries that we'll need to use. We can do this in the project.clj file. Open it and replace its current contents with the following lines:

(defproject financial "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.xml "0.0.7"]
                 [org.clojure/data.csv "0.1.2"]
                 [clj-time "0.6.0"]
                 [me.raynes/fs "1.4.4"]
                 [org.encog/encog-core "3.1.0"]
                 [enclog "0.6.3"]]
  :profiles {:dev {:dependencies [[org.clojure/tools.namespace "0.2.4"]]
                   :source-paths ["dev"]}})

The primary library that we'll use is Enclog (https://github.com/jimpil/enclog). This is a Clojure wrapper around the Java library Encog (http://www.heatonresearch.com/encog), which is a machine learning library, including classes for artificial neural networks.

We now have the basics in place. We can get the data at this point.

Getting the data

We'll need data from two different sources. To begin with, we'll focus on getting the stock data.

In this case, we're going to use the historical stock data for Dominion Resources, Inc. They're a power company that operates in the eastern United States. Their New York Stock Exchange symbol is D. Focusing on one stock like this will reduce possible noise and allow us to focus on the simple system that we'll be working on in this chapter.

To download the stock data, I went to Google Finance (https://finance.google.com/). In the search box, I entered NYSE:D. On the left-hand side menu bar, there is an option to download Historical prices. Click on it.

In the table header, set the date range to be from Sept 1, 1995 to Jan 1, 2001. Refer to the following screenshot as an example:

[Screenshot: Google Finance historical prices page for NYSE:D with the date range set]

If you look at the lower-right corner of the screenshot, there's a link that reads Download to spreadsheet. Click on this link to download the data. By default, the filename is d.csv. I moved it into a directory named d inside my project folder and renamed it to d-1995-2001.csv.
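As a quick sanity check on the download, the file can be read from the REPL with the data.csv library that we added to project.clj. The following is only a minimal sketch: the function names read-stock-rows and read-stock-file are hypothetical, and it assumes that the first line of the file is a header row, whose column names it uses as keys without hard-coding any of them.

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn read-stock-rows
  "Hypothetical helper: parses CSV from a reader or a string and returns
  a vector of maps, one per data row, keyed by the header row's names.
  Assumes the first row is a header."
  [source]
  (let [[header & rows] (csv/read-csv source)
        ks (map keyword header)]
    ;; mapv is eager, so every row is realized before the reader closes.
    (mapv #(zipmap ks %) rows)))

(defn read-stock-file
  "Hypothetical helper: opens the file at path and parses it with
  read-stock-rows."
  [path]
  (with-open [r (io/reader path)]
    (read-stock-rows r)))
```

With the file in place, (count (read-stock-file "d/d-1995-2001.csv")) gives the number of data rows, and (first (read-stock-file "d/d-1995-2001.csv")) shows which columns the download actually contains. All values come back as strings; dates and prices still need to be parsed.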

We'll also need some news article data to correlate with the stock data. Freely available news articles are difficult to come by. There are good corpora available for modest fees (several hundred dollars). However, in order to make this exercise as accessible as possible, I've limited the data to what's freely available.

At the moment, the best collection appears to be the journalism segment of the Open American National Corpus (http://www.anc.org/data/oanc/). The American National Corpus (ANC) is a collection of texts from a variety of registers and genres, assembled for linguistic research. The Open ANC (OANC) is the subset of the ANC that is available for open access downloading. The journalism genre is represented by articles from Slate (http://www.slate.com/). This has some benefits and introduces some problems. The primary benefit is that the data will be quite manageable. The primary problem is that we won't have a lot of documents to use for training and testing, so we'll need to be pickier about what features we pull from the documents. We'll see how we need to handle this later.

To download the dataset, visit the download page at http://www.anc.org/data/oanc/download/ and get the data in your preferred format, either a TAR ball or a ZIP file. I decompressed that data into the d directory. It created a directory named OANC-GrAF that contained the data.

Your d directory should now look something like the following:

[Screenshot: contents of the d directory]
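The layout can also be checked from the REPL. The following is a rough sketch that uses only clojure.java.io; the function names list-entries and missing-entries are hypothetical, and the two expected entries are the ones named earlier in this section (the renamed stock file and the OANC-GrAF directory).

```clojure
(require '[clojure.java.io :as io])

(defn list-entries
  "Hypothetical helper: returns the sorted names of the immediate
  children of dir. Returns an empty vector if dir does not exist."
  [dir]
  (->> (.listFiles (io/file dir))
       (map #(.getName %))
       sort
       vec))

(defn missing-entries
  "Hypothetical helper: returns the names in expected that are not
  present directly under dir."
  [dir expected]
  (let [present (set (list-entries dir))]
    (vec (remove present expected))))

;; Run from the project folder. An empty vector means that both
;; downloads are where the rest of the chapter expects them.
(missing-entries "d" ["d-1995-2001.csv" "OANC-GrAF"])
```

If either name comes back in the result, check that the stock file was renamed and that the corpus was decompressed inside the d directory rather than next to it.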