Predicting the future

Now it's time to bring together everything that we've assembled over the course of this chapter, so it seems appropriate to start over from scratch, using only the Clojure source code that we've written along the way.

We'll take this one block at a time, loading and processing the data, creating training and test sets, training and validating the neural network, and finally viewing and analyzing its results.

Before we do any of this, we'll need to load the proper namespaces into the REPL. We can do that with the following require statement:

user=> (require
         '[me.raynes.fs :as fs]
         '[clojure.java.io :as io]
         '[clj-time.core :as time]
         '[financial]
         '[financial.types :as t]
         '[financial.nlp :as nlp]
         '[financial.nn :as nn]
         '[financial.oanc :as oanc]
         '[financial.csv-data :as csvd]
         '[financial.utils :as u])

This will give us access to everything that we've implemented so far, along with the file and date utilities (clojure.java.io and clj-time) that we'll need later in this section.

Loading stock prices

First, we'll load the stock prices with the following commands:

user=> (def stocks (csvd/read-stock-prices "d/d-1996-2001.csv"))
user=> (def stock-index (nn/index-by :date stocks))

The preceding code loads the stock prices from the CSV file and indexes them by date. This will make it easy to integrate them with the news article data in a few steps.
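
The index-by function comes from our financial.nn namespace; conceptually, it just builds a map keyed by whatever the key function returns. The following is a minimal sketch of the idea (the actual implementation may differ in details):

(defn index-by
  "Returns a map from (f item) to item for each item in coll."
  [f coll]
  (into {} (map (juxt f identity)) coll))

With :date as the key function, looking up the prices for a given day becomes a simple map lookup.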

Loading news articles

Now we can load the news articles. We'll need two pieces of data from them: the TF-IDF scaled frequencies and the vocabulary list. Look at the following commands:

user=> (def slate (doall
                    (map oanc/load-article
                         (oanc/find-slate-files
                           (io/file "d/OANC-GrAF")))))
user=> (def corpus (nlp/process-articles slate))
user=> (def freqs (nlp/tf-idf-all corpus))
user=> (def vocab (nlp/get-vocabulary corpus))

This code binds the frequencies as freqs and the vocabulary as vocab.

Creating training and test sets

Since we bundled the entire process into one function, merging our two data sources together into one training set is simple, as shown in the following command:

user=> (def training
         (nn/make-training-set stock-index vocab freqs))

Now, for each article, we have an input vector and a series of outputs for the stock prices related to the article's date.
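
We can spot-check the training set at the REPL. The exact shape of each element depends on nn/make-training-set, so treat this as a quick sanity check rather than a specification of the format:

user=> (count training)   ; one training example per article
user=> (first training)   ; the input features and outputs for one article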

Finding the best parameters for the neural network

The training data and the parameters' value ranges are the input for exploring the network parameter space. Look at the following commands:

user=> (def error-rates (ref {}))
user=> (nn/explore-params error-rates (count vocab) training)

This takes a very long time to run. In fact, after watching the output it was producing, I realized that it wouldn't be able to predict well beyond a day or two, so I stopped it early. Thanks to my decision to pass in a reference, I was able to stop it and still have access to the results generated up to that point.
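
Interrupting the REPL by hand works, but if you would rather be able to stop the search programmatically, one option (not what we did above, just a convenience) is to run it on a future and cancel it once the partial results look sufficient. The ref accumulates results either way:

;; Run the parameter search on another thread so it can be cancelled,
;; assuming the search responds to thread interruption.
(def explore-worker
  (future (nn/explore-params error-rates (count vocab) training)))

;; Later, once @error-rates has what we need:
(future-cancel explore-worker)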

The output is a map from pairs of the prediction period and the number of hidden nodes to a list of SSE (sum of squared errors) values, one from each partition in the K-fold cross-validation. A more meaningful metric is the average of these errors. We can compute that and print the results as follows:

user=> (def error-means
         (into {}
               (map #(vector (first %) (u/mean (second %)))
                    @error-rates)))
user=> (pprint (sort-by second error-means))
([[#<Days P1D> 10] 1.0435393]
 [[#<Days P1D> 5] 1.5253379]
 [[#<Days P1D> 25] 5.0099998]
 [[#<Days P1D> 50] 32.00977]
 [[#<Days P1D> 100] 34.264244]
 [[#<Days P1D> 200] 60.73007]
 [[#<Days P1D> 300] 100.29568])

So the sum of squared errors for predicting one day ahead goes from about 1 with 10 hidden units to about 100 with 300 hidden units. Based on that, we'll train a network to predict one day into the future, using 10 hidden nodes.
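
As an aside, the u/mean helper used above is nothing exotic; something along these lines would do (a sketch only, and the version in financial.utils may differ):

(defn mean
  "The arithmetic mean of a collection of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))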

Training and validating the neural network

Actually, training the neural network is pretty easy from our end, but it can take a while. The following commands should produce somewhat better results than we saw before, at the cost of some time. The training may not actually take that long, but we should be prepared for it to.

user=> (def nn (nn/make-network (count vocab) 10))
user=> (def day1 (first nn/periods))
user=> (nn/train-for nn day1 training)
Iteration # 1 Error: 22.025400% Target-Error: 1.000000%
Iteration # 2 Error: 19.332094% Target-Error: 1.000000%
Iteration # 3 Error: 14.241920% Target-Error: 1.000000%
Iteration # 4 Error: 6.283643% Target-Error: 1.000000%
Iteration # 5 Error: 0.766110% Target-Error: 1.000000%

Well, that was quick.

This gives us a trained, ready-to-use neural network bound to the name nn.
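
Since training takes time, it's worth saving the result. If the object bound to nn is a standard Encog BasicNetwork (an assumption about the internals of financial.nn), Encog's directory persistence can write it to a file of our choosing and read it back later:

;; Assumes nn is a plain Encog BasicNetwork; if financial.nn wraps the
;; network in something else, pull out the underlying network first.
(import '[org.encog.persist EncogDirectoryPersistence])

(EncogDirectoryPersistence/saveObject (io/file "d/nn-day1.eg") nn)

;; Later, in a fresh session:
(def nn (EncogDirectoryPersistence/loadObject (io/file "d/nn-day1.eg")))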

Running the network on new data

We can now run our trained network on some new data. Just to have something to look at, I downloaded 10 articles from the Slate website and saved them to files in the directory d/slate/. I also downloaded the stock prices for Dominion, Inc.

Now, how would I analyze this data?

Before we really start, we'll need to pull some data from the processes we've been using, and we'll need to set up some reference values, such as the date of the documents. Look at the following code:

(def idf-cache (nlp/get-idf-cache corpus))
(def sample-day (time/date-time 2014 3 20 0 0 0))
(def used-vocab (set (map first idf-cache)))

So we get the IDF cache, the date the articles were downloaded on, and the vocabulary that we used in training. That vocabulary set will serve as the token whitelist for loading the news articles.
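
Using a set as the whitelist works nicely because Clojure sets are functions of their members. Conceptually, the filtering that nlp/load-text-files performs amounts to something like the following (the tokens here are made up purely for illustration):

;; A set used as a predicate keeps only tokens in the training vocabulary.
(filter used-vocab ["market" "stocks" "xyzzy-not-in-vocab"])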

Let's see how to get the documents ready to analyze. Look at the following code:

(def articles (doall
                (->> "d/slate/"
                  fs/list-dir
                  (map #(str "d/slate/" %))
                  (map #(oanc/load-text-file sample-day %))
                  (nlp/load-text-files used-vocab idf-cache))))

This is a little more complicated than it was when we loaded the articles earlier. Basically, we read the directory listing and load the text of each file. Then we tokenize and filter each document before computing the TF-IDF value for each token.

On the other hand, reading the stock prices is very similar to what we did before. Look at the following code:

(def recent-stocks (csvd/read-stock-prices "d/d-2013-2014.csv"))
(def recent-index (nn/index-by :date recent-stocks))

With these in hand, we can put both together to make the input vectors as shown in the following code:

(def inputs
  (map #(nn/make-feature-vector recent-index used-vocab %)
       articles))

Now let's run the network and see what happens. Look at the following:

user=> (pprint
         (flatten
           (map vec
                (map #(nn/run-network nn %) inputs))))
(0.5046613110846201
 0.5046613110846201
 0.5046613135395166
 0.5046613110846201
 0.5046613110846201
 0.5046613110846201
 0.5046613110846201
 0.5046613110846201
 0.5046613112651592
 0.5046613110846201)

These outputs are very consistent. To quite a few decimal places, they're all clustered right around 0.5. Since the outputs come from a sigmoid activation, this means that the network doesn't really anticipate a change in the stock price over the next day.
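
To see why 0.5 is the "no signal" value, recall that the logistic sigmoid maps an input of zero to exactly 0.5, the midpoint of its (0, 1) range, so outputs hovering there correspond to essentially no activation at the output node. A quick sketch:

(defn sigmoid
  "The standard logistic sigmoid."
  [x]
  (/ 1.0 (+ 1.0 (Math/exp (- x)))))

(sigmoid 0.0)   ; => 0.5, the midpoint of the range
(sigmoid 2.0)   ; => roughly 0.88, a much stronger signal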

In fact, this tracks what actually happened fairly well. On March 20, the stock closed at $69.77, and on March 21, it closed at $70.06. This was a gain of only $0.29.
