Merging text and stock features

Before we can train the neural network, we need to decide how to represent the data and what information the network needs.

The code for this section goes in the src/financial/nn.clj file. Open it and add the following namespace header:

(ns financial.nn
  (:require [clj-time.core :as time]
            [clj-time.coerce :as time-coerce]
            [clojure.java.io :as io]
            [enclog.nnets :as nnets]
            [enclog.training :as training]
            [financial.utils :as u]
            [financial.validate :as v])
  (:import [org.encog.neural.networks PersistBasicNetwork]))

First, though, we need to be clear about what we're trying to do. That will allow us to properly format and present the data.

Let's break it down like this: for each document, based on the previous stock prices and the tokens in the document, can we predict the direction of future stock prices?

So one set of features will be the tokens in the document. We identified those earlier. Other features can represent the stock prices. Since we're interested in the direction of future prices, we can use the difference between the stock price at a point in the past and the price on the day the article was published. Offhand, we're not sure which time frames will be helpful, so we can select several and include them all.

The output is another difference in stock prices. Again, we don't know which time horizon, if any, will give good results, so we'll look into the future at various distances.

For the ranges of time, we'll use some standard time periods, gradually getting further and further out: a day, two days, three days, four days, five days, two weeks, three weeks, one month, two months, six months, and one year. Days that fall on a weekend have the value of the previous business day. Months will be 30 days, and a year is 365 days. This way, the time periods will be more or less regular.

We can represent those periods in Clojure using the clj-time library (https://github.com/clj-time/clj-time) as follows:

(def periods [(time/days 1)
              (time/days 2)
              (time/days 3)
              (time/days 4)
              (time/days 5)
              (time/days (* 7 2))
              (time/days (* 7 3))
              (time/days 30)
              (time/days (* 30 2))
              (time/days (* 30 6))
              (time/days 365)])
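
As a quick check of how these periods behave, the following minimal sketch shifts a publication date backward and forward by the 30-day period. The date is hypothetical and chosen only for illustration; time/date-time, time/minus, and time/plus come from clj-time.

```clojure
(require '[clj-time.core :as time])

;; Hypothetical publication date, chosen only for illustration.
(def pub-date (time/date-time 2014 3 10))

;; Looking 30 days into the past gives February 8, 2014.
(def month-before (time/minus pub-date (time/days 30)))

;; Looking 30 days into the future gives April 9, 2014.
(def month-after (time/plus pub-date (time/days 30)))
```

Because a month is treated as a fixed 30 days, the offsets do not line up with calendar months, which is the regularity the text is after.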

For the features, we'll use the difference in price over those periods. The easiest way to get at that information would be to index the stock prices by date and then access the prices from there using some utility functions. Let's see what that would look like:

(defn index-by [key-fn coll]
  (into {} (map #(vector (key-fn %) %) coll)))

(defn get-stock-date [stock-index date]
  (if-let [price (stock-index date)]
    price
    (if (<= (time/year date) 1990)
      nil
      (get-stock-date
        stock-index (time/minus date (time/days 1))))))

We can use index-by to index a collection of anything into a map. The other function, get-stock-date, attempts to get the StockData instance for a date from the index. If it doesn't find one, it tries the previous day, which is how weekends and holidays fall back to the previous business day. If the search reaches 1990 or earlier without finding anything, it just returns nil.
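
To see the fallback in action, here is a small sketch. The two functions are copied from the listing above so that the example is self-contained. The sample rows are made up, and plain maps with :date and :close keys stand in for StockData instances, which is an assumption about that record's fields.

```clojure
(require '[clj-time.core :as time])

;; Copied from the text so that this sketch runs on its own.
(defn index-by [key-fn coll]
  (into {} (map #(vector (key-fn %) %) coll)))

(defn get-stock-date [stock-index date]
  (if-let [price (stock-index date)]
    price
    (if (<= (time/year date) 1990)
      nil
      (get-stock-date
        stock-index (time/minus date (time/days 1))))))

;; Hypothetical rows for Thursday and Friday, March 6 and 7, 2014.
;; Plain maps stand in for StockData (an assumption).
(def sample-stocks
  [{:date (time/date-time 2014 3 6) :close 10.0}
   {:date (time/date-time 2014 3 7) :close 11.0}])

(def sample-index (index-by :date sample-stocks))

;; Sunday, March 9, 2014 has no entry, so the lookup walks back to Friday.
(def sunday-close
  (:close (get-stock-date sample-index (time/date-time 2014 3 9))))

;; Nothing precedes this date in the index, so the search gives up in 1990.
(def too-early
  (get-stock-date sample-index (time/date-time 1991 1 3)))
```

Here sunday-close is 11.0, Friday's closing price, and too-early is nil.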

Now let's get the input feature vector from a NewsArticle instance and the stock index.

The easy part of this will be getting the token vector. Getting the price vector will be more complicated, and we'll be doing almost the same thing twice: once looking backward from the article for the input vector, and once looking forward from the article for the output vector. Since generating these two vectors is mostly the same, we'll write one function that does it and accepts a function parameter for the part that differs, as shown in the following code:

(defn make-price-vector [stock-index article date-op]
  (let [pub-date (:pub-date article)
        base-price (:close (get-stock-date stock-index pub-date))
        price-feature
        (fn [period]
          (let [date-key (date-op pub-date period)]
            (if-let [stock (get-stock-date stock-index date-key)]
              (/ (price-diff base-price (:close stock))
                 base-price)
              0.0)))]
    (vec (remove nil? (map price-feature periods)))))

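The make-price-vector function gets the base price from the day the article was published. It then applies each of the periods that we outlined previously and finds the closing stock price for each resulting day. It takes the difference between the two prices and divides it by the base price, so each feature is a relative change rather than an absolute one. If no stock data is found for a period, the feature is 0.0.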

The date-op parameter computes the second day to find the stock price for. It will either add the period to the article's publish date or subtract it, depending on whether we're looking into the future or the past.
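
The arithmetic behind each price feature can be sketched on its own. Note that make-price-vector calls price-diff, which this section does not define. The version below is a hypothetical stand-in that assumes it returns the other price minus the base price; the real helper may use the opposite sign. The relative-change function is likewise only an illustrative name.

```clojure
;; Hypothetical stand-in: this section calls price-diff without defining it.
;; We assume it is the other price minus the base price.
(defn price-diff [base-price other-price]
  (- other-price base-price))

;; Relative change, mirroring the expression inside price-feature.
(defn relative-change [base-price other-price]
  (/ (price-diff base-price other-price) base-price))

;; A move from 50.0 to 55.0 is a 10 percent rise.
(def rise (relative-change 50.0 55.0))

;; A move from 50.0 to 45.0 is a 10 percent fall.
(def fall (relative-change 50.0 45.0))
```

Dividing by the base price keeps the features on a comparable scale across stocks and dates, whatever the absolute price level.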

We can build on this to make the input vector, which will contain the token vector and the price vector, as shown in the following code:

(defn make-feature-vector [stock-index vocab article]
  (let [freqs (:text article)
        token-features (map #(freqs % 0.0) (sort vocab))
        price-features (make-price-vector
                         stock-index article time/minus)]
    (vec (concat token-features price-features))))

For the token vector, we get the frequencies from the NewsArticle instance in sorted vocabulary order, using 0.0 for any token that the article doesn't contain. Sorting keeps the order the same across all NewsArticle instances. We call make-price-vector to get the prices for the offset days. Then we concatenate all of them into one (Clojure) vector.
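
The following sketch walks through that assembly with made-up data. The vocabulary, the frequencies, and the price features are all hypothetical values; the price features simply stand in for what make-price-vector would return.

```clojure
;; Hypothetical vocabulary and per-article token frequencies.
(def vocab #{"market" "rally" "stock"})
(def freqs {"rally" 0.5, "stock" 0.25})

;; Look up each vocabulary token in sorted order, defaulting to 0.0.
(def token-features (map #(freqs % 0.0) (sort vocab)))

;; Made-up values standing in for the output of make-price-vector.
(def price-features [0.1 -0.02])

;; Token features come first, followed by the price features.
(def feature-vector (vec (concat token-features price-features)))
;; => [0.0 0.5 0.25 0.1 -0.02]
```

The token "market" is missing from the article, so it contributes 0.0 in the first position.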

The preceding code gives us the input vector. However, we'll also want to have future prices as the output vector:

(defn make-training-vector [stock-index article]
  (vec (make-price-vector stock-index article time/plus)))

The preceding code is just a thin wrapper over make-price-vector. It calls this function with time/plus to get the future price changes.

Finally, we'll write a function that takes a stock index, a vocabulary, and a collection of articles. For each article, it will generate both the input vector and the expected outputs, and it will return them in a hash map: the input vector under :input, and a map from each period to its future price change under :outputs. The code for this function is given as follows:

(defn make-training-set [stock-index vocab articles]
  (let [make-pair
        (fn [article]
          {:input (make-feature-vector stock-index vocab article)
           :outputs (zipmap periods
                            (make-training-vector
                              stock-index article))})]
    (map make-pair articles)))

This code will make it easy to generate a training set from the data that we've been working with.
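
Since the outputs are keyed by period, we can pull out the target for a single time horizon. The sketch below uses a hand-built training set with made-up numbers in the shape that make-training-set returns. The pairs-for-period helper is a hypothetical name that is not part of the chapter's code.

```clojure
(require '[clj-time.core :as time])

;; Hypothetical training set with only two periods, for brevity.
(def sample-periods [(time/days 1) (time/days 2)])

(def sample-set
  [{:input [0.0 0.5 0.25 0.1]
    :outputs (zipmap sample-periods [0.01 -0.03])}
   {:input [1.0 0.0 0.0 -0.2]
    :outputs (zipmap sample-periods [0.0 0.05])}])

;; Hypothetical helper: pair each input vector with the target for one period.
(defn pairs-for-period [period training-set]
  (map (fn [{:keys [input outputs]}]
         [input (get outputs period)])
       training-set))

(def two-day-pairs (vec (pairs-for-period (time/days 2) sample-set)))
;; => [[[0.0 0.5 0.25 0.1] -0.03] [[1.0 0.0 0.0 -0.2] 0.05]]
```

Keeping all of the horizons in one map lets us try each period against the same inputs without regenerating the feature vectors.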
