Cross-validating the results

As I've already mentioned, the dataset for this chapter is a manually coded group of 500 hotel reviews taken from the OpinRank dataset. For this experiment, we'll break these into 10 chunks of 50 reviews each.

These chunks will allow us to use K-fold cross validation to test how our system is doing. Cross validation is a way of checking your algorithm and procedures by splitting your data into equally sized chunks, or folds. You train the system on all of the folds but one; those folds make up the training set. You then run the trained system on the held-out fold, the validation set, and calculate the error. Next, you use another fold as the validation set and start over again. Finally, you average the errors from all of the trials.

For example, suppose the validation procedure uses four folds: A, B, C, and D. For the first run, A, B, and C would be the training set, and D would be the validation set. Next, A, B, and D would be the training set, and C would be the validation set. This would continue until every fold has been used as the validation set once.
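
To make the rotation concrete, here is a quick REPL sketch (using keyword placeholders for the folds rather than real review data) that enumerates which fold serves as the validation set in each run:

user=> (let [folds [:a :b :c :d]]
         (for [i (range (count folds))]
           {:validation (folds i)
            :training (concat (subvec folds 0 i)
                              (subvec folds (inc i)))}))
({:validation :a, :training (:b :c :d)}
 {:validation :b, :training (:a :c :d)}
 {:validation :c, :training (:a :b :d)}
 {:validation :d, :training (:a :b :c)})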

This may seem like a lot of work, but it helps us make sure that we didn't just get lucky with our choice of training or validation data. It provides a much more robust way of estimating the error rate and accuracy of our classifier.

The main trick in implementing cross validation is that Clojure's native partitioning functions (partition and partition-all) don't handle extra items exactly the way we'd like. The partition function simply throws the extras away, and partition-all tacks all of the extras onto the end as a smaller group. What we'd like is to spread the extras over the preceding chunks: each chunk gets one extra item until the remainders are exhausted. To handle this, we'll define a function named partition-spread. It will partition the first part of the collection into larger chunks and the second part into smaller chunks.

Unfortunately, this approach requires holding the entire collection in memory at once (group-by realizes all of its input), so this algorithm isn't a good fit for very large sequences:

(defn partition-spread [k coll]
  (let [;; Pair each item with its chunk number, its index modulo k.
        get-mod (fn [i x]
                  [(mod i k) x])
        ;; Pull the items back out of a [chunk-number pairs] group.
        map-second #(map second (second %))]
    (->> coll
         (map-indexed get-mod)
         (group-by first)
         (sort-by first)   ; group-by's map ordering isn't guaranteed
         (map map-second))))

We can now see how these partitioning functions differ:

user=> (partition 4 (range 10))
((0 1 2 3) (4 5 6 7))
user=> (partition-all 4 (range 10))
((0 1 2 3) (4 5 6 7) (8 9))
user=> (xv/partition-spread 4 (range 10))
((0 4 8) (1 5 9) (2 6) (3 7))
user=> (xv/partition-spread 3 (range 10))
((0 3 6 9) (1 4 7) (2 5 8))

We can also see that the semantics of the first parameter have changed: instead of indicating the size of each partition, it now specifies the number of partitions, and the partitions are all of roughly equal size.
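
As a quick check in the REPL, we can count the items in each chunk:

user=> (map count (xv/partition-spread 4 (range 10)))
(3 3 2 2)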

Next, we'll create a couple of functions that pull out each chunk in turn to use as the validation set and concatenate all of the other chunks into the training set:

(defn step-folds-seq [folds steps]
  (lazy-seq
    (when-let [[s & ss] (seq steps)]
      (let [;; Split at this step's index; the head of the second
            ;; part is the validation fold.
            [prefix [validation & suffix]] (split-at s folds)
            ;; Join the remaining folds into the training set. Using
            ;; apply concat avoids flattening the items themselves.
            training (apply concat (concat prefix suffix))
            current [validation training]]
        (cons current (step-folds-seq folds ss))))))

(defn step-folds [folds]
  (step-folds-seq folds (range (count folds))))

Now, by partitioning into chunks with one element each, we can clearly see just how the K-fold partitioning works. Each time, a new chunk is selected as the validation set (the first item), and the rest of the chunks are concatenated into the training set (the second item):

user=> (xv/step-folds (xv/partition-spread 10 (range 10)))
([(0) (1 2 3 4 5 6 7 8 9)] [(1) (0 2 3 4 5 6 7 8 9)]
 [(2) (0 1 3 4 5 6 7 8 9)] [(3) (0 1 2 4 5 6 7 8 9)]
 [(4) (0 1 2 3 5 6 7 8 9)] [(5) (0 1 2 3 4 6 7 8 9)]
 [(6) (0 1 2 3 4 5 7 8 9)] [(7) (0 1 2 3 4 5 6 8 9)]
 [(8) (0 1 2 3 4 5 6 7 9)] [(9) (0 1 2 3 4 5 6 7 8)])

Now we can define a function that controls the K-fold validation process. It takes the training and error steps as function parameters, and it just handles partitioning the data into groups, calling the training and error functions, and combining their output into one result:

(defn k-fold
  ([train error combine data]
   (k-fold train error combine 10 data))
  ([train error combine k input-data]
   (->> input-data
        shuffle
        (partition-spread k)
        step-folds
        ;; Train a model on each training set, keeping its validation fold.
        (map (fn [[v t]] [v (train t)]))
        ;; Compute and report each fold's error.
        (map (fn [[v t]]
               (let [err (error t v)]
                 (println :error err)
                 err)))
        ;; Fold the per-run errors into a single result.
        (reduce combine (combine)))))
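
To see how the three function parameters fit together, here is a quick sketch with hypothetical stand-ins (fake-train and fake-error are illustrative only, not the functions we'll build for the classifier). The "model" is simply the mean of the training items, the error is the mean absolute deviation of the validation items from that mean, and + serves as combine, so the result is the total error across the folds:

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

;; Hypothetical stand-ins that show the signatures k-fold expects.
(defn fake-train [training]
  (mean training))

(defn fake-error [model validation]
  (mean (map #(Math/abs (double (- % model))) validation)))

;; Prints each fold's error and returns their sum; divide by k to average.
(/ (k-fold fake-train fake-error + 10 (range 100)) 10.0)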

Now we need to decide what constitutes an error and how we'll compute it.
