Generating online summary statistics for data streams with reducers

We can use reducers in a lot of different situations, but sometimes we need to change how we process data to do so.

For this example, we'll show you how to compute summary statistics with reducers. We'll use some algorithms and formulas, first proposed by Tony F. Chan, Gene H. Golub, and Randall J. LeVeque in 1979, and later extended by Timothy B. Terriberry in 2007. These allow you to approximate the mean, standard deviation, and skew for online data (that is, to stream data that we might only see once). So, we will need to compute all of the statistics on one pass without holding the full collection in memory.

The following formulas are a little complicated and difficult to read in lisp notation. However, there's a good overview of this process, with formulas, on the Wikipedia page for Algorithms to calculate variance (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance). In order to somewhat simplify this example, we'll only calculate the mean and variance.

Getting ready

For this, we'll need to have easy access to the reducers library and the Java Math class:

(require '[clojure.core.reducers :as r])
(import '[java.lang Math])

How to do it…

For this recipe, we'll first define the accumulator data structures and then the accumulator functions. Finally, we'll put it all together.

  1. We need to define a data structure to store all of the data that we want to accumulate and keep track of:
    (def zero-counts {:n (long 0), :s 0.0, 
                      :mean 0.0, :m2 0.0})
  2. Now, we'll need some way to add data to the counts and accumulation. The accum-counts function will take care of this:
    (defn accum-counts
      ([] zero-counts)
      ([{:keys [n mean m2 s] :as accum} x]
       (let [new-n (long (inc n))
             delta (- x mean)
             delta-n (/ delta new-n)
             term-1 (* delta delta-n n)
             new-mean (+ mean delta-n)]
         {:n new-n
          :mean new-mean
          :s (+ s x)
          :m2 (+ m2 term-1)})))
  3. Next, we'll need a way to combine two accumulators. This has the complete, unsimplified versions of the formulae from accum-counts. Because some of the numbers can get very large and can overflow the range of the primitive Java types, we'll use *'. This is a variant of the multiplication operator that automatically promotes values into Java's BigInteger types instead of overflowing:
    (defn op-fields
      "A utility function that calls a function on
      the values of a field from two maps."
      [op field item1 item2]
      (op (field item1) (field item2)))
    
    (defn combine-counts
      ([] zero-counts)
      ([xa xb]
       (let [n (long (op-fields + :n xa xb))
             delta (op-fields - :mean xb xa)
             nxa*xb (*' (:n xa) (:n xb))]
         {:n n
          :mean (+ (:mean xa) (* delta (/ (:n xb) n)))
          :s (op-fields + :s xa xb)
          :m2 (+ (:m2 xa) (:m2 xb)
                 (* delta delta (/ nxa*xb n)))})))
  4. Now, we need a way to take the accumulated counts and values and turn them into the final statistics:
    (defn stats-from-sums [{:keys [n mean m2 s] :as sums}]
       {:mean (double (/ s n))
        :variance (/ m2 (dec n))})
  5. Finally, we can combine all of these functions to produce results:
    (defn summary-statistics [coll]
      (stats-from-sums
        (r/fold combine-counts accum-counts coll)))

For a pointless example, we can use this to find summary statistics on 1,000,000 random numbers:

user=> (summary-statistics (repeatedly 1000000 rand))
{:mean 0.5004908831693459, :variance 0.08346136740444697}
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.58.50.156