Parallelizing with reducers

In the last recipe, Combining function calls with reducers, we looked at the ability of reducers to compose multiple sequence processing functions into one function. This saves the effort of creating intermediate data structures.

Another feature of reducers is that they can automatically partition and parallelize the processing of tree-based data structures. This includes Clojure's native vectors and hash maps.

For this recipe, we'll continue the Monte Carlo simulation example that we started in the Partitioning Monte Carlo simulations for better pmap performance recipe. In this case, we'll write a version that uses reducers and see how it performs.

Getting ready

From the Partitioning Monte Carlo simulations for better pmap performance recipe, we'll use the same imports, as well as the rand-point, center-dist, and mc-pi functions. Along with these, we also need to require the reducers and Criterium libraries:

(require '[clojure.core.reducers :as r])
(use 'criterium.core)

Also, if you're using Java 1.6, you'll need the ForkJoin library, which you can get by adding this to your project.clj dependencies:

[org.codehaus.jsr166-mirror/jsr166y "1.7.0"]
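
If you don't have that recipe handy, here's a rough sketch of what those helpers might look like. The names are the ones used in this recipe, but the exact bodies (including the count-in-circle helper) are a plausible reconstruction rather than the original code: rand-point picks a random point in the unit square, center-dist measures a point's distance from the origin, and mc-pi computes the approximation serially:

;; Pick a random point in the unit square.
(defn rand-point [] [(rand) (rand)])

;; Distance of a point from the origin.
(defn center-dist [[x y]]
  (Math/sqrt (+ (* x x) (* y y))))

;; Count how many of n random points fall inside the unit circle.
(defn count-in-circle [n]
  (->> (repeatedly n rand-point)
       (map center-dist)
       (filter #(<= % 1.0))
       count))

;; Approximate pi from the ratio of points inside the circle.
(defn mc-pi [n]
  (* 4.0 (/ (count-in-circle n) n)))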

How to do it…

  1. This version of the Monte Carlo pi approximation algorithm will be structured similarly to mc-pi from the Partitioning Monte Carlo simulations for better pmap performance recipe. First, we'll define a count-in-circle-r function that uses the reducers library to compose the processing steps and spread the work over the available cores:
    (defn count-items [c _] (inc c))
    (defn count-in-circle-r [n]
      (->> (repeatedly n rand-point)
           vec
           (r/map center-dist)
           (r/filter #(<= % 1.0))
           (r/fold + count-items)))
    
    (defn mc-pi-r [n]
       (* 4.0 (/ (count-in-circle-r n) n)))
  2. Now, we can use Criterium to compare the two functions:
    user=> (quick-bench (mc-pi 1000000))
    WARNING: Final GC required 1.5511109061811519 % of runtime
    WARNING: Final GC required 1.3658615618441179 % of runtime
    Evaluation count : 6 in 6 samples of 1 calls.
                 Execution time mean : 983.603832 ms
        Execution time std-deviation : 16.553276 ms
       Execution time lower quantile : 964.015999 ms ( 2.5%)
       Execution time upper quantile : 1.007418 sec (97.5%)
                       Overhead used : 1.875845 ns
    
    Found 1 outliers in 6 samples (16.6667 %)
            low-severe       1 (16.6667 %)
     Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
    nil
    user=> (quick-bench (mc-pi-r 1000000))
    WARNING: Final GC required 8.023979507099268 % of runtime
    Evaluation count : 6 in 6 samples of 1 calls.
                 Execution time mean : 168.998166 ms
        Execution time std-deviation : 3.615209 ms
       Execution time lower quantile : 164.074999 ms ( 2.5%)
       Execution time upper quantile : 173.148749 ms (97.5%)
                       Overhead used : 1.875845 ns
    nil

Not bad. On eight cores, the version without reducers is almost six times slower. This is all the more impressive because we made relatively minor changes to the original code, especially compared to the version of this algorithm that partitioned the input before passing it to pmap, which we also saw in the Partitioning Monte Carlo simulations for better pmap performance recipe.

How it works…

The reducers library does a couple of things in this recipe. Let's take a look at some lines from count-in-circle-r. Converting the input to a vector is important because r/fold can only parallelize tree-based structures such as vectors and hash maps; a generic lazy sequence quietly falls back to a serial reduce.
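
You can see the difference at the REPL. The following is just a sketch using a simple sum rather than our point data, but the same rule applies to count-in-circle-r: only the vector version is eligible for parallel fork/join processing:

     ;; A lazy sequence is reducible, but not foldable in parallel;
     ;; r/fold silently falls back to a serial reduce here.
     (r/fold + (range 1000000))
     ;; => 499999500000

     ;; A vector can be partitioned, so r/fold can sum the segments
     ;; on separate threads and combine the partial results.
     (r/fold + (vec (range 1000000)))
     ;; => 499999500000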

Next, these two lines are combined into one reducer function that doesn't create an intermediate sequence between the calls to r/map and r/filter. This is a small but important optimization, especially if we stack more functions into this stage of the process:

     (r/map center-dist)
     (r/filter #(<= % 1.0))
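
One way to see that no intermediate sequence is created is to look at what these calls return. As a rough sketch at the REPL (the sample output is only illustrative), r/map and r/filter hand back lightweight reducible objects rather than realized collections, and no work happens until something reduces or folds them, for example with into:

     ;; Returns a foldable/reducible object, not a sequence.
     (r/map center-dist (vec (repeatedly 5 rand-point)))

     ;; Realizing the whole pipeline happens in a single pass.
     (into [] (->> (vec (repeatedly 5 rand-point))
                   (r/map center-dist)
                   (r/filter #(<= % 1.0))))
     ;; => something like [0.87 0.42 0.95], with any distances
     ;;    greater than 1.0 removed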

The bigger optimization is in the call to r/fold. r/reduce always processes its input serially, but if the input is a tree-based data structure, r/fold will employ a fork/join pattern to parallelize it. This line takes the place of a call to count by incrementing a counter for every item that makes it through the map and filter stages:

     (r/fold + count-items))))
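
In this call, + is the combining function and count-items is the reducing function: each segment of the vector is reduced with count-items, starting from (+), which is 0, and the per-segment counts are then merged with +. Here's a minimal sketch of the same pattern in isolation, reusing count-items from step 1:

     ;; Counting 10,000 items with fold: count-items increments a
     ;; counter per item in each segment, (+) seeds each segment with 0,
     ;; and + merges the per-segment counts.
     (r/fold + count-items (vec (range 10000)))
     ;; => 10000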

Graphically, this process looks something like the following chart:

The reducers library shows a lot of promise for automatically parallelizing structured operations with a level of control and simplicity that we haven't seen elsewhere.

See also

  • We'll see another example with reducers in the next recipe, Generating online summary statistics for data streams with reducers