Eager Transformations

While lazy sequences are the right solution in many cases, they’re not the best answer in every case. An eager approach can be better when we wish to construct an output collection rather than a sequence, optimize memory and performance for transformations of large collections, or control external resources.

Producing Output Collections

Sequence transformations produce output only as needed, which allows us to avoid unnecessary work and even to work with infinite sequences. However, it’s also common to encounter situations where we want our output to be a persistent collection, not a sequence. In this case, we want all of the transformation to be completed, so we won’t get any advantage from laziness.

For example, consider applying a function to a sequence to produce an output vector. In this case we’d need to apply a final vec function to transfer the elements of the sequence back into a collection:

 (​defn​ square [x] (* x x))
 
 (​defn​ sum-squares-seq [n]
  (vec (map square (range n))))

Recall that sequences cache the result of their evaluation so that they’re safe, immutable values, just like collections. This example will produce an input sequence (from the range), which stores the values in memory, then the output of the map, which stores another sequence of values, and then the final output vector. Sequences will share values if they are the same across sequences (as they are immutable), but you’ll have three times as much sequence overhead for the objects holding the values.

Instead, we can perform the intermediate transformation on the input collection values and place the result directly into the output vector using into with a map transducer. Let’s first see what that looks like before we discuss what it means:

 (​defn​ sum-squares
  [n]
  (into [] (map square) (range n)))

Here the call to into takes three arguments: the output collection, the transformation, and the input collection. The transformation here is a transducer. Many of the sequence functions (like map) can be called without their input collection argument to return a transducer. A transducer is a function that captures the essence of a collection transformation without tying it to the form of either the input collection or the output collection.

The into provides the elements from the input range to the map transformation and then adds the results directly into the output collection. This happens eagerly and returns the final vector, avoiding the creation of the intermediate lazy sequences, which are just overhead if we know the output target is a collection.

In the next section, we’ll see a more complicated example and examine the performance implications more closely.

Optimizing Performance

In the previous example we saw the difference between a single sequence transformation and a single transducer. We can also compose transducers and apply multiple transformations within into in a single pass over the input.

Imagine you want to find all of the predicate functions in the namespaces we’ve loaded so far. Predicate functions typically end in ?, so we can find all the loaded namespaces, then find their public vars, filter down to just those ending in ?, and finally convert them to friendly names. Using sequences, we often chain transformations together with ->>:

 (​defn​ preds-seq []
  (->> (all-ns)
  (map ns-publics)
  (mapcat vals)
  (filter #(clojure.string/ends-with? % ​"?"​))
  (map #(str (.-sym %)))
  vec))

Each step of this transformation creates a sequence, which caches the intermediate values of the sequence. Finally, at the end we collect the result into a vector.

Transducers can be composed in a similarly pipelined fashion using comp. Transducers are composed in a stack-like fashion, which means that comp combines them left-to-right, just like ->>. Using chained transducers with into looks like this equivalent function:

 (​defn​ preds []
  (into []
  (comp (map ns-publics)
  (mapcat vals)
  (filter #(clojure.string/ends-with? % ​"?"​))
  (map #(str (.-sym %))))
  (all-ns)))

The sequence implementation creates four intermediate sequences. The transducer implementation composes the four transformations into a single combined transformation and applies it during a single traversal of the (all-ns) input sequence. It then places the results directly into the output vector. Removing the intermediate sequences can result in a significant reduction in memory usage, particularly if the size of the input collection is large or the number of transformations is large.

The transducer version can also take advantage of some extra optimizations. First, if the input collection is reducible (that is, it knows how to perform a reduce on itself), then the resulting compiled code is often far more efficient. Here we can’t take advantage of this, but in the sum of squares example, range was a reducible collection.

Secondly, some Clojure collections can be more efficiently bulk loaded using transients. Transients temporarily make an immutable collection mutable while adding values, then switch the collection back to a normal immutable collection when done. This happens in a constrained scope such that users are never exposed to the mutable version. into (and some other transducer aware functions) automatically takes advantage of this optimization.

Now that you’ve seen some of the performance advantages of doing eager transformations with transducers, let’s consider how laziness complicates managment of external resources and how eager transformation can help.

Managing External Resources

When data is read from external resources like files, databases, or web services, it’s tempting to return a lazy sequence so we can begin processing the data as it’s read:

 (​defn​ non-blank? [s]
  (not (clojure.string/blank? s)))
 
 (​defn​ non-blank-lines-seq [file-name]
  (​let​ [reader (clojure.java.io/reader file-name)]
  (filter non-blank? (line-seq reader))))

The problem here is that the reader is not being closed. This resource is being left open and stranded in this code. Beyond that, it’s not clear when it could be closed, because we need to wait for the processing of the last line in the lazy sequence of lines. Another approach is to eagerly process all of the lines, then close the resource before returning:

 (​defn​ non-blank-lines [file-name]
  (with-open [reader (clojure.java.io/reader file-name)]
  (into [] (filter non-blank?) (line-seq reader))))

This solves the dangling resource problem but only by eagerly creating the entire vector of non-blank lines and returning it. That’s fine for small files but is a recipe for an eventual OutOfMemoryError with a sufficiently large file. What we really want is to process the file, without the caching aspects of lazy sequences, and to know when the input has been exhausted so that the reader can be closed.

First let’s refactor the non-blank-lines function to take a reader and return an eduction:

 (​defn​ non-blank-lines-eduction [reader]
  (eduction (filter non-blank?) (line-seq reader)))

An eduction is a suspended transformation (like a sequence), but it processes the entire input each time its asked (usually just once). Because it’s processed anew each time, there’s no caching as with a sequence. Instead, the transformation is just applied to the input to produce an output in a full pass through the data, usually performed with a reduce (when the output is a collection) or an into (when the output is a single computed value). For example, we can use non-blank-lines-eduction to count lines:

 (​defn​ line-count [file-name]
  (with-open [reader (clojure.java.io/reader file-name)]
  (reduce (​fn​ [cnt el] (inc cnt)) 0 (non-blank-lines-eduction reader))))

In line-count, we first create the reader in a with-open block, which will automatically close the reader before returning the result of the body. Next we process the eduction in the context of a reduce. The eduction processes each line and then releases the associated memory, so the eduction will hold only one line in memory at a time.

Transducers and eductions allow us to choose exactly when an input source is processed and thus know when the external resource is done and ready to release. With lazy sequences, it’s much harder to know when processing is “done” and it’s time to release.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.7.102