Lazily processing very large data sets

One of Clojure's strengths is that most of its sequence-processing functions are lazy. This lets us handle very large datasets with very little effort. However, when laziness is combined with reading from files and other I/O, there are several things you need to watch out for.

In this recipe, we'll take a look at several ways to safely and lazily read a CSV file. By default, clojure.data.csv/read-csv is lazy, so how do you maintain this laziness while still closing the file at the right time?

Getting ready

We'll use a project.clj file that includes a dependency on the Clojure CSV library:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [org.clojure/data.csv "0.1.2"]])

We need to load the libraries that we're going to use into the REPL:

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

How to do it…

We'll try several solutions and consider their strengths and weaknesses:

  1. Let's start with the most straightforward way:
    (defn lazy-read-bad-1 [csv-file]
      (with-open [in-file (io/reader csv-file)]
        (csv/read-csv in-file)))
    user=> (lazy-read-bad-1 "data/small-sample.csv")
    IOException Stream closed  java.io.BufferedReader.ensureOpen (BufferedReader.java:97)

    Oops! At the point where the function returns the lazy sequence, it hasn't read any data yet. However, when exiting the with-open form, the file is automatically closed. What happened?

    First, the file is opened and passed to read-csv, which returns a lazy sequence. The lazy sequence is returned from with-open, which closes the file. Finally, the REPL tries to print out this lazy sequence. Now, read-csv tries to pull data from the file. However, at this point the file is closed, so the IOException is raised.

    This is a pretty common problem for the first draft of a function. It especially seems to bite me whenever I'm doing database reads, for some reason.

  2. So, in order to fix this, we'll just force all of the lines to be read:
    (defn lazy-read-bad-2 [csv-file]
      (with-open [in-file (io/reader csv-file)]
        (doall
          (csv/read-csv in-file))))

    This will return data, but everything gets loaded into memory at once. Now we have safety, but no laziness.

  3. Here's how we can get both:
    (defn lazy-read-ok [csv-file]
      (with-open [in-file (io/reader csv-file)]
        (frequencies
          (map #(nth % 2) (csv/read-csv in-file)))))

    This is one way to do it: we've moved the processing of the data into the function that reads it. This works, but it has poor separation of concerns. The function both reads and processes the data, and we really should break these into two functions.

  4. Let's try it one more time:
    (defn lazy-read-csv [csv-file]
      (let [in-file (io/reader csv-file)
            csv-seq (csv/read-csv in-file)
            lazy (fn lazy [wrapped]
                   (lazy-seq
                     (if-let [s (seq wrapped)]
                       (cons (first s) (lazy (rest s)))
                       (.close in-file))))]
        (lazy csv-seq)))

This works! Let's talk about why.

How it works…

The last version of the function, lazy-read-csv, works because it takes the lazy sequence that csv/read-csv produces and wraps it in another sequence that closes the input file when there is no more data coming out of the CSV file. This is complicated because we're working with two levels of input: reading from the file and reading CSV. When the higher-level task (reading CSV) is completed, it triggers an operation on the lower level (reading the file). This allows you to read files that don't fit into memory and process their data on the fly.
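We can see this close-on-exhaustion trick in isolation with a minimal sketch (the closed? atom and lazy-wrap are illustrative stand-ins, not part of the recipe): we wrap a plain sequence and flip a flag at exactly the point where lazy-read-csv calls .close:

```clojure
;; A stand-in for the file handle: has it been "closed" yet?
(def closed? (atom false))

(defn lazy-wrap [wrapped]
  (lazy-seq
    (if-let [s (seq wrapped)]
      (cons (first s) (lazy-wrap (rest s)))
      (do (reset! closed? true)   ; where lazy-read-csv calls (.close in-file)
          nil))))                 ; nil ends the sequence, as .close's nil does

(def xs (lazy-wrap [1 2 3]))
(first xs)   ;=> 1, and @closed? is still false: nothing exhausted yet
(doall xs)   ; walking off the end flips the flag
@closed?     ;=> true
```

The flag only flips when the wrapped sequence runs dry, which is why lazy-read-csv stays lazy right up until the last row has been consumed.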

Moreover, with this function, we again have a nice, simple interface that we can present to callers while keeping the complexity hidden.

Unfortunately, this still has one glaring problem: if we don't read the entire file (say we're only interested in the first 100 lines), the file handle won't get closed. For use cases in which only part of the file will be read, lazy-read-ok is probably the best option.
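Another common workaround is to invert control: the caller passes in a processing function, and we force its result inside with-open, so the file closes even when only part of it is read. Here's a sketch along those lines (process-csv and the inline sample file are illustrative names, not part of the original recipe):

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(defn process-csv
  "Opens csv-file, applies f to the lazy sequence of rows, and forces
  the result with doall before with-open closes the reader."
  [csv-file f]
  (with-open [in-file (io/reader csv-file)]
    (doall (f (csv/read-csv in-file)))))

;; A small sample file, created inline so the example stands alone:
(spit "sample.csv" "name,count\napple,3\npear,5\n")

;; Only the rows f asks for are realized, and the reader still closes:
(process-csv "sample.csv" #(take 2 %))
;; => (["name" "count"] ["apple" "3"])
```

Like lazy-read-ok, this forces the result while the file is still open, so f must return something finite; the difference is that the caller, not the reading function, now decides what processing happens.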
