23. A CSV Reader

Sooner or later, in your day-to-day work you’ll need to read a comma-separated value (CSV) file and perform an operation on the result. To make it even more interesting, we’re going to do this exercise with four million rows.

In this recipe we’ll write a large CSV to a file, and then read it in Clojure in a way that bypasses the performance overhead of the Clojure reader.

Assumptions

In this chapter we assume the following:

- You have 300 MB of disk space spare.

- You’re OK with waiting 4 minutes for a process to finish.

Benefits

In this chapter you’ll learn how to process a large CSV file (one larger than Excel can handle), avoiding the traps in the Clojure reader and achieving performance comparable to Java.

The Recipe—Code

We’ll start by creating a directory for our demonstration.

1. Create a new Leiningen project csv-example in your projects directory, and change to that directory:

lein new app csv-example
cd csv-example

2. Modify the project.clj to look like the following:

(defproject csv-example "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.7.0-beta2"]]
  :main csv-example.core)

3. Now create the file src/csv_example/generate_csv.clj (note that Leiningen maps the hyphen in the namespace to an underscore on disk) with the following contents:

(ns csv-example.generate-csv
  (:gen-class)
  (:require [clojure.java.io :as io]))

(defn rand-line
  "Return one line for a CSV with four columns of random numbers."
  []
  (str (apply str (repeatedly 4 #(str (rand 100) ","))) "\n"))

(defn -main
  "Generate a CSV."
  [& args]
  (with-open [writer (io/writer "sample-data.csv")]
    (dotimes [_ 4000000]
      (.write writer (rand-line)))))

4. Now modify the file src/csv_example/core.clj to have the following contents:

(ns csv-example.core
  (:gen-class)
  (:require [clojure.java.io :as io]))

(defn -main
  "Import a CSV."
  [& args]
  (println "starting")
  (let [col-sum (atom 0)
        file-name (or (first args) "sample-data.csv")]
    (with-open [stream (io/reader file-name)]
      (loop []
        (let [line (.readLine stream)]
          (when (and line (pos? (count line)))
            (let [col1 (Double/parseDouble (aget (.split ^String line ",") 0))]
              (swap! col-sum + col1)
              (recur))))))
    (println "done - sum is " (str @col-sum))))
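The atom in -main works, but the same traversal can carry the running total as a plain loop binding, with no mutable state at all. Here is a minimal sketch of that variant; the helper name sum-first-column is our own, not part of the recipe:

```clojure
(require '[clojure.java.io :as io])

(defn sum-first-column
  "Sum the first comma-separated column of every non-empty line in file f."
  [f]
  (with-open [rdr (io/reader f)]
    (loop [sum 0.0]
      (let [line (.readLine rdr)]
        (if (and line (pos? (count line)))
          ;; parse the first field and fold it into the accumulator
          (recur (+ sum (Double/parseDouble (aget (.split ^String line ",") 0))))
          sum)))))
```

This avoids the atom’s reset-and-deref round trip on every line; for a single-threaded pass over a file, a loop binding is the more idiomatic accumulator.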

Testing the Solution

Follow these steps to test the solution:

1. First, we’ll generate the CSV file. Run the following command, which executes the -main function of a namespace other than the one declared as :main in project.clj:

lein run -m csv-example.generate-csv

Now wait up to 4 minutes for it to complete. When it is done, you should have a 294 MB CSV file in your directory.

2. Now kick off the import by running the following at the command prompt:

lein run sample-data.csv

You should see the following result (the sum will differ slightly, since the data is random):

starting

... (wait for up to 4 minutes)

done - sum is  1.9998616218292153E8
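As a quick sanity check on that number: the first column is drawn uniformly from [0, 100), so its mean is 50, and 4 million rows should sum to roughly 2 × 10⁸:

```clojure
;; expected sum ≈ rows × mean of (rand 100)
(* 4000000 50.0)
;; => 2.0E8
```

The printed result of 1.9998…E8 is within about 0.01% of this estimate, which is what we’d expect from random data at this scale.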

Notes on the Recipe

For basic scenarios, Clojure’s data.csv library is excellent; its home page walks through a simple tutorial: https://github.com/clojure/data.csv/.
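For a small file, using data.csv looks something like the sketch below (it assumes you have added the org.clojure/data.csv dependency to project.clj, and that a file small.csv exists):

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(with-open [r (io/reader "small.csv")]
  ;; read-csv is lazy, so realize the rows with doall
  ;; before with-open closes the reader
  (doall (csv/read-csv r)))
;; for a file containing "a,b\n1,2\n" this yields (["a" "b"] ["1" "2"])
```

Note that every field comes back as a string; converting columns to numbers is left to you, which is exactly the step we optimize in this recipe.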

On a pragmatic level, many CSV data-transformation tasks can easily be done in Excel. However, Excel has its limits: the maximum number of rows in a worksheet (since Excel 2007) is 1,048,576. Once you go over this number, you need to look at another tool (or split the file, which can get messy later).

This gives us an opportunity to look at data.csv and clojure-csv. The challenge with both of these libraries is that the data they load goes through the Clojure reader, and at scale the Clojure reader slows you down. They are fine choices when performance is not a concern.

In your work scenario, however, Clojure is likely a new tool whose use you need to defend. On top of that, processing a million lines of a CSV takes around 60 seconds even in a good implementation, and the two libraries above in their default settings take longer still. In that scenario, defending the choice of Clojure becomes even harder.

This chapter is about taking up that challenge. How can we make Clojure do something that Excel can’t do, in a timeframe that is somewhat comparable to that required to solve the problem in Java? The benefit will be that we’ll cover file IO in Clojure.

Taking a look at the code in csv-example.generate-csv, we first bring in the Clojure IO library. We then define a function rand-line that returns a string of four random numbers separated by commas. Finally, -main calls this function 4 million times, appending each resulting line to a file.

Now looking at the code in csv-example.core, we set up our namespace to generate a Java class and require Clojure’s java.io wrapper for reading from a file. We then define col-sum, an atom holding the running total. In our -main function we open the file as a stream; then, line by line, we split on the comma, read the first column as a number, and add it to col-sum. When we’re done we print the value of col-sum.

You’ll notice we chose (aget (.split line ",") 0) rather than something like (first (clojure.string/split line #",")). Indexing directly into the Java array avoids building an intermediate Clojure sequence for every line, and keeps the parsed values away from the Clojure reader — the source of the performance issues in other implementations. Using this approach, we get a performance result comparable to raw Java.
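The two extraction styles produce the same field; the difference is purely in the machinery each one builds along the way. A small comparison, using a made-up sample line:

```clojure
(require '[clojure.string :as string])

(let [line "42.0,7,7,7,"]
  [(aget (.split line ",") 0)         ; Java array access, no seq allocated
   (first (string/split line #","))]) ; Clojure sequence access
;; => ["42.0" "42.0"]
```

Over 4 million lines, skipping the per-line sequence allocation is what closes most of the gap with a hand-written Java loop.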

Conclusion

We have gotten started with reading CSV files in Clojure at a level of performance comparable to raw Java. From here we can extend this to apply business logic and handle scenarios where Excel just can’t cut it.
