Parsing CSV files with Cascalog

In the previous recipe, the file we read was a CSV file, but we read it line by line. That's not optimal. Cascading provides a number of taps—sources of data or sinks to send data to—including one for CSV and other delimited data formats. Also, Cascalog has some good wrappers for several of these taps, but not for the CSV one.

In truth, creating a wrapper that exposes all the functionality of the delimited text format tap would be complex. There are options for the delimiter character, the quote character, whether the file includes a header row, the types of the columns, and more. That's a lot of options, and dispatching to the right constructor can be tricky.

We won't worry about how to handle all the options right here. For this recipe, we will create a simple wrapper around the delimited text file tap that includes some of the more common options to read CSV files.

Getting ready

First, we'll need some of the same dependencies we've been using, along with a few new ones. Here are the full dependencies that we'll need in our project.clj file:

(defproject distrib-data "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog "2.1.1"]
                 [org.slf4j/slf4j-api "1.7.7"]]
  :profiles {:dev
             {:dependencies
              [[org.apache.hadoop/hadoop-core "1.2.1"]]}})

Also, we'll need to import a number of namespaces from these libraries into our script or REPL:

(require '[cascalog.logic.ops :as c]
         '[cascalog.cascading.tap :as tap]
         '[cascalog.cascading.util :as u])
(use 'cascalog.api)
(import [cascading.tuple Fields]
        [cascading.scheme.hadoop TextDelimited])

We'll also use the same data file that we used in the Distributing data with Apache HDFS recipe. You can access it either locally or through HDFS, as we did earlier. I'll access it locally for this recipe.

How to do it…

  1. We just need to write a function that creates a cascading.scheme.hadoop.TextDelimited tap scheme with the correct options and then calls the cascalog.tap/hfs-tap Cascalog function with it. That will handle the rest, as shown here:
    (defn hfs-text-delim
      [path & {:keys [fields has-header delim quote-str]
               :as opts
               :or {fields Fields/ALL, has-header false, delim ",",
                    quote-str "\""}}]
      (let [scheme (TextDelimited. (u/fields fields) has-header delim
                                   quote-str)
            tap-opts (mapcat identity
                             (select-keys opts [:sinkmode
                                                :sinkparts
                                                :source-pattern
                                                :sink-template
                                                :templatefields]))]
        (apply tap/hfs-tap scheme path tap-opts)))
  2. Now, let's try this out:
    user=> (?<- (stdout)
         [?origin_airport ?destin_airport]
         ((hfs-text-delim "data/16285/flights_with_colnames.csv"
                          :has-header true)
          ?origin_airport ?destin_airport ?passengers ?flights ?month))
    …
    RESULTS
    -----------------------
    MHK     AMW
    EUG     RDM
    EUG     RDM
    EUG     RDM
    …

How it works…

This function takes a number of options, such as fields, has-header, delim, and quote-str. The defaults are for CSV files, but they can be easily overridden for a variety of other formats. We saw the use of the :has-header option in the previous example.
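
For example, reading a tab-separated file only requires overriding the delimiter. Here's a minimal sketch, assuming a hypothetical data/flights.tsv file:

;; A sketch: read a hypothetical tab-separated file by changing :delim.
(def flights-tsv
  (hfs-text-delim "data/flights.tsv"
                  :delim "\t"
                  :has-header true))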

With the options in hand, it creates a TextDelimited scheme object and finally passes it to the hfs-tap function, which wraps the scheme object in a tap. The tap serves as a data generator, and we bind the values from it to the names in our query.
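
To make that concrete, here is roughly what the call in step 2 builds under the hood. This is just a sketch that spells out the defaults, using the same file path and the :has-header option from that example:

;; A sketch of what hfs-text-delim constructs: a TextDelimited scheme
;; wrapped in an HFS tap.
(let [scheme (TextDelimited. (u/fields Fields/ALL) true "," "\"")]
  (tap/hfs-tap scheme "data/16285/flights_with_colnames.csv"))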

There's more…

Hadoop can consume a number of different file formats. Avro (http://avro.apache.org/) uses JSON schemas to store data in a fast, compact, and binary data format. Sequence files (http://wiki.apache.org/hadoop/SequenceFile) contain a binary key-value store. XML and JSON are also common data formats.
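
Cascalog ships with a sequence file tap out of the box: hfs-seqfile in cascalog.api. The following sketch, which assumes a hypothetical data/flights-seq output directory, copies the CSV data into a sequence file:

;; A sketch: copy the CSV data into a Hadoop sequence file using the
;; built-in hfs-seqfile tap. The output path is hypothetical.
(?- (hfs-seqfile "data/flights-seq")
    (<- [?origin ?destin ?passengers ?flights ?month]
        ((hfs-text-delim "data/16285/flights_with_colnames.csv"
                         :has-header true)
         ?origin ?destin ?passengers ?flights ?month)))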

If we want to parse our own data formats in Cascading or Cascalog, we'll need to write our own source tap (http://docs.cascading.org/cascading/2.5/userguide/html/ch03s05.html). If it's a delimited text format, such as CSV or TSV, we can base the new tap on cascading.scheme.hadoop.TextDelimited, just as we did in this recipe. See the JavaDocs for this class at http://docs.cascading.org/cascading/2.5/cascading-hadoop/cascading/scheme/hadoop/TextDelimited.html for more information on this.
