17. Cascalog and Structured Data

The purpose of this chapter is to provide a project to load structured data into Cascalog so that you can become familiar with the APIs and steps involved.

Assumptions

In this chapter we assume the following:

• You have Leiningen set up.

• You have worked through the previous Cascalog chapters.

Benefits

The benefit of this chapter is that you’ll see a pattern to take free-form data formats (such as sentences) and transform them in advance into the format that Cascalog requires.

The Recipe—Code

To transform free-format data into Cascalog format, follow these steps:

1. Create a new Leiningen project cascalog-pre-format in your projects directory, and change to that directory:

lein new app cascalog-pre-format
cd cascalog-pre-format

2. Then ensure that the project.clj file looks like this:

(defproject cascalog-pre-format "0.1.0-SNAPSHOT"
  :uberjar-name "query-novel.jar"
  :main cascalog-pre-format.query-novel
  :repositories  {"conjars" "http://conjars.org/repo/"}
  :dependencies [[org.clojure/clojure "1.7.0-RC1"]
                 [cascading/cascading-hadoop2-mr1 "2.7.0" ]
                 [cascalog/cascalog-core "2.1.1"]
                 [cascalog/cascalog-more-taps "2.1.1"]]
  :profiles {:provided
             {:dependencies
              [[org.apache.hadoop/hadoop-mapreduce-client-jobclient "2.7.0"]
               [org.apache.hadoop/hadoop-common "2.7.0"]]}})

3. Next we’ll get our source data: a classic book, The Adventures of Huckleberry Finn, which we’ll download from the Gutenberg website. First create a new data directory to hold it:

mkdir data

4. Navigate to the following location in your web browser:

http://www.gutenberg.org/epub/76.txt.utf-8

Then save the resulting file as AdventuresOfHuckleBerryFinn.txt in your data directory. (Don’t copy and paste—use the save function in your browser.)

5. Create a file src/cascalog_pre_format/query_novel.clj with the following contents:

(ns cascalog-pre-format.query-novel
  (:require [clojure.string :as s]
            [cascalog.api :refer :all]
            [cascalog.more-taps :refer [hfs-delimited]])
  (:gen-class))

(defmapcatfn split [line]
  "Reads in a line of string and splits it into sentences on literal periods."
  (filter #(not (empty? %)) (map s/trim (s/split line #"\."))))

(defmapcatfn filter-condition [sentence]
  "Find matching examples."
  (if
    (and
      (> (count (re-seq #"Jim" sentence)) 0)
      (> (count (re-seq #"free man" sentence)) 0))
    (list sentence)))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
         [?filtered]
         ((hfs-delimited in :skip-header? false) ?line)
         (split ?line :> ?sentence)
         (filter-condition ?sentence :> ?filtered)))

6. Create another file src/cascalog_pre_format/format_novel.clj with the following contents:

(ns cascalog-pre-format.format-novel
  (:require [clojure.string :as s])
  (:gen-class))

(defn -main [file & args]
  ; read the whole file
  (let [book (slurp file)
        ; Replace all the newlines trailed by lowercase letters by spaces
        ; note the challenge is "I" in a sentence :) (or any other proper noun)
        book-stripped-newlines (s/join " " (s/split book #"\n(?=[a-z])"))
        ; Split the string into a vector based on periods followed by a space
        book-sentences (s/split book-stripped-newlines #"(?<=\.)\x20")
        ; join the vector into a single string delimited by newlines
        book-sentences-flat (s/join "\n" book-sentences)]
    ; write the whole file
    (spit "./data/tuples.txt" book-sentences-flat)))
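
The three string steps in format-novel can be tried at the REPL on a small string before running them on the whole book. The following is a minimal sketch, assuming Unix newlines and the regular expressions described in the code comments; the function name sentences-per-line and the sample text are made up for illustration.

```clojure
(require '[clojure.string :as s])

;; Hypothetical helper mirroring the three steps of format-novel.
(defn sentences-per-line [text]
  (let [;; 1. unwrap: a newline followed by a lowercase letter becomes a space
        unwrapped (s/join " " (s/split text #"\n(?=[a-z])"))
        ;; 2. split on a space that directly follows a period
        sentences (s/split unwrapped #"(?<=\.)\x20")]
    ;; 3. one sentence per line
    (s/join "\n" sentences)))

;; Made-up sample: three sentences wrapped across three lines.
(def sample
  "was all right. They hadn't\nstirred. I slipped up to bed,\nfeeling ruther blue.")

(println (sentences-per-line sample))
```

Because the lookbehind matches only the space, each sentence keeps its closing period. A wrapped line that begins with a capital letter (such as "I") is not unwrapped by step 1, which is the limitation noted in the code comment.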

Testing the Recipe

Let’s transform the data file we downloaded so that each sentence is on its own line. We’ll do this in Clojure.

1. Run the following command:

lein run -m cascalog-pre-format.format-novel "data/AdventuresOfHuckleBerryFinn.txt"

2. We now have a file of tuples with one sentence on each line. Let’s run a Hadoop query on it. From a command prompt in the cascalog-pre-format directory, enter the following:

rm -rf output
lein uberjar
hadoop jar target/query-novel.jar data/tuples.txt output/match

3. Check you’ve got results with

ls output

and ensure you see a match directory.

4. Now run this command to display the contents of the part-00000 file:

cat output/match/part-00000

You should see the following:

"_Now_, old Jim, you're a free man again, and I bet you won't ever be a slave
no more Jim; and then you kept Tom here so long with the butter in his hat
that you come near spiling the whole business, because the men come before we
was out of the cabin, and we had to rush, and they heard us and let drive at
us, and I got my share, and we dodged out of the path and let them go by, and
when the dogs come they warn't interested in us, but went for the most noise,
and we got our canoe, and made for the raft, and was all safe, and Jim was a
free man, and we done it all by ourselves, and _wasn't_ it bully, Aunty!"

So we just implemented a query to find the big idea of Huckleberry Finn (freedom for Jim).

Notes on the Solution

Notice that we’ve been using a strict tab-delimited file as input to our Hadoop process. This is a Cascading convention, the assumption being that your file is line-oriented and consists of one or more columns.
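
As a rough illustration of what line-oriented, tab-delimited input means, each line is one record and each tab separates two columns. The record below is invented for the example, not taken from the book's data.

```clojure
(require '[clojure.string :as s])

;; Hypothetical three-column record: date, account, amount.
(def record "2015-06-01\tACME\t42.50")

;; Splitting on the tab character recovers the columns.
(def columns (s/split record #"\t"))

(println columns)
```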

This is a reasonable assumption for reading financial information or log files. But what if your data set wasn’t in this format? Suppose you had a binary input format, or a complex data structure representing a protein fold, or simply a novel with sentences spread across multiple lines.

To use the Cascading library with data like this, you need an extra step before the MapReduce job: you preformat your data. At a high level, the process then has a preformat stage whose output feeds the query.

[Figure: the preformat step placed ahead of the MapReduce process]

The particular thing to notice in this code is that split, defined with defmapcatfn, is used inside the Cascalog query to transform the data line by line. filter-condition is also defined with defmapcatfn, and it filters the Cascalog tuples so that only matching rows are returned.
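
The behavior of these two operations can be approximated outside Hadoop with plain Clojure and mapcat: each function returns a sequence of zero or more results per input, and the results are concatenated. This is only a sketch of the idea, not how Cascalog runs the query. The names split-sentences and keep-if-match and the sample lines are hypothetical, and the split is assumed to be on literal periods.

```clojure
(require '[clojure.string :as s])

;; Stand-in for split: one line in, zero or more trimmed sentences out.
(defn split-sentences [line]
  (remove empty? (map s/trim (s/split line #"\."))))

;; Stand-in for filter-condition: a one-element list on a match,
;; nil (no results) otherwise.
(defn keep-if-match [sentence]
  (when (and (re-find #"Jim" sentence)
             (re-find #"free man" sentence))
    (list sentence)))

;; Made-up input lines.
(def lines ["Jim was a free man. We done it all by ourselves."
            "I slipped up to bed."])

(def matches (mapcat keep-if-match (mapcat split-sentences lines)))

(println matches)
```

Returning an empty result for a non-matching sentence is what lets a mapcat-style operation act as a filter.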

If you open the file from the data directory and scroll halfway down, you’ll see text that looks something like this:

she begun to cry, though I couldn't hear her, and her back was to me.  I
slid out, and as I passed the dining-room I thought I'd make sure them
watchers hadn't seen me; so I looked through the crack, and everything
was all right. They hadn't stirred.

I slipped up to bed, feeling ruther blue, on accounts of the thing
playing out that way after I had took so much trouble and run so much
resk about it. Says I, if it could stay where it is, all right; because

This text formatting comes across as conventional to you and me but not to the computer, so it is worth spelling out that

• the sentences all have their ordinary endings of a period (or question mark),

• the sentences run across lines,

• the paragraphs are broken up by two newline characters,

• the sentences start with a capital letter, and

• the lines that start with a capital letter after a full stop on the previous line are the start of a new sentence.
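
The rules above can be written down as a small function. This is a hedged sketch rather than the recipe's own code: it treats a period or question mark followed by whitespace and a capital letter as a sentence boundary, so abbreviations such as "Mr." would be split incorrectly. The name split-by-rules and the sample text are assumptions for illustration.

```clojure
(require '[clojure.string :as s])

(defn split-by-rules [text]
  (->> (s/split text #"(?:\r?\n){2,}")        ; paragraphs end at blank lines
       (mapcat (fn [para]
                 (-> para
                     ;; sentences run across lines: unwrap each paragraph
                     (s/replace #"\s*\n\s*" " ")
                     ;; boundary: . or ? then whitespace then a capital
                     (s/split #"(?<=[.?])\s+(?=[A-Z])"))))
       (map s/trim)
       (remove empty?)))

;; Made-up sample with two paragraphs and wrapped lines.
(def text
  (str "was all right. They hadn't stirred.\n\n"
       "I slipped up to bed, feeling ruther blue, on accounts of the thing\n"
       "playing out that way. Says I, is it all right?\n"
       "Then I slept."))

(doseq [sentence (split-by-rules text)]
  (println sentence))
```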

Conclusion

We’ve seen how to format data to make it accessible to Cascalog when it is not in a tab-delimited format. This is for a small file that could be read and held in memory. What if the file was bigger? Could we build a streaming reader? We’ll look at that in the next chapter.
