The purpose of this chapter is to provide a project to load structured data into Cascalog so that you can become familiar with the APIs and steps involved.
In this chapter we assume the following:
You have Leiningen set up.
You have worked through the previous Cascalog chapters.
The benefit of this chapter is that you’ll see a pattern to take free-form data formats (such as sentences) and transform them in advance into the format that Cascalog requires.
To transform free-format data into Cascalog format, follow these steps:
1. Create a new Leiningen project, cascalog-pre-format, in your projects directory, and change to that directory:
lein new app cascalog-pre-format
cd cascalog-pre-format
2. Then ensure that the project.clj file looks like this:
(defproject cascalog-pre-format "0.1.0-SNAPSHOT"
  :uberjar-name "query-novel.jar"
  :main cascalog-pre-format.query-novel
  :repositories {"conjars" "http://conjars.org/repo/"}
  :dependencies [[org.clojure/clojure "1.7.0-RC1"]
                 [cascading/cascading-hadoop2-mr1 "2.7.0"]
                 [cascalog/cascalog-core "2.1.1"]
                 [cascalog/cascalog-more-taps "2.1.1"]]
  :profiles {:provided
             {:dependencies
              [[org.apache.hadoop/hadoop-mapreduce-client-jobclient "2.7.0"]
               [org.apache.hadoop/hadoop-common "2.7.0"]]}})
3. Next we'll get our source data. We'll start with a classic book, The Adventures of Huckleberry Finn, which we'll download from the Gutenberg website. First, create a new data directory to hold it:
mkdir data
4. Navigate to the following location in your web browser:
http://www.gutenberg.org/epub/76.txt.utf-8
Then save the resulting file as AdventuresOfHuckleBerryFinn.txt in your data directory. (Don't copy and paste; use the save function in your browser.)
5. Create a file src/cascalog_pre_format/query_novel.clj with the following contents:
(ns cascalog-pre-format.query-novel
  (:require [clojure.string :as s]
            [cascalog.api :refer :all]
            [cascalog.more-taps :refer [hfs-delimited]])
  (:gen-class))

;; Reads in a line of text and splits it on periods (note the escaped dot),
;; trimming whitespace and dropping empty fragments.
(defmapcatfn split [line]
  (remove empty? (map s/trim (s/split line #"\."))))

;; Find matching examples: emits the sentence only if it mentions
;; both "Jim" and "free man"; otherwise emits nothing.
(defmapcatfn filter-condition [sentence]
  (if (and (> (count (re-seq #"Jim" sentence)) 0)
           (> (count (re-seq #"free man" sentence)) 0))
    (list sentence)))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?filtered]
       ((hfs-delimited in :skip-header? false) ?line)
       (split ?line :> ?sentence)
       (filter-condition ?sentence :> ?filtered)))
6. Create another file, src/cascalog_pre_format/format_novel.clj (the filename must match the format-novel namespace), with the following contents:
(ns cascalog-pre-format.format-novel
  (:require [clojure.string :as s])
  (:gen-class))

(defn -main [file & args]
  ;; read the whole file
  (let [book (slurp file)
        ;; Replace every newline that is followed by a lowercase letter with a space
        ;; note the challenge is "I" in a sentence :) (or any other proper noun)
        book-stripped-newlines (s/join " " (s/split book #"\n(?=[a-z])"))
        ;; Split the string into a vector at each space (\x20) that follows a period
        book-sentences (s/split book-stripped-newlines #"(?<=\.)\x20")
        ;; join the vector into a single string delimited by newlines
        book-sentences-flat (s/join "\n" book-sentences)]
    ;; write the whole file
    (spit "./data/tuples.txt" book-sentences-flat)))
Let's transform the data file we downloaded so that each sentence sits on its own line. We'll do this in Clojure.
1. Run the following command:
lein run -m cascalog-pre-format.format-novel "data/AdventuresOfHuckleBerryFinn.txt"
2. We now have a file of tuples with one sentence on each line. Let's run a Hadoop query on it. From a command prompt in the cascalog-pre-format directory, enter the following:
rm -rf output
lein uberjar
hadoop jar target/query-novel.jar data/tuples.txt output/match
3. Check that you've got results:
ls output
You should see a match directory.
4. Now run this command to display the contents of the part-00000 file:
cat output/match/part-00000
You should see the following:
"_Now_, old Jim, you're a free man again, and I bet you won't ever be a slave
no more Jim; and then you kept Tom here so long with the butter in his hat
that you come near spiling the whole business, because the men come before we
was out of the cabin, and we had to rush, and they heard us and let drive at
us, and I got my share, and we dodged out of the path and let them go by, and
when the dogs come they warn't interested in us, but went for the most noise,
and we got our canoe, and made for the raft, and was all safe, and Jim was a
free man, and we done it all by ourselves, and _wasn't_ it bully, Aunty!"
So we just implemented a query to find the big idea of Huckleberry Finn (freedom for Jim).
Notice that we’ve been using a strict tab-delimited file as input to our Hadoop process. This is a Cascading convention, the assumption being that your file is line-oriented and consists of one or more columns.
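As a rough illustration of what "line-oriented with one or more columns" means, the following sketch splits tab-delimited lines into columns in plain Clojure. The sample rows and the line->columns helper are hypothetical and only stand in for what a delimited tap hands to a query; this is not how Cascading itself is implemented.

```clojure
(require '[clojure.string :as s])

;; Hypothetical sample: two log-style rows, columns separated by tabs.
(def sample-lines
  ["2015-06-01\tGET\t/index.html"
   "2015-06-01\tPOST\t/login"])

(defn line->columns
  "Splits one tab-delimited line into a vector of column values."
  [line]
  (s/split line #"\t"))

;; One vector of columns per input line.
(def rows (mapv line->columns sample-lines))

;; A file with one sentence per line is simply the single-column case.
(def single-column (line->columns "They hadn't stirred."))
```

Here rows holds two three-column vectors, and single-column holds one value, which is the shape the tuples.txt file in this chapter takes.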
This is a reasonable assumption for reading financial information or log files. But what if your data set wasn’t in this format? Suppose you had a binary input format, or a complex data structure representing a protein fold, or simply a novel with sentences spread across multiple lines.
To use the Cascading library with data like this, you need an extra step before the MapReduce job: preformatting your data into line-oriented records. At a high level, the process then becomes preformat, map, and reduce.
The particular thing to notice in this code is that we've used a regular Clojure function inside the Cascalog query at split to transform the data line by line. Notice the defmapcatfn for filter-condition. We've also used a Clojure function, filter-condition, to filter the Cascalog tuples so that only matching rows are returned.
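To see the split and filter logic without a Hadoop run, here is a minimal sketch that uses ordinary Clojure functions and mapcat at the REPL. The names split-on-periods and matches-jim-free-man and the sample lines are hypothetical stand-ins for the Cascalog operations; re-find is used in place of counting re-seq results, which should behave the same for this check.

```clojure
(require '[clojure.string :as s])

(defn split-on-periods
  "Splits a line on periods, trims each piece and drops empty ones."
  [line]
  (remove empty? (map s/trim (s/split line #"\."))))

(defn matches-jim-free-man
  "Returns a one-element list holding the sentence when it mentions both
  \"Jim\" and \"free man\", and nil otherwise (so mapcat emits nothing)."
  [sentence]
  (when (and (re-find #"Jim" sentence)
             (re-find #"free man" sentence))
    (list sentence)))

;; Hypothetical input lines, not taken from the novel.
(def lines ["Jim was a free man. We got our canoe."
            "Jim run off."])

;; mapcat mirrors what a mapcat operation does: zero or more outputs per input.
(def matches
  (mapcat matches-jim-free-man (mapcat split-on-periods lines)))
```

Because each function returns a sequence (possibly empty) per input, the same function can both transform and filter, which is why the filter in this recipe is also written as a mapcat operation.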
If you open the file from the data directory and scroll halfway down, you’ll see text that looks something like this:
she begun to cry, though I couldn't hear her, and her back was to me. I
slid out, and as I passed the dining-room I thought I'd make sure them
watchers hadn't seen me; so I looked through the crack, and everything
was all right. They hadn't stirred.
I slipped up to bed, feeling ruther blue, on accounts of the thing
playing out that way after I had took so much trouble and run so much
resk about it. Says I, if it could stay where it is, all right; because
This text formatting comes across as conventional to you and me but not to the computer, so it is worth spelling out that
the sentences all have their ordinary endings of a period (or question mark),
the sentences run across lines,
the paragraphs are broken up by two newline characters,
the sentences start with a capital letter, and
the lines that start with a capital letter after a full stop on the previous line are the start of a new sentence.
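The rules above can be tried out on a small sample. The sketch below assumes the same two regular expressions as the format-novel step (a newline followed by a lowercase letter, and a space preceded by a period); the preformat function and the shortened sample text are hypothetical, and s/replace is used here where the recipe uses split followed by join.

```clojure
(require '[clojure.string :as s])

;; Shortened sample in the same shape as the novel's text:
;; wrapped lines, with a blank line between paragraphs.
(def sample
  (str "so I looked through the crack, and everything\n"
       "was all right. They hadn't stirred.\n"
       "\n"
       "I slipped up to bed, feeling ruther blue, on accounts of the thing\n"
       "playing out that way."))

(defn preformat
  "Joins wrapped lines, then puts each sentence on its own line."
  [text]
  (let [joined    (s/replace text #"\n(?=[a-z])" " ")   ; newline + lowercase => same sentence
        sentences (s/split joined #"(?<=\.)\x20")]      ; space after a period => sentence break
    (s/join "\n" sentences)))

;; Drop the blank line left behind by the paragraph break.
(def output-lines
  (remove s/blank? (s/split-lines (preformat sample))))
```

With this sample, output-lines contains three sentences, one per line. A wrapped line that begins with a capital letter mid-sentence (such as "I") would not be joined, which is the limitation the code comment in the recipe points out.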