This chapter covers writing a data file out of Cascalog. It assumes you have Leiningen set up.
Hadoop is a batch processing system that writes the results of its computation to a data sink somewhere. Here, we’ll report the results out of Cascalog to a file.
1. Create a new Leiningen project, cascalog-file-output, in your projects directory, and change to that directory:
lein new app cascalog-file-output
cd cascalog-file-output
2. Ensure that the following is in your project.clj file:
(defproject cascalog-file-output "0.1.0-SNAPSHOT"
  :uberjar-name "cascalog-file-output.jar"
  :main cascalog-file-output.report-file
  :repositories {"conjars" "http://conjars.org/repo/"}
  :dependencies [[org.clojure/clojure "1.7.0-RC1"]
                 [cascading/cascading-hadoop2-mr1 "2.7.0"]
                 [cascalog/cascalog-core "2.1.1"]
                 [cascalog/cascalog-more-taps "2.1.1"]]
  :profiles {:provided
             {:dependencies
              [[org.apache.hadoop/hadoop-mapreduce-client-jobclient "2.7.0"]
               [org.apache.hadoop/hadoop-common "2.7.0"]]}})
3. Now copy the data file from your previous project into a new data directory in your current project:
mkdir data
cp ../cascalog-load-file/data/prices.txt data
4. Now create a new file, src/cascalog_file_output/report_file.clj, with the following contents:
(ns cascalog-file-output.report-file
  (:require [cascalog.logic.ops :as c]
            [cascalog.api :refer :all]
            [cascalog.more-taps :refer [hfs-delimited]])
  (:gen-class))

(defn -main [in out & args]
  ;; Subquery: read the delimited input file and parse each price string into a number.
  (let [price-list (<- [?price]
                       ((hfs-delimited in :skip-header? true) ?stock-symbol ?price-string)
                       (read-string ?price-string :> ?price))]
    ;; Average the prices and write the result to the out directory.
    (?<- (hfs-delimited out)
         [?avg]
         (price-list ?prices)
         (c/avg ?prices :> ?avg))))
In order to test the solution, follow these steps:
1. Make sure your command line is in the cascalog-file-output directory, then compile and run the project with the following commands:
rm -rf output
lein uberjar
hadoop jar target/cascalog-file-output.jar data/prices.txt output/average
You’ll notice that we deleted the output directory even though it shouldn’t have existed yet. Hadoop fails if the output directory already exists, and a first run does not always go right, so deleting it before each run is a good habit to get into.
If all went according to plan, you will not see a RESULTS section in the output.
2. Now look at the output. At the command line, run the following:
ls output/average
This should give the following result:
_SUCCESS part-00000
3. Now run the command:
cat output/average/part-00000
You should see the following result:
172.302
This is the result we expected.
You’ll recall our subquery pattern and building an average from the Introduction to Cascalog in Chapter 13. There are two new things here. The first is (hfs-delimited out), which turns out to be an almost drop-in replacement for the (stdout) we were using.
The other is the addition of the line that begins (read-string. This is needed because every item read from the file is a string, and the average function can’t operate on strings. So we use the Clojure reader (as used in the REPL) to transform each string into a Clojure object that represents a number.
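To see why the read-string step matters, the following minimal sketch reproduces the conversion outside Cascalog, in a plain Clojure REPL. The price strings are made-up values rather than rows from prices.txt, and average is a hypothetical helper standing in for c/avg. The sketch also shows clojure.edn/read-string, which only reads data and may be a safer choice when the input file is not trusted; the recipe itself uses clojure.core/read-string.

```clojure
(require '[clojure.edn :as edn])

;; Made-up sample values; fields in a delimited file arrive as strings.
(def price-strings ["10.5" "20.5"])

;; Hypothetical helper standing in for c/avg: the mean of a collection of numbers.
(defn average [xs]
  (/ (reduce + xs) (count xs)))

;; The Clojure reader turns each string into a number, as in the recipe.
(def prices (map read-string price-strings))

;; edn/read-string parses the same text but never evaluates code.
(def safe-prices (map edn/read-string price-strings))

(println (average prices))       ; prints 15.5
(println (average safe-prices))  ; prints 15.5
```

Calling (average price-strings) on the raw strings throws a ClassCastException, because + cannot add strings. That is the failure the read-string line avoids.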