16. Writing Out a Data File with Cascalog

In this chapter we cover writing a data file out of Cascalog.

Assumptions

In this chapter we assume you have Leiningen set up.

Benefits

The benefit of this chapter is seeing in practice that Hadoop is a batch processing system: it writes the results of its computation to a data sink. Here, that sink is a file.

The Recipe—Code

Now we’ll look at reporting the results out of Cascalog to a file.

1. Create a new Leiningen project cascalog-file-output in your projects directory, and change to that directory:

lein new app cascalog-file-output
cd cascalog-file-output

2. Ensure that the following is in your project.clj file:

(defproject cascalog-file-output "0.1.0-SNAPSHOT"
  :uberjar-name "cascalog-file-output.jar"
  :main cascalog-file-output.report-file
  :repositories {"conjars" "http://conjars.org/repo/"}
  :dependencies [[org.clojure/clojure "1.7.0-RC1"]
                 [cascading/cascading-hadoop2-mr1 "2.7.0"]
                 [cascalog/cascalog-core "2.1.1"]
                 [cascalog/cascalog-more-taps "2.1.1"]]
  :profiles {:provided
             {:dependencies
              [[org.apache.hadoop/hadoop-mapreduce-client-jobclient "2.7.0"]
               [org.apache.hadoop/hadoop-common "2.7.0"]]}})

3. Now copy the data file from your previous project into a new data directory in your current project:

mkdir data
cp ../cascalog-load-file/data/prices.txt data

4. Now create a new file src/cascalog_file_output/report_file.clj with the following contents:

(ns cascalog-file-output.report-file
  (:require [cascalog.logic.ops :as c]
            [cascalog.api :refer :all]
            [cascalog.more-taps :refer [hfs-delimited]])
  (:gen-class))

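(defn -main [in out & args]
  ;; Subquery: read the delimited input, skip the header row, and
  ;; convert each price string into a number.
  (let [price-list (<- [?price]
                       ((hfs-delimited in :skip-header? true)
                        ?stock-symbol ?price-string)
                       (read-string ?price-string :> ?price))]
    ;; Run the query and write the average to the output tap.
    (?<- (hfs-delimited out)
         [?avg]
         (price-list ?prices)
         (c/avg ?prices :> ?avg))))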

Testing the Solution

In order to test the solution, follow these steps:

1. Make sure your command line is in the cascalog-file-output directory, then compile and run the project with the following commands:

rm -rf output
lein uberjar
hadoop jar target/cascalog-file-output.jar data/prices.txt output/average

You’ll notice that we deleted the output directory even though it shouldn’t exist yet. Hadoop refuses to run a job whose output directory already exists, and you may not get everything right on the first attempt, so clearing it before each run is a good habit to get into.

If all went according to plan, you will not see a RESULTS section in the console output, because the results are now written to the output file instead of standard output.

2. We’ll go look at the output now. In the command line do the following:

ls output/average

This should give the following result:

_SUCCESS    part-00000

3. Now run the command:

cat output/average/part-00000

You should see the following result:

172.302

This is the result we expected.
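
One way to cross-check a figure like this without Hadoop is to compute the same average in plain Clojure. The following is only a sketch: average-price is a hypothetical helper, it assumes the file is tab-delimited with a single header row, and the sample rows are made up for illustration rather than taken from prices.txt.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn average-price
  "Averages the second tab-separated column of the file at path,
  skipping the first (header) line."
  [path]
  (with-open [rdr (io/reader path)]
    (let [prices (->> (line-seq rdr)
                      rest                                 ; skip the header row
                      (remove str/blank?)                  ; ignore empty lines
                      (map #(second (str/split % #"\t")))  ; take the price column
                      (map read-string)                    ; string -> number
                      doall)]                              ; realise before the reader closes
      (/ (reduce + prices) (count prices)))))

;; Hypothetical sample data written to a temporary file.
(def sample-file (java.io.File/createTempFile "prices" ".txt"))
(spit sample-file "symbol\tprice\nAAA\t10.5\nBBB\t20.25\nCCC\t30\n")

(average-price sample-file)
;; => 20.25
```

Calling (average-price "data/prices.txt") from a REPL started in the project directory should give a value close to the one in part-00000, although the last digits may differ depending on how each side rounds.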

Notes on the Recipe

You’ll recall our subquery pattern and building an average from the Introduction to Cascalog in Chapter 13. There are two new things here. The first is the (hfs-delimited out), which turns out to be an almost drop-in replacement for the (stdout) we were using.

The other is the addition of the line that begins (read-string. It is needed because every item read from the file is a string, and the average function can’t operate on strings. So we use the Clojure reader (the same one the REPL uses) to turn each price string into a Clojure object that represents a number.
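
The conversion is easy to try at a REPL. The sketch below uses made-up strings in place of the ?price-string field. It also shows clojure.edn/read-string, which parses the same numeric literals but never evaluates code, so it is the more cautious choice when the input file is not fully trusted. The parse-price helper is hypothetical and not part of the recipe.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical field values, standing in for ?price-string.
(read-string "172.5")          ; a double
(read-string "30")             ; a long
(class (read-string "172.5"))  ; java.lang.Double

;; The same literal parsed with the EDN reader, which never evaluates code.
(edn/read-string "172.5")

(defn parse-price
  "Parses a numeric string such as \"172.5\" into a Clojure number."
  [s]
  (edn/read-string s))
```

Since the recipe already passes read-string to the query as an ordinary function, a helper like parse-price could presumably be swapped in the same way, as (parse-price ?price-string :> ?price).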

Conclusion

Now we’ve written a data file out of Cascalog. Next we’ll expand on this to structure a file to load into Cascalog.
