16. Writing Out a Data File with Cascalog


In this chapter we cover writing a data file out of Cascalog.

Assumptions

In this chapter we assume you have Leiningen set up.

Benefits

This chapter shows how to apply the fact that Hadoop is a batch processing system that writes the results of its computation to a data sink. Here, the sink is a file.

The Recipe—Code

Now we’ll look at reporting the results out of Cascalog to a file.

1. Create a new Leiningen project cascalog-file-output in your projects directory, and change to that directory:

lein new app cascalog-file-output
cd cascalog-file-output

2. Ensure that the following is in your project.clj file:

(defproject cascalog-file-output "0.1.0-SNAPSHOT"
  :uberjar-name "cascalog-file-output.jar"
  :main cascalog-file-output.report-file
  :repositories  {"conjars" "http://conjars.org/repo/"}
  :dependencies [[org.clojure/clojure "1.7.0-RC1"]
                 [cascading/cascading-hadoop2-mr1 "2.7.0" ]
                 [cascalog/cascalog-core "2.1.1"]
                 [cascalog/cascalog-more-taps "2.1.1"]]
  :profiles {:provided
             {:dependencies
              [[org.apache.hadoop/hadoop-mapreduce-client-jobclient "2.7.0"]
               [org.apache.hadoop/hadoop-common "2.7.0"]]}})

3. Now copy the data file from your previous project into a new data directory in your current project:

mkdir data
cp ../cascalog-load-file/data/prices.txt data

4. Now create a new file src/cascalog_file_output/report_file.clj with the following contents:

(ns cascalog-file-output.report-file
  (:require [cascalog.logic.ops :as c]
            [cascalog.api :refer :all]
            [cascalog.more-taps :refer [hfs-delimited]])
  (:gen-class))

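(defn -main [in out & args]
  ;; Subquery: read each (symbol, price) row and parse the price into a number.
  (let [price-list (<- [?price]
                       ((hfs-delimited in :skip-header? true)
                        ?stock-symbol ?price-string)
                       (read-string ?price-string :> ?price))]
    ;; Run the query and write the average price to the `out` sink.
    (?<- (hfs-delimited out)
         [?avg]
         (price-list ?prices)
         (c/avg ?prices :> ?avg))))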
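To check your understanding of what the query computes without running Hadoop, the same logic can be sketched in plain Clojure. This is only an illustration, not part of the recipe: the sample rows and the `average-price` function are hypothetical, and the sketch assumes a header line followed by tab-separated symbol and price columns.

```clojure
(require '[clojure.string :as string])

;; Hypothetical sample rows; the real data/prices.txt is not shown here.
(def sample-lines
  ["symbol\tprice"
   "AAA\t10.0"
   "BBB\t20.0"
   "CCC\t30.0"])

(defn average-price
  "Skips the header line, parses the second tab-separated column of each
  remaining line as a double, and returns the mean."
  [lines]
  (let [prices (->> (rest lines)                           ; drop the header
                    (map #(second (string/split % #"\t"))) ; price column
                    (map #(Double/parseDouble %)))]        ; string -> number
    (/ (reduce + prices) (count prices))))

(average-price sample-lines) ; => 20.0
```

The Cascalog version does the same work, but the rows come from a tap and the aggregation runs as a Hadoop job.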

Testing the Solution

In order to test the solution, follow these steps:

1. Make sure your shell is in the cascalog-file-output directory, then compile and run the project with the following commands:

rm -rf output
lein uberjar
hadoop jar target/cascalog-file-output.jar data/prices.txt output/average

You’ll notice that we deleted the output directory even though it shouldn’t exist yet. Hadoop fails the job if the output directory already exists, and you may not get everything right on the first run, so deleting it beforehand is a good habit to get into.

If all went according to plan, you will not see a RESULTS section in the console output, because the result is now written to a file instead of to stdout.

2. We’ll go look at the output now. In the command line do the following:

ls output/average

This should give the following result:

_SUCCESS part-00000

3. Now run the command:

cat output/average/part-00000

You should see the following result:

172.302

This is the result we expected.
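If you would rather clear the output directory from Clojure than from the shell (the habit described in step 1), a helper along these lines could be used. This is a minimal sketch for the local filesystem only: `delete-recursively` is a hypothetical name, not part of Cascalog or Hadoop, and it does not handle HDFS paths.

```clojure
(require '[clojure.java.io :as io])

(defn delete-recursively
  "Hypothetical helper: deletes the local path and everything beneath it.
  Does nothing if the path does not exist."
  [path]
  (let [root (io/file path)]
    (when (.exists root)
      ;; file-seq lists each directory before its contents, so reversing
      ;; the sequence deletes files before the directories that hold them.
      (doseq [f (reverse (file-seq root))]
        (.delete f)))))
```

Calling `(delete-recursively "output")` before running the job would have the same effect as `rm -rf output` for a local run.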

Notes on the Recipe

You’ll recall our subquery pattern and building an average from the Introduction to Cascalog in Chapter 13. There are two new things here. The first is (hfs-delimited out), which is an almost drop-in replacement for the (stdout) sink we were using.

The other is the line that begins with (read-string. It is needed because every item read from the file is a string, and the average function can’t operate on strings. So we use the Clojure reader (the same one the REPL uses) to turn each price string into a Clojure object that represents a number.
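The conversion can be tried at the REPL outside of Cascalog. The sketch below shows what read-string returns for numeric strings, and then a stricter alternative. The `parse-price` helper is hypothetical and not part of the recipe. It uses clojure.edn/read-string, which reads data only, whereas clojure.core/read-string can evaluate code embedded in its input when `*read-eval*` is true, so the EDN reader is generally the safer choice for files you do not control.

```clojure
(require '[clojure.edn :as edn])

;; The reader turns numeric text into Clojure numbers.
(read-string "172.302")      ; => 172.302
(class (read-string "42"))   ; => java.lang.Long

(defn parse-price
  "Hypothetical helper: reads s as EDN and returns it only if it is a number."
  [s]
  (let [v (edn/read-string s)]
    (if (number? v)
      v
      (throw (ex-info "Not a numeric price" {:input s})))))

(parse-price "172.302")      ; => 172.302
```

A function like this could stand in for read-string in the query, provided it is defined in a namespace that is compiled into the uberjar.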

Conclusion

Now we’ve written a data file out of Cascalog. Next we’ll expand on this to structure a file to load into Cascalog.
