15. Loading a Data File into Cascalog

In this chapter we cover loading a data file into Cascalog.

Assumptions

In this chapter we assume you have Leiningen set up.

Benefits

Hadoop is a batch processing system: before it can process data, it must load it. After working through this chapter you will be able to load data from a file into a Cascalog query.

The Recipe—Code

So far we’ve been working with a data structure defined in memory. Now we’ll work with one that is defined in a file.

1. Create a new Leiningen project cascalog-load-file in your projects directory, and change to that directory:

lein new app cascalog-load-file
cd cascalog-load-file

2. Put the following in your project.clj file:

(defproject cascalog-load-file "0.1.0-SNAPSHOT"
  :description "Demo loading a file into Cascalog"
  :uberjar-name "cascalog-load-file.jar"
  :repositories  {"conjars" "http://conjars.org/repo/"}
  :main cascalog-load-file.load-file
  :dependencies [[org.clojure/clojure "1.7.0-RC1"]
                 [cascading/cascading-hadoop2-mr1 "2.7.0"]
                 [cascalog/cascalog-core "2.1.1"]
                 [cascalog/cascalog-more-taps "2.1.1"]]
  :profiles {:provided
             {:dependencies
              [[org.apache.hadoop/hadoop-mapreduce-client-jobclient "2.7.0"]
               [org.apache.hadoop/hadoop-common "2.7.0"]]}})

3. Create a new directory named data, and inside it create the file data/prices.txt with the following contents:

stock_symbol     price
APPL             527.00
MSFT             26.74
YHOO             19.86
FB               28.76
AMZN             259.15

Make sure that the file has a header row, that exactly one tab character separates the two columns (the listing above shows the gap as aligned whitespace), that no line has trailing spaces, and that the last line ends with a newline.
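
Because many editors silently turn tabs into spaces, it can be safer to generate the file from the command line. The following is a minimal sketch, assuming a POSIX shell with printf and awk available; the awk check is our own addition and not part of the recipe. It writes the same six rows with a real tab between the columns, then confirms that every line has exactly two tab-separated fields.

```shell
# Create the data directory if it does not exist yet.
mkdir -p data

# printf reuses the format for each pair of arguments, so every row is
# written as: first column, one tab, second column, newline.
printf '%s\t%s\n' \
  stock_symbol price \
  APPL 527.00 \
  MSFT 26.74 \
  YHOO 19.86 \
  FB 28.76 \
  AMZN 259.15 \
  > data/prices.txt

# Sanity check (an assumption of this sketch, not a Cascalog requirement):
# fail if any line does not split into exactly two fields on a tab.
awk -F '\t' 'NF != 2 { bad = 1 } END { exit bad + 0 }' data/prices.txt \
  && echo "data/prices.txt looks OK"
```

If the check prints nothing, open the file and look for lines where spaces have replaced the tab.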

4. Now put the following code into a new file called src/cascalog_load_file/load_file.clj:

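(ns cascalog-load-file.load-file
  (:require [cascalog.api :refer :all]
            [cascalog.more-taps :refer [hfs-delimited]])
  (:gen-class))

;; in is the path to the input file, passed on the command line.
(defn -main [in]
  ;; Run the query and print each result row to standard output.
  (?<- (stdout)
       [?symbol ?price]
       ;; Read the delimited file, skipping its header row.
       ((hfs-delimited in :skip-header? true) ?symbol ?price)))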

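You can see we’ve added a Cascalog Tap library (cascalog.more-taps), and we’re referring to its hfs-delimited function to load the file. The :skip-header? true option tells the tap to skip the header row, so only the data rows reach the query.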

Testing the Solution

To test the solution, run these steps:

1. Build the uberjar:

lein uberjar

2. Now run it with the following command:

hadoop jar target/cascalog-load-file.jar data/prices.txt

You should see the following result among Hadoop’s log output:

...
RESULTS
-----------------------
APPL    527.00
MSFT    26.74
YHOO    19.86
FB      28.76
AMZN    259.15
-----------------------

Great! We’ve shown we can load data from a file into Cascalog using a Cascalog Tap.

Conclusion

In this chapter we’ve loaded a file into Cascalog.
