This chapter covers loading a data file into Cascalog.
It assumes you have Leiningen set up.
Hadoop is a batch processing system, so before it can process any data it must first load that data. This chapter shows how to do that from Cascalog.
So far we’ve been working with a data structure defined in memory. Now we’ll work with one that is defined in a file.
1. Create a new Leiningen project, cascalog-load-file, in your projects directory, and change to that directory:
lein new app cascalog-load-file
cd cascalog-load-file
2. Put the following in your project.clj file:
(defproject cascalog-load-file "0.1.0-SNAPSHOT"
  :description "Demo loading a file into Cascalog"
  :uberjar-name "cascalog-load-file.jar"
  :repositories {"conjars" "http://conjars.org/repo/"}
  :main cascalog-load-file.load-file
  :dependencies [[org.clojure/clojure "1.7.0-RC1"]
                 [cascading/cascading-hadoop2-mr1 "2.7.0"]
                 [cascalog/cascalog-core "2.1.1"]
                 [cascalog/cascalog-more-taps "2.1.1"]]
  :profiles {:provided
             {:dependencies
              [[org.apache.hadoop/hadoop-mapreduce-client-jobclient "2.7.0"]
               [org.apache.hadoop/hadoop-common "2.7.0"]]}})
3. Create a new directory, data, and inside it the file data/prices.txt with the following contents:
stock_symbol price
APPL 527.00
MSFT 26.74
YHOO 19.86
FB 28.76
AMZN 259.15
Ensure that there is a header row, that a single tab delimits the two columns, that no line has trailing spaces, and that the last line ends with a newline.
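Editors often turn tabs into spaces silently, so it can be safer to generate the file from the shell. The following is a minimal sketch, not part of the original recipe: the helper names write_prices and check_prices are hypothetical, and it assumes a POSIX shell with printf, awk, tail, and wc available.

```shell
# Hypothetical helper: write the sample file with explicit tabs and newlines.
write_prices() {
  out="${1:-data/prices.txt}"
  mkdir -p "$(dirname "$out")"
  {
    printf 'stock_symbol\tprice\n'
    # printf reuses the format for each pair of arguments.
    printf '%s\t%s\n' \
      APPL 527.00 \
      MSFT 26.74 \
      YHOO 19.86 \
      FB 28.76 \
      AMZN 259.15
  } > "$out"
}

# Hypothetical check: succeeds only if every line has exactly two
# tab-separated fields, no line ends in a space or tab, and the file
# ends with a newline.
check_prices() {
  file="${1:-data/prices.txt}"
  awk -F'\t' 'NF != 2 || /[ \t]$/ { bad = 1 } END { exit bad + 0 }' "$file" &&
    [ $(tail -c 1 "$file" | wc -l) -eq 1 ]
}

write_prices data/prices.txt &&
  check_prices data/prices.txt &&
  echo "data/prices.txt looks OK"
```

If check_prices fails on a hand-edited file, the usual culprit is a tab that was saved as spaces.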
4. Now put the following code into a new file called src/cascalog_load_file/load_file.clj:
(ns cascalog-load-file.load-file
  (:require [cascalog.api :refer :all]
            [cascalog.more-taps :refer [hfs-delimited]])
  (:gen-class))

(defn -main [in]
  (?<- (stdout)
       [?doc ?line]
       ((hfs-delimited in :skip-header? true) ?doc ?line)))
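You can see we’ve added a Cascalog tap library (cascalog.more-taps), and we’re referring to its hfs-delimited function to load the file. The :skip-header? true option tells the tap to ignore the header row, and the query binds the two columns of each remaining row to ?doc and ?line.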
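To build some intuition for the rows the query will receive, we can preview them from the shell before involving Hadoop at all. This is only a rough analogue under stated assumptions, not how Cascalog or Cascading actually reads the file: the function name preview_tuples is hypothetical, and it assumes a tab-delimited file with exactly one header line.

```shell
# Hypothetical preview: drop the header line, split each remaining line on
# tabs, and print the first two fields, one row per line.
preview_tuples() {
  tail -n +2 "$1" | awk -F'\t' '{ printf "%s\t%s\n", $1, $2 }'
}

# Preview the sample file if it exists in the current project.
if [ -f data/prices.txt ]; then
  preview_tuples data/prices.txt
fi
```

The rows printed here should match the ones the Cascalog job reports, which makes a mismatch in the file format easier to spot.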
To test the solution, run these steps:
1. Let’s compile this:
lein uberjar
2. Now run it with the following command:
hadoop jar target/cascalog-load-file.jar data/prices.txt
You should see the following result (along with other noise):
...
RESULTS
-----------------------
APPL 527.00
MSFT 26.74
YHOO 19.86
FB 28.76
AMZN 259.15
-----------------------
Great! We’ve shown we can load data from a file into Cascalog using a Cascalog Tap.