14. Cascalog and Hadoop

In this chapter we cover compiling a Cascalog 2.0 program to a jar file and running it in Hadoop 2.0.

Assumptions

In this chapter we assume the following:

You have Leiningen set up.

You have a projects directory where you create new Leiningen projects.

You know how to add a directory to your environment variable PATH so that it can be accessed from the command line.

Benefits

Hadoop can be a hosted service inside your company or through an external vendor such as Amazon or IBM. At its simplest, running a Hadoop application means supplying a jar. This chapter shows you how to package your program to that standard and then walks you through running it.

The Recipe—Code

To keep things simple, we’ll take our existing query from the previous chapter and put it as-is into a jar file that we’ll feed into the Hadoop command line. To do that, we’ll set up a new Leiningen project.

1. Create a new Leiningen project cascalog-jar-demo in your projects directory, and change to that directory:

lein new app cascalog-jar-demo
cd cascalog-jar-demo

2. Modify project.clj to include the following:

(defproject cascalog-jar-demo "0.1.0-SNAPSHOT"
  :source-paths ["src"]
  :main cascalog_jar_demo.price_average
  :uberjar-name "cascalog-jar-demo.jar"
  :repositories  {"conjars" "http://conjars.org/repo/"}
  :dependencies [[org.clojure/clojure "1.7.0-RC1"]
                 [cascading/cascading-hadoop2-mr1 "2.7.0" ]
                 [cascalog/cascalog-core "2.1.1"]]
  :profiles {:provided
             {:dependencies
              [[org.apache.hadoop/hadoop-mapreduce-client-jobclient "2.7.0"]
               [org.apache.hadoop/hadoop-common "2.7.0"]]}})

3. Create a new file at the following location src/cascalog_jar_demo/price_average.clj with the following contents:

(ns cascalog_jar_demo.price_average
  (:require [cascalog.logic.ops :as c]
            [cascalog.api :refer :all])
  (:gen-class))

(def prices
  [;; [stock-symbol price]
   ["APPL" 527.00]
   ["MSFT" 26.74]
   ["YHOO" 19.86]
   ["FB" 28.76]
   ["AMZN" 259.15]])

(defn -main []
  (let [price-list (<- [?price]
                       (prices ?stock-symbol ?price))]
    (?<- (stdout)
         [?avg]
         (price-list ?prices)
         (c/avg ?prices :> ?avg))))

4. Before we start, we’ll make things easier for ourselves by limiting the output through log4j settings. Create a file called resources/log4j.properties with the following contents:

log4j.rootLogger=WARN, A1
log4j.logger.user=DEBUG
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d %-5p %c: %m%n

This is important. If you do not set up your file this way, you’ll get lots of log noise in your output that obscures the results you’re looking for.
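The query in step 3 averages the five values in prices. As a cross-check that involves neither Cascalog nor Hadoop, the same arithmetic can be sketched in shell with awk. The function name average_prices is a hypothetical helper introduced for illustration, and the numbers are copied from the prices definition in price_average.clj.

```shell
# Hypothetical helper: average the numbers passed as arguments,
# printing the result to three decimal places.
average_prices() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.3f\n", sum / NR }'
}

# Prices copied from the prices definition in price_average.clj.
average_prices 527.00 26.74 19.86 28.76 259.15   # prints 172.302
```

The five prices sum to 861.51, so dividing by 5 gives 172.302, which is the figure the Cascalog query is expected to report.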

Testing the Solution

To test our Cascalog program in Hadoop, run the following steps:

1. First, ensure it all works by running it. Execute the following on a command line in the project directory:

lein run

You should see the following results:

RESULTS
-----------------------
172.302
-----------------------

2. Generate the jar with Leiningen:

lein uberjar

In the output you should see the following line:

Created /.../<projects>/cascalog-jar-demo/target/cascalog-jar-demo.jar

3. You can check that it has been created correctly by changing a copy of the generated jar to a zip on the command line and opening it. The following shows how it appears on a Mac:

cd target
cp cascalog-jar-demo-0.1.0-SNAPSHOT.jar cascalog-jar-demo-0.1.0-SNAPSHOT.zip
open cascalog-jar-demo-0.1.0-SNAPSHOT.zip
cd cascalog-jar-demo-0.1.0-SNAPSHOT
cd cascalog_jar_demo/
ls

You should see something like the following:

core.clj
price_average$_main.class
price_average$loading__4784__auto__.class
price_average.class
price_average.clj
price_average__init.class

4. Also check that

.../<projects>/cascalog-jar-demo/target/cascalog-jar-demo-0.1.0-SNAPSHOT/META-INF/MANIFEST.MF

contains the line:

Main-Class: cascalog_jar_demo.price_average

5. If you don’t have all of the above, then you have a typo in your configuration somewhere. In particular, check that the dashes and underscores in the names are correct.

6. Run the new jar in Hadoop with the following command:

hadoop jar target/cascalog-jar-demo.jar

You should get the following result (along with other noise):

...
RESULTS
-----------------------
172.302
-----------------------
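The checks in steps 3 to 5 come down to names agreeing once dashes and underscores are accounted for. The following is a minimal sketch of that comparison, not part of the recipe itself. The helpers main_class_from_manifest and normalize_name are hypothetical, and the sample manifest text is made up for illustration rather than copied from a real build.

```shell
# Hypothetical helper: print the Main-Class value from manifest text on stdin.
# tr removes carriage returns in case the manifest uses CRLF line endings.
main_class_from_manifest() {
  tr -d '\r' | sed -n 's/^Main-Class: *//p'
}

# Hypothetical helper: replace dashes with underscores so that names written
# either way compare equal.
normalize_name() {
  printf '%s\n' "$1" | sed 's/-/_/g'
}

# Illustrative manifest text, not taken from a real build.
sample_manifest='Manifest-Version: 1.0
Main-Class: cascalog_jar_demo.price_average'

found=$(printf '%s\n' "$sample_manifest" | main_class_from_manifest)
# The :main entry from project.clj.
expected=$(normalize_name 'cascalog_jar_demo.price_average')

if [ "$(normalize_name "$found")" = "$expected" ]; then
  echo "Main-Class matches :main"
else
  echo "Main-Class mismatch: $found"
fi
```

Against a real build, and assuming the unzip tool is installed, the manifest text could be supplied with unzip -p target/cascalog-jar-demo.jar META-INF/MANIFEST.MF piped into main_class_from_manifest, which avoids copying the jar to a zip file first.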

Conclusion

Now we’ve run our Cascalog program in Hadoop.

Postscript—Setting Up Hadoop on a Mac

These are the steps for getting Hadoop running on your Mac.

1. You’ll need to download and install Hadoop. You can download it by following the link from here (and select a mirror if required):

http://www.apache.org/dyn/closer.cgi/hadoop/common/

On a Mac you’re looking for a file on that page similar to this (with a later version):

hadoop-2.7.0.tar.gz

2. Next, expand the downloaded file and copy the resulting directory to a central location. The Mac should expand the .gz automatically on download or when you double-click it. Copy the resulting directory to /usr/local on the file system. The final result should be a directory that looks like

/usr/local/hadoop-2.7.0

3. To make Hadoop available on the command line, add its bin directory to your PATH. On a Mac, do this by modifying your ~/.profile file and adding two lines similar to the following:

#Hadoop
export PATH=/usr/local/hadoop-2.7.0/bin:$PATH

4. Now reload and check your .profile by running the following command:

source ~/.profile
echo $PATH

And ensure your new entry is in there. If it is not, you may need to check that the other entries in .profile are correct.

5. You can test that this has worked by opening a new command prompt and, without changing directories, entering the following command:

hadoop version

You should get a result like this:

Hadoop 2.7.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
Compiled by jenkins on 2015-04-10T18:40Z
Compiled with protoc 2.5.0
From source with checksum a9e90912c37a35c3195d23951fd18f
This command was run using /usr/local/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0.jar

If you don’t get this result, you haven’t set it up correctly. Recheck the steps above, and also ensure you followed the setting-up Java steps in the initial Leiningen set-up presented in Chapter 1.
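Checking a long PATH by eye, as in step 4, is easy to get wrong. A minimal sketch of an automated check follows. The function path_contains is a hypothetical helper, and the directory is the example install location used in the steps above, so adjust it if your version or location differs.

```shell
# Hypothetical helper: succeed if the directory given as $1 is an entry in PATH.
# Wrapping both sides in colons makes the match apply to whole entries only.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

if path_contains /usr/local/hadoop-2.7.0/bin; then
  echo "Hadoop bin directory is on PATH"
else
  echo "Hadoop bin directory is missing from PATH"
fi
```

Matching on whole entries means a similar directory such as /usr/local/hadoop-2.7.0/bin-old would not be mistaken for the real one.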

Postscript—Setting Up Hadoop on a Windows Machine

1. Confirm you have Java installed by following the steps at the end of the Leiningen installation in Chapter 1.

2. Go to the page http://hadoop.apache.org/releases.html and download Hadoop (at the time of writing, the latest was 2.7.0). You’ll need to choose a link that says “binary.”

3. Expand the download so it is a directory. (You can expand .gz files using 7-zip http://www.7-zip.org/.)

4. Copy the directory hadoop-2.7.0 to the c:\util directory you created when you installed Leiningen.

5. Similarly to installing Tomcat, right-click My PC to the left of a Windows Explorer window, and select Properties.

6. Click Advanced System Settings.

7. Click the Environment Variables button.

8. Under System Variables, select Path and click the Edit button.

9. Add the path of the bin directory inside the Hadoop directory you have expanded, for example ;C:\util\hadoop-2.7.0\bin;.

10. Click OK to close the Edit dialog.

11. Click OK to close the Environment Variables window.

12. Click OK to close the System Properties window.

13. Open a new command prompt (old ones will be stale) and enter the command

hadoop version

Note that there is no dash before the version.

You should expect a result similar to:

Hadoop 2.7.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
Compiled by jenkins on 2015-04-10T18:40Z
Compiled with protoc 2.5.0
From source with checksum a9e90912c37a35c3195d23951fd18f
This command was run using /C:/Temp/util/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0.jar

If you didn’t get this, check that the JDK is installed correctly, as described in the Leiningen installation in Chapter 1, and compare the steps here with the steps for setting up Windows environment variables in the Tomcat installation in Chapter 2.

In addition, be aware that there are a number of bugs in the Windows installation of Hadoop as of the time of writing.

It appears that the maintainers of Hadoop assume Java is still installed at the top level of the C: drive (as it was with Java 1.4) rather than in the C:\Program Files directory. Because that path contains a space, the unquoted references to JAVA_HOME in the script fail.

To correct this, you’ll need to make three changes in the file hadoop-2.7.0\libexec\hadoop-config.cmd.

1. On line 116, change this:

if not exist %JAVA_HOME%\bin\java.exe (

to this:

if not exist "%JAVA_HOME%\bin\java.exe" (

2. On line 122, change this:

set JAVA=%JAVA_HOME%\bin\java

to this:

set JAVA="%JAVA_HOME%\bin\java"

3. On line 195, after the line

for /f "delims=" %%A in ('%JAVA% -Xmx32m %HADOOP_JAVA_PLATFORM_OPTS% -classpath "%CLASSPATH%" org.apache.hadoop.util.PlatformName') do set JAVA_PLATFORM=%%A

add the line

set JAVA_PLATFORM=%JAVA% -Xmx32m %HADOOP_JAVA_PLATFORM_OPTS% -classpath "%CLASSPATH%" org.apache.hadoop.util.PlatformName

That should do it.
