Initializing Cascalog and Hadoop for distributed processing

Hadoop began as an open source implementation of Google's MapReduce model, and much of its early development took place at Yahoo!. Since then, it has become one of the most widely tested and used systems for distributed processing.

Hadoop sits at the center of a larger ecosystem of complementary tools, including the Hadoop Distributed File System (HDFS) and Pig, a language for writing jobs that run on Hadoop.

One tool that makes working with Hadoop easier is Cascading. This provides a workflow-like layer on top of Hadoop that can make the expression of some data processing and analysis tasks much easier. Cascalog is a Clojure-idiomatic interface to Cascading and, ultimately, Hadoop.

This recipe will show you how to access and query data in Clojure sequences using Cascalog.

Getting ready

First, we have to list our dependencies in the Leiningen project.clj file:

(defproject distrib-data "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog "2.1.1"]
                 [org.slf4j/slf4j-api "1.7.7"]]
  :profiles {:dev
             {:dependencies
              [[org.apache.hadoop/hadoop-core "1.1.1"]]}})

Then, we'll require the namespaces that we'll use, including the clojure.string library:

(require '[clojure.string :as string])
(require '[cascalog.logic.ops :as c])
(use 'cascalog.api)

How to do it…

Most of this recipe is spent defining the data we'll query. For this, we will use a list of companions and actors from the British television program Doctor Who. The data is in a sequence of maps, so we'll need to transform it into several sequences of vectors, which is what Cascalog can access. One sequence will list the companions' lowercased given names, which we'll use as keys in the other data tables. Another will pair each name key with the companion's full name, and the last will map each companion's key to a doctor they tagged along with. We'll also define a list of the actors who played the Doctor and the years in which they played the role.

  1. At the time of writing this book, there are about 50 companions. I've only listed 10 here, but the examples might show the other companions. The full data is available in the source code. You can also download just the code to create this dataset from http://www.ericrochester.com/clj-data-analysis/data/companions.clj:
    (def input-data
      [{:given-name "Susan", :surname "Forman", :doctors [1]}
       {:given-name "Katarina", :surname nil, :doctors [1]}
       {:given-name "Victoria", :surname "Waterfield", :doctors [2]}
       {:given-name "Sarah Jane", :surname "Smith", :doctors [3 4 10]}
       {:given-name "Romana", :surname nil, :doctors [4]}
       {:given-name "Kamelion", :surname nil, :doctors [5]}
       {:given-name "Rose", :surname "Tyler", :doctors [9 10]}
       {:given-name "Martha", :surname "Jones", :doctors [10]}
       {:given-name "Adelaide", :surname "Brooke", :doctors [10]}
       {:given-name "Craig", :surname "Owens", :doctors [11]}])
    
    (def companion (map string/lower-case
                        (map :given-name input-data)))
    (def full-name
      (map (fn [{:keys [given-name surname]}]
             [(string/lower-case given-name)
              (string/trim
                (string/join \space [given-name surname]))])
           input-data))
    (def doctor
      (mapcat #(map (fn [d] [(string/lower-case (:given-name %)) d])
                    (:doctors %))
              input-data))
    
    (def actor
      [[1 "William Hartnell" "1963–66"]
       [2 "Patrick Troughton" "1966–69"]
       [3 "Jon Pertwee" "1970–74"]
       [4 "Tom Baker" "1974–81"]
       [5 "Peter Davison" "1981–84"]
       [6 "Colin Baker" "1984–86"]
       [7 "Sylvester McCoy" "1987–89, 1996"]
       [8 "Paul McGann" "1996"]
       [9 "Christopher Eccleston" "2005"]
       [10 "David Tennant" "2005–10"]
       [11 "Matt Smith" "2010–present"]])
  2. We'll explain the syntax for the query in more detail in the How it works... section. In the meantime, let's just dive in:
    (?<- (stdout) [?companion] (companion ?companion))
  3. When you execute this, you might see a lot of logging messages from Hadoop. Towards the end, you should see this:
    RESULTS
    -----------------------
    susan
    barbara
    ian
    vicki
    steven
    …
  4. You can also query the other tables, as follows:
    (?<- (stdout) [?name] (full-name _ ?name))
    …
    RESULTS
    -----------------------
    Susan Forman
    Barbara Wright
    Ian Chesterton
    Vicki
    Steven Taylor
    …
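
To make the shape of the tables from step 1 concrete, the following minimal sketch applies the same transformations to two of the records and shows what each derived sequence contains. It needs only clojure.string, not Cascalog. The sample-* names are introduced here purely for illustration and are not part of the recipe's dataset.

```clojure
(require '[clojure.string :as string])

;; Two records copied from input-data; sample-data is a name
;; introduced only for this sketch.
(def sample-data
  [{:given-name "Sarah Jane", :surname "Smith", :doctors [3 4 10]}
   {:given-name "Romana", :surname nil, :doctors [4]}])

;; One-column table of lowercased name keys.
(def sample-companion
  (map (comp string/lower-case :given-name) sample-data))
;; => ("sarah jane" "romana")

;; Two-column table: name key, full display name.
;; string/trim removes the trailing space left by a nil surname.
(def sample-full-name
  (map (fn [{:keys [given-name surname]}]
         [(string/lower-case given-name)
          (string/trim (string/join \space [given-name surname]))])
       sample-data))
;; => (["sarah jane" "Sarah Jane Smith"] ["romana" "Romana"])

;; Two-column table: name key, doctor number (one row per doctor).
(def sample-doctor
  (mapcat (fn [{:keys [given-name doctors]}]
            (map (fn [d] [(string/lower-case given-name) d]) doctors))
          sample-data))
;; => (["sarah jane" 3] ["sarah jane" 4] ["sarah jane" 10] ["romana" 4])
```

Each row is a plain vector, so a companion with several doctors simply appears in several rows of the last table.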

How it works…

The structure of query statements is not hard to understand. Let's break one query statement apart:

(?<- (stdout) [?name] (full-name _ ?name))

The ?<- operator creates a query and executes it. It's a combination of the <- macro, which creates a query from output variables and predicates, and the ?- function, which executes a query to a sink.
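
As a rough sketch of that split, the same query can be built with <- and run with ?- in two separate steps. The two-row full-name table and the name-query var below are stand-ins introduced for this example, and the snippet assumes Cascalog is on the classpath as set up in the Getting ready section.

```clojure
(use 'cascalog.api)

;; A two-row stand-in for the full-name table (illustrative only).
(def full-name
  [["susan" "Susan Forman"]
   ["katarina" "Katarina"]])

;; <- only builds the query; nothing is executed yet.
(def name-query
  (<- [?name] (full-name _ ?name)))

;; ?- executes the query and writes its results to the sink.
(?- (stdout) name-query)
```

Keeping the query in a var like this makes it possible to define it once and run it against different sinks.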

(?<- (stdout) [?name] (full-name _ ?name))

The first parameter, (stdout), is a Cascading tap sink. This is a destination for the data. If a query produces a lot of output, dumping it to the console isn't a good idea, and you can send it to a file instead. Since there's not much data here, we'll just write it to the screen.
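
For larger outputs, a text-file sink can stand in for (stdout). The sketch below assumes Cascalog's hfs-textline tap. The output path is a hypothetical location chosen for this example, and :sinkmode :replace is used so that the query can be rerun without an existing output directory causing an error. The two-row full-name table is again only a stand-in.

```clojure
(use 'cascalog.api)

;; A two-row stand-in for the full-name table (illustrative only).
(def full-name
  [["susan" "Susan Forman"]
   ["katarina" "Katarina"]])

;; Hypothetical output directory; adjust for your environment.
(def output-path "/tmp/companion-names")

;; Same query as before, but the sink is a text-file tap
;; instead of the console.
(?<- (hfs-textline output-path :sinkmode :replace)
     [?name]
     (full-name _ ?name))
```

Hadoop writes the results as one or more part files inside that directory rather than as a single file.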

(?<- (stdout) [?name] (full-name _ ?name))

The second parameter, [?name], is a vector of output variables. Each name here must also occur in the predicates that follow.

(?<- (stdout) [?name] (full-name _ ?name))

The remaining arguments form the list of predicates. In this example, there's only one predicate, (full-name _ ?name), which queries the full-name table. It doesn't care about the values in the first column, so it uses an underscore (_) as a placeholder. Using an underscore as a variable name in this way is a convention in Clojure and similar languages for values that you want to ignore. The values in the second column are bound to the name ?name, which also appears in the vector of output variables.
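
To see the difference the placeholder makes, the sketch below binds the first column to a variable instead of ignoring it and adds that variable to the output vector. It uses ??<-, which returns the results as a sequence of tuples rather than writing them to a sink, so they can be inspected at the REPL. The two-row table and the ?key variable name are chosen just for this example.

```clojure
(use 'cascalog.api)

;; A two-row stand-in for the full-name table (illustrative only).
(def full-name
  [["susan" "Susan Forman"]
   ["katarina" "Katarina"]])

;; Ignore the first column: each result tuple holds only a name.
(def names-only
  (??<- [?name] (full-name _ ?name)))

;; Bind both columns: each result tuple holds a key and a name.
(def keys-and-names
  (??<- [?key ?name] (full-name ?key ?name)))
```

The order of the returned tuples isn't guaranteed, so compare results as sets when checking them.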

Of course, working with in-memory data isn't that useful in itself, though it's good for development and debugging. Later, we'll see how to connect to a data file, and the query syntax will be exactly the same.

See also

For information on how to access data that's not held in memory, check out the following recipes:

  • Distributing data with Apache HDFS
  • Parsing CSV files with Cascalog