Other programming abstractions

Hadoop is not only extended with additional functionality; there are also tools that provide entirely different paradigms for writing the code that processes your data within Hadoop.

Pig

We mentioned Pig (http://pig.apache.org) in Chapter 8, A Relational View on Data with Hive, and won't say much else about it here. Just remember that it is available and may be useful if you have processes or people for whom a data flow definition of the Hadoop processes is a more intuitive or better fit than writing raw MapReduce code or HiveQL scripts. Remember that the major difference is that Pig is an imperative language (it defines how the process will be executed), while Hive is more declarative (defines the desired results but not how they will be produced).
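To make the imperative versus declarative distinction concrete, here is a minimal sketch that computes the same result in both styles. It is neither Pig Latin nor HiveQL: plain Python steps stand in for a Pig-style data flow, and a SQL query run through Python's built-in sqlite3 module stands in for a Hive-style query. The sample records and the function names are assumptions made purely for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (name, bytes_transferred) records.
records = [("alice", 120), ("bob", 30), ("alice", 80), ("carol", 5), ("bob", 70)]


def data_flow_style(rows, min_bytes):
    """Spell out each processing step in order, as a data flow script would."""
    # Step 1: keep only the transfers at or above the threshold.
    filtered = [(name, size) for name, size in rows if size >= min_bytes]
    # Step 2: group the remaining records by name.
    grouped = defaultdict(list)
    for name, size in filtered:
        grouped[name].append(size)
    # Step 3: total each group and order the output by name.
    return sorted((name, sum(sizes)) for name, sizes in grouped.items())


def declarative_style(rows, min_bytes):
    """Describe the desired result and let the query engine choose the steps."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE transfers (name TEXT, size INTEGER)")
        conn.executemany("INSERT INTO transfers VALUES (?, ?)", rows)
        query = (
            "SELECT name, SUM(size) FROM transfers "
            "WHERE size >= ? GROUP BY name ORDER BY name"
        )
        return conn.execute(query, (min_bytes,)).fetchall()
    finally:
        conn.close()
```

Both functions return the same per-name totals. The first fixes the order of the filter, group, and sum steps in the code itself, while the second states only what the result should contain.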

Cascading

Cascading is not an Apache project but is open source and is available at http://www.cascading.org. While Hive and Pig effectively define different languages with which to express data processing, Cascading provides a set of higher-level abstractions.

Instead of thinking about how multiple MapReduce jobs may process and share data, with Cascading the model is a data flow built from pipes, joiners, taps, and similar constructs. These are assembled programmatically (the core API was originally Java, but there are numerous other language bindings), and Cascading manages the translation, deployment, and execution of the workflow on the cluster.
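The idea of a programmatically built flow can be sketched in a few lines. The following is not the Cascading API: the ListTap, Pipe, and run_flow names are hypothetical stand-ins, and everything runs in memory rather than being translated into jobs on a cluster. It only illustrates the shape of the model, in which a source tap feeds a pipe of chained operations whose output lands in a sink tap.

```python
from collections import defaultdict


class ListTap:
    """Hypothetical stand-in for a tap: a place to read tuples from or write them to."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])


class Pipe:
    """Hypothetical pipe: an ordered chain of operations over a stream of tuples."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def each(self, fn):
        # fn maps one input tuple to a list of zero or more output tuples.
        self.steps.append(lambda rows: [out for row in rows for out in fn(row)])
        return self

    def group_by(self, key_fn, agg_fn):
        # Group tuples by key, then reduce each group to a single value.
        def step(rows):
            groups = defaultdict(list)
            for row in rows:
                groups[key_fn(row)].append(row)
            return [(key, agg_fn(members)) for key, members in sorted(groups.items())]

        self.steps.append(step)
        return self


def run_flow(source, pipe, sink):
    """Connect a source tap, a pipe, and a sink tap, then run the steps in order."""
    rows = source.rows
    for step in pipe.steps:
        rows = step(rows)
    sink.rows = list(rows)
    return sink


# A word count expressed as a flow definition rather than as explicit jobs.
source = ListTap(["the quick fox", "the lazy dog"])
sink = ListTap()
word_count = (
    Pipe("wordcount")
    .each(lambda line: line.split())       # split each line into words
    .group_by(lambda word: word, len)      # count the occurrences of each word
)
run_flow(source, word_count, sink)
```

The point of the design is that the flow is described once as a chain of operations, and a separate runner decides how to execute it. In this sketch the runner is a simple loop; in Cascading that role is played by the framework, which plans and runs the work on the cluster.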

If you want a higher-level interface to MapReduce and the language-based style of Pig and Hive doesn't suit, the programmatic model of Cascading may be what you are looking for.
