Other Apache projects

Whether you use a bundled distribution or stick with the base Apache Hadoop download, you will encounter many references to other, related Apache projects. We have covered Hive, Sqoop, and Flume in this book; we'll now highlight some of the others.

Note that this coverage aims to point out the highlights (from my perspective) and to give a taste of the wide range of projects available. As before, keep an eye out; new projects are launching all the time.

HBase

Perhaps the most popular Apache Hadoop-related project that we didn't cover in this book is HBase; its homepage is at http://hbase.apache.org. Based on the BigTable model of data storage publicized by Google in an academic paper (sound familiar?), HBase is a non-relational data store that sits atop HDFS.

Whereas both MapReduce and Hive tasks focus on batch-like data access patterns, HBase instead seeks to provide very low-latency access to data. Consequently, unlike those technologies, HBase can directly support user-facing services.

The HBase data model is not the relational approach we saw used in Hive and in relational databases generally. Instead, it is a schemaless key-value solution that takes a column-oriented view of data; columns can be added at runtime and depend on the values inserted into HBase. Each lookup operation is then very fast, as it is effectively a key-value mapping from the row key to the desired column. HBase also treats timestamps as another dimension of the data, so you can directly retrieve data from a point in time.
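To make this concrete, the following is a toy sketch of HBase's logical layout (row key, then column, then timestamp, then value) built from sorted maps. All class and method names here are illustrative; this is not the real HBase client API, just a minimal model of the data shape it exposes.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// A toy model of HBase's logical layout: row key -> column -> timestamp -> value.
// Illustrative only; not the actual HBase client API.
class ToyHBaseTable {
    // TreeMaps keep keys sorted, mirroring HBase's sorted row keys and
    // the ordered versions HBase keeps within a cell.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    // A read is a direct lookup on the row key: no table scan, hence low latency.
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) return null;
        return row.get(column).lastEntry().getValue();   // the newest version wins
    }

    // Timestamps are a first-class dimension: ask for the value as of a point in time.
    String getAsOf(String rowKey, String column, long asOf) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) return null;
        java.util.Map.Entry<Long, String> e = row.get(column).floorEntry(asOf);
        return e == null ? null : e.getValue();
    }
}
```

The nested sorted maps capture why both "latest value for this key" and "value at time T" are cheap lookups rather than scans.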

The data model is very powerful but does not suit all use cases, just as the relational model isn't universally applicable. But if you have a need for structured low-latency views on large-scale data stored in Hadoop, HBase is absolutely something you should look at.

Oozie

We have said many times that Hadoop clusters do not live in a vacuum and need to integrate with other systems and into broader workflows. Oozie, available at http://oozie.apache.org, is a Hadoop-focused workflow scheduler that addresses this latter scenario.

In its simplest form, Oozie provides mechanisms to schedule the execution of MapReduce jobs based either on time (for example, run this job every hour) or on data availability (for example, run this job when new data arrives in this location). It allows the specification of multi-stage workflows that can describe a complete end-to-end process.
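As a flavor of what a time-based trigger looks like, here is a sketch of an Oozie coordinator definition. The application name, dates, and HDFS path are illustrative, and the exact schema version varies by Oozie release; the referenced `workflow.xml` is where the multi-stage workflow itself would be described.

```xml
<!-- Sketch of an Oozie coordinator that runs a workflow once per hour.
     Names, dates, and paths are illustrative. -->
<coordinator-app name="hourly-example" frequency="${coord:hours(1)}"
                 start="2013-01-01T00:00Z" end="2013-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- Points at the directory holding the workflow.xml to execute -->
      <app-path>hdfs://namenode/apps/example-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```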

In addition to straightforward MapReduce jobs, Oozie can also schedule jobs that run Hive or Pig commands as well as tasks entirely outside of Hadoop (such as sending emails, running shell scripts, or running commands on remote hosts).

There are many ways of building workflows; a common approach is to use Extract, Transform, and Load (ETL) tools such as Pentaho Kettle (http://kettle.pentaho.com) and Spring Batch (http://static.springsource.org/spring-batch). These particular tools do include some Hadoop integration, but traditional dedicated workflow engines may not. Consider Oozie if you are building workflows with significant Hadoop interaction and you do not have an existing workflow tool with which you must integrate.

Whirr

When looking to use cloud services such as Amazon AWS for Hadoop deployments, it is usually much easier to use a higher-level service such as Elastic MapReduce than to set up your own cluster on EC2. Though there are scripts to help, the overhead of deploying Hadoop on cloud infrastructure yourself can be considerable. That is where Apache Whirr from http://whirr.apache.org comes in.

Whirr is not focused solely on Hadoop; it is about provider-independent instantiation of cloud services, of which Hadoop is just one example. Whirr provides a programmatic way of specifying and creating Hadoop-based deployments on cloud infrastructures that handles all the underlying service aspects for you. It does this in a provider-independent fashion so that once you have launched on, say, EC2, you can use the same code to create an identical setup on another provider such as Rackspace or Eucalyptus. This makes vendor lock-in, often a concern with cloud deployments, less of an issue.
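To give a flavor of how a deployment is specified, a Whirr cluster recipe is essentially a small properties file. The following is a sketch based on Whirr's Hadoop recipes; the cluster name, instance counts, and credential handling are illustrative.

```properties
# Sketch of a Whirr cluster recipe; names and counts are illustrative.
whirr.cluster-name=testcluster
# One master node and three workers, using Whirr's Hadoop role names
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

A cluster described this way would then be brought up with something like `whirr launch-cluster --config hadoop.properties`, with Whirr provisioning the instances and configuring the Hadoop services on them.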

Whirr is not quite there yet. Today it is limited in the services it can create and supports only a single provider, AWS. However, if you are interested in cloud deployment with less pain, it is worth watching its progress.

Mahout

The previous projects are all general-purpose in that they provide capabilities independent of any particular application area. Apache Mahout, located at http://mahout.apache.org, is instead a library of machine learning algorithms built atop Hadoop and MapReduce.

The Hadoop processing model is often well suited to machine learning applications, where the goal is to extract value and meaning from a large dataset. Mahout provides implementations of common machine learning techniques such as clustering and recommendation.
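To illustrate what a recommender does, here is a toy user-based recommender using cosine similarity over shared ratings. This is plain Java showing the kind of computation Mahout's recommender implementations perform at scale; it is not Mahout's actual API, and all names are illustrative.

```java
import java.util.*;

// Toy user-based recommender: score unseen items by summing other users'
// ratings weighted by how similar those users are to the target user.
class ToyRecommender {
    // userId -> (itemId -> rating)
    private final Map<String, Map<String, Double>> ratings = new HashMap<>();

    void rate(String user, String item, double score) {
        ratings.computeIfAbsent(user, u -> new HashMap<>()).put(item, score);
    }

    // Cosine similarity computed over the items both users have rated.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Recommend the unseen item with the highest similarity-weighted score.
    String recommend(String user) {
        Map<String, Double> mine = ratings.getOrDefault(user, Map.of());
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user)) continue;
            double sim = similarity(mine, other.getValue());
            for (Map.Entry<String, Double> r : other.getValue().entrySet()) {
                if (!mine.containsKey(r.getKey())) {
                    scores.merge(r.getKey(), sim * r.getValue(), Double::sum);
                }
            }
        }
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse(null);
    }
}
```

The value of Mahout is that it runs this class of algorithm as MapReduce jobs over datasets far too large for an in-memory loop like this one.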

If you have a lot of data and need help finding the key patterns, relationships, or just the needles in the haystack, Mahout may be able to help.

MRUnit

The final Apache Hadoop project we will mention also highlights the breadth of what is available. To a large extent, it does not matter how many cool technologies or which distribution you use if your MapReduce jobs frequently fail due to latent bugs. The recently promoted MRUnit, from http://mrunit.apache.org, can help here.

Developing MapReduce jobs can be difficult, especially in the early days, and testing and debugging them is almost always hard. MRUnit takes the unit test model of its namesakes, such as JUnit and DBUnit, and provides a framework to help write and execute tests that can improve the quality of your code. Build up a test suite, integrate it with automated test and build tools, and suddenly all those software engineering best practices that you would not dream of ignoring when writing non-MapReduce code are available here as well.
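The idea behind MRUnit, in miniature, is to exercise map and reduce logic in isolation and assert on its (key, value) output; MRUnit's drivers, such as `MapDriver` with its `withInput(...)`, `withOutput(...)`, and `runTest()` methods, apply this pattern directly to real Hadoop Mapper and Reducer classes. The sketch below shows the same idea in plain Java against hand-rolled word-count map logic, with no Hadoop dependencies; the class and method names are illustrative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Word-count map logic factored into a pure function so it can be unit tested
// without a cluster: one (word, 1) pair per token in the input line.
class WordCountMapLogic {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) out.add(new SimpleEntry<>(token, 1));
        }
        return out;
    }
}
```

A test then feeds in a known line and asserts on the exact pairs emitted, which is precisely the feedback loop MRUnit gives you for real Mapper and Reducer implementations.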

MRUnit may be of interest, well, if you ever write any MapReduce jobs. In my humble opinion, it's a really important project; please check it out.
