Chapter 8. Segue

Welcome to the last of the book’s recipes for R parallelism. This will be a short chapter, but don’t let that fool you: Segue’s scope is intentionally narrow. This focus makes it a particularly powerful tool.

Segue’s mission is as simple as it gets: make it easy to use Elastic MapReduce as a parallel backend for lapply()-style operations. So easy, in fact, that it boasts of doing this in only two lines of R code.[59]

This narrow focus is no accident. Segue’s creator, JD Long, wanted occasional access to a Hadoop cluster to run his pleasantly parallel,[60] computationally expensive models. Elastic MapReduce was a great fit but still a bit cumbersome for his workflow. He created Segue to tackle the grunt work so he could focus on his higher-level modeling tasks.

Segue is a relatively young package. Nonetheless, since its creation in 2010, it has attracted a fair amount of attention.

Quick Look

Motivation: You want Hadoop power to drive some lapply() loops, perhaps for a parameter sweep, but you want minimal Hadoop contact. You consider MapReduce to be too much of a distraction from your work.

Solution: Use the segue package’s emrlapply() to send your calculations up to Elastic MapReduce, the Amazon Web Services cloud-based Hadoop product.

Good because: You get to focus on your modeling work, while segue takes care of transforming your lapply() work into a Hadoop job.

How It Works

Segue takes care of launching the Elastic MapReduce cluster, shipping data back and forth, and all other such housekeeping. As such, it abstracts you from having to know much about Hadoop, and even Elastic MapReduce. Your monthly bill from Amazon Web Services will be your only real indication that you’ve done anything beyond standard R.

Still, there is a catch: I emphasize that Segue is designed for CPU-intensive jobs across a large number of inputs, such as parameter sweeps. If you have data-intensive work, or only a few inputs, Segue will not shine. Also, Segue works only with Elastic MapReduce. It cannot talk to your in-house Hadoop cluster.

Setting Up

Segue requires that you have an AWS account. (Be sure to enable the Elastic MapReduce service.) If you haven’t already done this, you’ll want to grab your preferred credit card and head over to http://aws.amazon.com/.

I’d also suggest you run one of Amazon’s sample Elastic MapReduce jobs so you can familiarize yourself with the AWS console. It will come in handy later, when you double-check that your cluster has indeed shut down. I’ll discuss that part shortly. For the remainder of this chapter, though, I’ll assume you’re familiar with AWS concepts.

Next, install Segue. It isn’t available on CRAN, so grab the source bundle from the project website at http://code.google.com/p/segue/ and run:

R CMD INSTALL {file}

from your OS command line.

Note

As of this writing, Segue does not run under Windows.

Working with It

Model Testing: Parameter Sweep

Segue has only one use case, so I have just one example to show you.

Situation: You’re doing a parameter sweep across a large number of inputs, and running it locally using lapply() just takes too long.

The code: To set the stage, let’s say you have a function runModel() that takes a single list as input. You wrap up the entire set of inputs in a parent list (that is, a list-of-lists) called input.list. To execute runModel() on each sub-list, you could use the standard lapply() like so:

runModel <- function( params ){ ... }
input.list <- list( ... each element is also a list ... )

lapply.result <- lapply( input.list , runModel )

So far, this is nothing new, and it works fine for most cases. If input.list contains enough elements and each iteration of runModel() takes a few minutes, though, this exercise could run for several hours on your local workstation. I’ll show you how to transform that lapply() call into its Segue equivalent.
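To make this concrete, here’s a toy version of such a sweep. Everything in it is hypothetical (the “model” just fits a line to simulated data), but it has the right shape: one sub-list of parameters per model run.

runModel <- function( params ){
    # a stand-in "model": simulate data, then fit a line
    set.seed( params$seed )
    x <- rnorm( params$n )
    y <- params$beta * x + rnorm( params$n )
    coef( lm( y ~ x ) )
}

# one sub-list per parameter combination
input.list <- lapply( 1:100 , function(i){
    list( seed=i , n=1000 , beta=i/10 )
} )

lapply.result <- lapply( input.list , runModel )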

Segue setup:

library(segue)

setCredentials( "your AWS access key", "your AWS secret key" )   # (1)

emr.handle <- createCluster(                                     # (2)
        numInstances=6 ,
        ec2KeyName="...your AWS SSH key..."
)

This first R excerpt prepares your environment by loading the Segue library and launching your cluster. Of note:

(1) The call to setCredentials() accepts your AWS credentials. Understandably, not everyone wants to embed passwords in a script or type them into the R console (where they’ll end up in your .Rhistory file). As an alternative, Segue can pull those values from the environment variables AWSACCESSKEY and AWSSECRETKEY. Be sure to define these variables before you launch R.

(2) createCluster() connects to Amazon and builds your cluster. The numInstances parameter specifies the number of nodes (machine instances) in the cluster. A value of 1 means all the work will take place on a single node, a combined master and worker. For some larger value N, there will always be N-1 worker nodes and one master node. In other words, there’s not much difference between numInstances=1 and numInstances=2, since you’ll have just a single worker node in either case.

Why, then, would a person want numInstances=1? You could use this for testing, or for those cases in which you just want a separate machine to do the heavy lifting. Suppose your local machine is a netbook or some other resource-constrained hardware. You could use Segue to offload the big calculations to a single-node cluster, then use your local machine for simple plots and smaller calculations.
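As a quick sketch of that testing scenario (this assumes createCluster()’s other arguments can be left at their defaults):

# a single-node cluster: master and worker combined; handy for
# testing a job end-to-end before paying for six instances
test.handle <- createCluster( numInstances=1 )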

Note

Recall that Hadoop splits up your input to distribute work throughout the cluster. Segue’s author recommends at least ten input items for each worker node. A smaller input list may lead to an imbalance, with one node taking on most of the work.
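That rule of thumb is easy to encode. Here’s a hypothetical helper (my own, not part of Segue) that suggests a cluster size from the length of your input list:

# one master, plus enough workers to give each worker
# at least ten input items
suggestInstances <- function( n.inputs , items.per.worker=10 ){
    workers <- max( 1 , floor( n.inputs / items.per.worker ) )
    workers + 1   # one extra for the master node
}

suggestInstances( length(input.list) )   # 100 inputs: 10 workers + 1 master = 11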

createCluster() will print log messages while the launch and bootstrap take place, and return control to your R console once the cluster is ready for action. Its return value, emr.handle, is a handle to the remote EMR cluster. Save this, as you’ll need it to send work to the cluster, and also to shut it down later.

Now, you’re ready to run the lapply() loop on the cluster, Segue-style:

emr.result <- emrlapply(emr.handle , input.list , runModel, taskTimeout=10 )

emrlapply() looks and acts very much like an lapply() call, doesn’t it? The only new parameters are the cluster handle emr.handle and the task timeout taskTimeout. (I discussed Hadoop task timeouts in Chapter 6.) Here, the timeout is set to ten minutes.

Behind the scenes, Segue has packed up your data, shipped it to the cloud, run the job, collected the output data, and brought it back to you safe and sound. As far as you can see from your R console, though, nothing out of the ordinary has happened. Such is the beauty of Segue.

Reviewing the output: There’s surprisingly little to explain about the output from emrlapply(): if you feed lapply() and emrlapply() the same input, they should return the same values. That means emrlapply() is an almost seamless replacement for lapply() in your typical R workflow (at least for lapply() calls that have suitably sized inputs). I emphasize “almost” seamless, because there is one catch: emrlapply() expects a plain list as input, whereas lapply() will quietly coerce a non-list to a list.
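For example, if your inputs start life as a plain vector, convert them yourself (a sketch, with sqrt() standing in for your real worker function):

inputs <- 1:100   # a vector, not a list

# lapply( inputs , sqrt ) would coerce this for you;
# emrlapply() will not, so convert explicitly
emr.result <- emrlapply( emr.handle , as.list( inputs ) , sqrt , taskTimeout=10 )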

Note

Because lapply() and emrlapply() are so similar, you can test your code on a small sample set using the former, before launching a cluster to run the latter.
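In practice, that might look like the following sketch: run a handful of inputs through lapply() first, and commit the full list to the cluster only once the local results look sane.

local.check <- lapply( input.list[1:5] , runModel )   # cheap local smoke test

emr.result <- emrlapply( emr.handle , input.list , runModel , taskTimeout=10 )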

Speaking of workflows, I’d like to emphasize that you can use the same cluster handle for many calls to emrlapply(). You don’t have to launch a new cluster for each emrlapply() call. For example, you could use Segue to launch a cluster in the morning, call emrlapply() several times during the day, and then shut down the cluster in the evening. This is very important to know, since that initial call to createCluster() can take several minutes to return. You probably don’t want to do that several times a day.
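That day-long workflow might look like this sketch (inputs.a and inputs.b are hypothetical input lists):

emr.handle <- createCluster( numInstances=6 )   # morning: launch once

results.a <- emrlapply( emr.handle , inputs.a , runModel , taskTimeout=10 )
results.b <- emrlapply( emr.handle , inputs.b , runModel , taskTimeout=10 )

stopCluster( emr.handle )   # evening: shut down (more on this below)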

Eventually, though, you’ll run out of work to do, at which point you’ll want to shut down your EMR cluster. Simply call Segue’s stopCluster():

stopCluster(emr.handle)

Keep in mind that there are only two ways to terminate the cluster:

  • Call stopCluster() from your R console

  • Use AWS tools (such as the AWS web console, or the command-line Elastic MapReduce toolset) to terminate the EMR job

Did you notice something missing? “Quit R” is not on this list, because closing R will not terminate the cluster.

To spare you an unexpectedly large AWS bill, I’ll even put this inside a warning box for you:

Warning

The cluster will keep running until you actively shut it down by terminating the EMR job. Even if you close R, or your local workstation crashes, your EMR cluster will keep running and Amazon will continue to bill you for the time.

This is one reason to familiarize yourself with the Elastic MapReduce tab of the AWS console: should your local workstation crash, or should you otherwise lose the cluster handle returned by createCluster(), you’ll have to terminate the cluster manually.
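One defensive habit worth adopting (my own suggestion, not a Segue feature): wrap the cluster’s working life in tryCatch() so that stopCluster() runs even when a job throws an error. It can’t protect you from a workstation crash, but it covers the common case of a failed run.

emr.handle <- createCluster( numInstances=6 )

tryCatch( {
    emr.result <- emrlapply( emr.handle , input.list , runModel , taskTimeout=10 )
} , finally = {
    stopCluster( emr.handle )   # runs on success or error
} )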

When It Works…

Segue very much abstracts you from Hadoop, Elastic MapReduce, and even Amazon Web Services. As such, it is the “most R” (or, if you prefer, “least Hadoop”) of the Hadoop-related strategies presented in this book. If your goal is to run a large lapply()-style calculation and get on with the rest of your R work, Segue wins hands-down compared to R+Hadoop and RHIPE.

…And When It Doesn’t

Tied to Amazon’s cloud: Segue only works with Elastic MapReduce. This means it won’t help you take advantage of your in-house Hadoop cluster, or your self-managed cluster in the cloud.

(This is part of why Segue isn’t helpful for data-intensive work: data transformation and transfer to and from Amazon’s cloud would counteract the benefits of making your lapply() loop run in parallel.)

Requires extra responsibility: By default, Elastic MapReduce builds an ephemeral cluster that lasts only as long as a job’s runtime. Segue, on the other hand, tells EMR to leave the cluster running after the first job completes. That leaves it to you to check and double-check that you’ve truly terminated the cluster when you’re done running jobs.

Granted, this is a concern when you use any cloud resources. I mention it here because Segue shields you so well from the cluster build-out that it’s easy to forget you’ve left anything running. Beware, then, and make a habit of regularly checking your AWS account’s billing page.[61]

Limited scope: Segue has just one use case, and that narrow focus is its blessing as well as its curse. If all you want is to turbo-charge your lapply() loops with little distraction from your everyday R work, Segue is a great fit. If you need anything else, or if you live to twiddle Hadoop clusters by hand, Segue will not make you very happy.

The Wrap-up

Segue is designed to do one thing, and do it well: use Amazon’s Elastic MapReduce as a backend for lapply()-style work. It abstracts you from Hadoop and other technical details, which makes it useful for people who find cluster management a real distraction from their R work.



[59] Segue’s original slogan was a bit spicier, which is a nice way of saying that it’s not printable. JD has since softened the message, but it’s hard not to appreciate the original slogan’s enthusiasm.

[60] Sometimes known as “embarrassingly parallel,” though we can’t fathom what could possibly be embarrassing about parallel computation.

[61] I was once bitten by cluster resources gone awry (not related to Segue) and have since made a habit of checking the AWS Billing page on a regular basis. Amazon, if you’re listening: could you please provide an API to check billing? Thank you.
