Welcome to the last of the book’s recipes for R parallelism. This will be a short chapter, but don’t let that fool you: Segue’s scope is intentionally narrow, and that focus makes it a particularly powerful tool.
Segue’s mission is as simple as it gets: make it easy to use Elastic MapReduce as a parallel backend for lapply()-style operations. So easy, in fact, that it boasts of doing this in only two lines of R code.[59]
This narrow focus is no accident. Segue’s creator, JD Long, wanted occasional access to a Hadoop cluster to run his pleasantly parallel,[60] computationally expensive models. Elastic MapReduce was a great fit but still a bit cumbersome for his workflow. He created Segue to tackle the grunt work so he could focus on his higher-level modeling tasks.
Segue is a relatively young package. Nonetheless, since its creation in 2010, it has attracted a fair amount of attention.
Motivation: You want Hadoop power to drive some lapply() loops, perhaps for a parameter sweep, but you want minimal Hadoop contact. You consider MapReduce to be too much of a distraction from your work.
Solution: Use the segue package’s emrlapply() to send your calculations up to Elastic MapReduce, the Amazon Web Services cloud-based Hadoop product.
Good because: You get to focus on your modeling work, while segue takes care of transforming your lapply() work into a Hadoop job.
Segue takes care of launching the Elastic MapReduce cluster, shipping data back and forth, and all other such housekeeping. As such, it abstracts you from having to know much about Hadoop, and even Elastic MapReduce. Your monthly bill from Amazon Web Services will be your only real indication that you’ve done anything beyond standard R.
Still, there is a catch: I emphasize that Segue is designed for CPU-intensive jobs across a large number of inputs, such as parameter sweeps. If you have data-intensive work, or only a few inputs, Segue will not shine. Also, Segue works only with Elastic MapReduce. It cannot talk to your in-house Hadoop cluster.
Segue requires that you have an AWS account. (Be sure to enable the Elastic MapReduce service.) If you haven’t already done this, you’ll want to grab your preferred credit card and head over to http://aws.amazon.com/.
I’d also suggest you run one of Amazon’s sample Elastic MapReduce jobs so you can familiarize yourself with the AWS console. It will come in handy later, when you double-check that your cluster has indeed shut down. I’ll discuss that part shortly. For the remainder of this chapter, though, I’ll assume you’re familiar with AWS concepts.
Next, install Segue. It isn’t available on CRAN, so grab the source bundle from the project website at http://code.google.com/p/segue/ and run:
R CMD INSTALL {file}
from your OS command line.
As of this writing, Segue does not run under Windows.
Segue has only one use case, so I have just one example to show you.
Situation: You’re doing a parameter sweep across a large number of inputs, and running it locally using lapply() just takes too long.
The code: To set the stage, let’s say you have a function runModel() that takes a single list as input. You wrap up the entire set of inputs in a parent list (that is, a list-of-lists) called input.list. To execute runModel() on each sub-list, you could use the standard lapply() like so:
runModel <- function( params ){ ... }
input.list <- list( ... each element is also a list ... )
lapply.result <- lapply( input.list , runModel )
So far, this is nothing new, and it works fine for most cases. If input.list contains enough elements and each iteration of runModel() takes a few minutes, though, this exercise could run for several hours on your local workstation. I’ll show you how to transform that lapply() call into the Segue equivalent.
Segue setup:
library(segue)
setCredentials( "your AWS access key", "your AWS secret key" )
emr.handle <- createCluster( numInstances=6 , ec2KeyName="...your AWS SSH key..." )
This first R excerpt prepares your environment by loading the Segue library and launching your cluster. Of note:
The call to setCredentials() accepts your AWS credentials. Understandably, not everyone wants to embed passwords in a script or type them into the R console (where they’ll end up in your .Rhistory file). As an alternative, Segue can pull those values from the environment variables AWSACCESSKEY and AWSSECRETKEY. Be sure to define these variables before you launch R.
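If you take the environment-variable route, it’s worth confirming that R can actually see those values before you rely on Segue to read them. The snippet below is plain base R, not part of Segue; the variable names come from the text above, and the check itself is just a convenience sketch:

```r
# Read both credential variables; Sys.getenv() returns "" for any
# variable that is not defined in R's environment.
creds <- Sys.getenv(c("AWSACCESSKEY", "AWSSECRETKEY"))

# TRUE only when both variables are set to non-empty values.
creds.defined <- all(nzchar(creds))
```

If creds.defined comes back FALSE, define the variables in your shell and restart R, as the text advises.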
createCluster() connects to Amazon and builds your cluster. The numInstances parameter specifies the number of nodes (machine instances) in the cluster. A value of 1 means all the work will take place on a single node, a combined master and worker. For some larger value N, there will always be N-1 worker nodes and one master node. In other words, there’s not much difference between numInstances=1 and numInstances=2, since you’ll have just a single worker node in either case.
Why, then, would a person want numInstances=1? You could use this for testing, or for those cases in which you just want a separate machine to do the heavy lifting. Suppose your local machine is a netbook or some other resource-constrained hardware. You could use Segue to offload the big calculations to a single-node cluster, then use your local machine for simple plots and smaller calculations.
Recall that Hadoop splits up your input to distribute work throughout the cluster. Segue’s author recommends at least ten input items for each worker node. A smaller input list may lead to an imbalance, with one node taking on most of the work.
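To make that sizing rule concrete, here is one way to build a suitably large input.list for a parameter sweep. Everything below is plain base R (expand.grid and friends), not part of Segue; the parameter names alpha and beta are made up for illustration:

```r
# Build a grid of parameter combinations: 10 alpha values x 10 beta
# values = 100 combinations.
param.grid <- expand.grid(alpha = seq(0.1, 1, by = 0.1),
                          beta  = 1:10)

# Convert each row of the grid into its own list element, yielding
# the list-of-lists shape that emrlapply() expects.
input.list <- lapply(seq_len(nrow(param.grid)),
                     function(i) as.list(param.grid[i, ]))

# With numInstances=6 (1 master + 5 workers, per the text), the
# ten-items-per-worker guideline calls for at least 50 inputs.
num.workers <- 6 - 1
length(input.list) >= 10 * num.workers   # TRUE: 100 items for 5 workers
```

With 100 inputs spread across 5 workers, Hadoop has enough pieces to keep the load reasonably balanced.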
createCluster() will print log messages while the launch and bootstrap take place, and return control to your R console once the cluster is ready for action. Its return value emr.handle is a handle to the remote EMR cluster. Save this, as you’ll need it to send work to the cluster, and also to shut it down later.
Now, you’re ready to run the lapply() loop on the cluster, Segue-style:
emr.result <- emrlapply(emr.handle , input.list , runModel, taskTimeout=10 )
emrlapply() looks and acts very much like an lapply() call, doesn’t it? The only new parameters are the cluster handle emr.handle and the task timeout taskTimeout. (I discussed Hadoop task timeouts in Chapter 6.) Here, the timeout is set to ten minutes.
Behind the scenes, Segue has packed up your data, shipped it to the cloud, run the job, collected the output data, and brought it back to you safe and sound. As far as you can see from your R console, though, nothing out of the ordinary has happened. Such is the beauty of Segue.
Reviewing the output: There’s surprisingly little to explain as far as reviewing output from emrlapply(): if you feed lapply() and emrlapply() the same input, they should return the same values. That means emrlapply() is an almost seamless replacement for lapply() in your typical R workflow (at least for lapply() calls that have suitably sized inputs). I emphasize “almost” seamless, because there is one catch: emrlapply() expects a plain list as input, whereas lapply() will attempt to munge a non-list to a list.
Because lapply() and emrlapply() are so similar, you can test your code on a small sample set using the former, before launching a cluster to run the latter.
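That test-locally-first workflow might look something like the following sketch. The runModel() here is a toy stand-in (your real model function is whatever you’re sweeping), and the emrlapply() line is shown only as a comment since it requires a live cluster handle:

```r
# Toy stand-in for a real, computationally expensive model function.
runModel <- function(params) params$alpha * params$beta

input.list <- list(list(alpha = 1, beta = 2),
                   list(alpha = 3, beta = 4))

# Dry run on a small slice with plain lapply()...
local.result <- lapply(input.list[1:2], runModel)

# ...and once the output looks right, the very same call moves to the
# cluster (requires emr.handle from createCluster()):
#   emr.result <- emrlapply(emr.handle, input.list, runModel)
```

Because the two functions should return the same values for the same input, a correct local run gives you good confidence in the cluster run before you spend a dime on AWS.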
Speaking of workflows, I’d like to emphasize that you can use the same cluster handle for many calls to emrlapply(). You don’t have to launch a new cluster for each emrlapply() call. For example, you could use Segue to launch a cluster in the morning, call emrlapply() several times during the day, and then shut down the cluster in the evening. This is very important to know, since that initial call to createCluster() can take several minutes to return. You probably don’t want to do that several times a day.
Eventually, though, you’ll run out of work to do, at which point you’ll want to shut down your EMR cluster. Simply call Segue’s stopCluster():
stopCluster(emr.handle)
Keep in mind that there are only two ways to terminate the cluster:
Call stopCluster() from your R console
Use AWS tools (such as the AWS web console, or the command-line Elastic MapReduce toolset) to terminate the EMR job
Did you notice something missing? “Quit R” is not on this list, because closing R will not terminate the cluster.
To spare you an unexpectedly large AWS bill, I’ll even put this inside a warning box for you:
The cluster will keep running until you actively shut it down by terminating the EMR job. Even if you close R, or your local workstation crashes, your EMR cluster will keep running and Amazon will continue to bill you for the time.
This is one reason to familiarize yourself with the Elastic MapReduce tab of the AWS console: should your local workstation crash, or you otherwise lose the cluster handle returned by createCluster(), you’ll have to manually terminate the cluster.
Segue very much abstracts you from Hadoop, Elastic MapReduce, and even Amazon Web Services. As such, it is the “most R” (or, if you prefer, “least Hadoop”) of the Hadoop-related strategies presented in this book. If your goal is to run a large lapply()-style calculation and get on with the rest of your R work, Segue wins hands-down compared to R+Hadoop and RHIPE.
Tied to Amazon’s cloud: Segue only works with Elastic MapReduce. This means it won’t help you take advantage of your in-house Hadoop cluster, or your self-managed cluster in the cloud.
(This is part of why Segue isn’t helpful for data-intensive work: data transformation and transfer to and from Amazon’s cloud would counteract the benefits of making your lapply() loop run in parallel.)
Requires extra responsibility: By default, Elastic MapReduce builds an ephemeral cluster that lasts only as long as a job’s runtime. Segue, on the other hand, tells EMR to leave the cluster running after the first job completes. That leaves it to you to check and double-check that you’ve truly terminated the cluster when you’re done running jobs.
Granted, this is a concern when you use any cloud resources. I mention it here because Segue shields you so well from the cluster build-out that it’s easy to forget you’ve left anything running. Beware, then, and make a habit of regularly checking your AWS account’s billing page.[61]
Limited scope: Segue has just one use case, and that narrow focus is its blessing as well as its curse. If all you want is to turbo-charge your lapply() loops with little distraction from your everyday R work, Segue is a great fit. If you need anything else, or if you live to twiddle Hadoop clusters by hand, Segue will not make you very happy.
Segue is designed to do one thing, and do it well: use Amazon’s Elastic MapReduce as a backend for lapply()-style work. It abstracts you from Hadoop and other technical details, which makes it useful for people who find cluster management a real distraction from their R work.
[59] Segue’s original slogan was a bit spicier, which is a nice way of saying that it’s not printable. JD has since softened the message, but it’s hard not to appreciate the original slogan’s enthusiasm.
[60] Sometimes known as “embarrassingly parallel,” though we can’t fathom what could possibly be embarrassing about parallel computation.
[61] I was once bitten by cluster resources gone awry (not related to Segue) and have since made a habit of checking the AWS Billing page on a regular basis. Amazon, if you’re listening: could you please provide an API to check billing? Thank you.