A perfect world would let us stop time to research and write, since a technical book covers a moving target. We didn’t have such a luxury, so instead we set aside some space to pick up on some new arrivals.
This chapter mentions a few tools for which we could have provided more coverage, had we been willing to postpone the book’s release date. Think of this as a look into one possible future of R parallelism. Special thanks to our colleagues, reviewers, and friends who so kindly brought these to our attention.
The foreach() function[62] executes an arbitrary R expression across the elements of an input. foreach()'s strength is that it can execute in parallel with the help of a supplied parallel backend. The doRedis package provides such a backend, using the Redis datastore[63] as a job queue.
doRedis can work locally to take advantage of multicore systems, and can also farm tasks out to remote R instances (“workers”). It’s straightforward to add or remove workers at runtime, even in mid-job, to adapt to changing work conditions or speed up job processing. Similar to Hadoop, doRedis is fault-tolerant in that failed tasks are automatically resubmitted to their job queue. doRedis supports Linux, Mac OS X, and Windows systems.
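To give a flavor of how the pieces fit together, here is a minimal sketch of running foreach() with the doRedis backend. It assumes a Redis server is listening on localhost’s default port; the queue name "jobs" is our own choice, not anything mandated by the package.

```r
## A minimal sketch: foreach() running on the doRedis backend.
## Assumes a Redis server is reachable on localhost:6379.
library(foreach)
library(doRedis)

registerDoRedis("jobs")                    # use the Redis list "jobs" as the work queue
startLocalWorkers(n = 2, queue = "jobs")   # spin up two local workers

## %dopar% sends each iteration to the queue; workers pull and execute them.
result <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)

removeQueue("jobs")                        # clean up the queue when done
```

Because workers simply watch a Redis queue, you can start additional workers on other machines pointed at the same Redis host, and they will begin pulling tasks from the same job.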
Description: http://bigcomputing.com/doRedis.html
Revolution Analytics is a company that provides R tools, support, and training. They have two products of note.
First up is the commercial Revolution R Enterprise. The current beta release includes RevoScaleR (RSR), which brings distributed computing to R. When you use the special XDF data format, RSR functions know to work on that data one chunk at a time, which addresses R’s memory limitations. (This is not unlike Hadoop+HDFS.) To address R’s CPU limitations, RSR includes functions to run code across several local cores, or across a cluster of machines running MS Windows HPC Server.[64]
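RevoScaleR’s API is proprietary, so we won’t reproduce it here; but the chunk-at-a-time idea itself is easy to illustrate in base R. The sketch below (our own code, not RevoScaleR’s) computes the mean of the first column of a large CSV file without ever holding the whole file in memory:

```r
## Not RevoScaleR: a base-R sketch of the chunk-at-a-time idea.
## Computes the mean of a CSV's first column one chunk at a time,
## so memory use depends on chunk_size, not on file size.
chunked_mean <- function(file, chunk_size = 10000) {
  con <- file(file, open = "r")
  on.exit(close(con))
  readLines(con, n = 1)                     # skip the header row
  total <- 0
  count <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break           # end of file
    x <- as.numeric(sub(",.*", "", lines))  # keep only the first column
    total <- total + sum(x)
    count <- count + length(x)
  }
  total / count
}
```

Statistics that decompose over chunks this way (sums, counts, cross-products) are exactly the kind of operations that chunk-aware functions can distribute across cores or cluster nodes.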
Second, and more recently, the Revolution gang released the open-source RHadoop packages (also known as RevoConnectR) to marry Hadoop and R: rmr provides the core MapReduce functionality; rhdfs routines let you manage data in HDFS; and rhbase talks to HBase, the Hadoop-backed database. We’re especially interested in rmr, which strives to be a clean, intuitive way to access Hadoop power without leaving the R comfort zone. RHadoop is still young, but we think it has strong potential.
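To show what “without leaving the R comfort zone” looks like, here is a hedged sketch in rmr’s style: map and reduce steps are plain R functions, and data moves to and from HDFS with to.dfs() and from.dfs(). Running it requires a working Hadoop installation, and since RHadoop is young, the exact function signatures may shift between releases.

```r
## A sketch of rmr's MapReduce interface (early RHadoop releases).
## Requires a configured Hadoop cluster; will not run standalone.
library(rmr)

small.ints <- to.dfs(1:100)   # push a small vector into HDFS

## The map step is an ordinary R function of (key, value) pairs;
## here each input value is emitted keyed by itself, squared.
out <- mapreduce(input = small.ints,
                 map = function(k, v) keyval(v, v^2))

result <- from.dfs(out)       # pull the (key, value) results back into R
```

The appeal is that nothing here looks like Java or streaming glue code; the MapReduce job is expressed entirely in R.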
Both RSR and rmr fold into your typical data analysis work: you use special functions and constructs to get the essence of a larger dataset, then pass those results to standard R functions for plotting and further analysis.
Revolution R Enterprise and RevoScaleR: http://www.revolutionanalytics.com/
RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
cloudnumbers.com is a platform for on-demand distributed, parallel computing. It provides out-of-the-box support for R as well as C/C++ and Python. We see cloudnumbers.com as a cousin of Amazon’s EC2, but specialized for scientific HPC work.
That said, cloudnumbers.com is an infrastructure, not a packaged parallelism strategy. It’s up to the researcher to choose and set up their tools (perhaps some of the topics we cover in this book) to take advantage of the hardware. We nonetheless feel it is worth mentioning because it is closely related to this book’s topic. You can find out more at http://cloudnumbers.com/.
[64] The upcoming Revolution R Enterprise 5.0 supports 64-bit Red Hat Enterprise Linux 5 in addition to various Windows flavors. For now, though, the cluster backend must run MS Windows HPC Server. A comment in a blog post states the team has eyes on Linux cluster support: http://blog.revolutionanalytics.com/2011/07/fast-logistic-regression-big-data.html.