Chapter 9. New and Upcoming

A perfect world would let us stop time to research and write, since a technical book covers a moving target. We didn’t have such a luxury, so instead we set aside some space to pick up on some new arrivals.

This chapter mentions a few tools for which we could have provided more coverage, had we been willing to postpone the book’s release date. Think of this as a look into one possible future of R parallelism. Special thanks to our colleagues, reviewers, and friends who so kindly brought these to our attention.

doRedis

The foreach() function[62] executes an arbitrary R expression across an input. foreach()’s strength is that it can execute in parallel with the help of a supplied parallel backend. The doRedis package provides such a backend, using the Redis datastore[63] as a job queue.

doRedis can work locally to take advantage of multicore systems, and also farm tasks out to remote R instances (“workers”). It’s straightforward to add or remove workers at runtime—even in mid-job—to adapt to changing work conditions or speed up job processing. Similar to Hadoop, doRedis is fault-tolerant in that failed tasks are automatically resubmitted to their job queue.

doRedis supports Linux, Mac OS X, and Windows systems.
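
As a rough illustration of the backend idea described above, the following sketch runs the same foreach() loop either sequentially or through doRedis. It is a minimal example under stated assumptions, not a recommended setup: the queue name "jobs", the use_redis switch, and the worker count are ours, and the doRedis branch assumes a Redis server is already listening on localhost at the default port and that the doRedis package is installed.

```r
library(foreach)

# Hypothetical switch: set to TRUE only when a Redis server is running on
# localhost (default port) and the doRedis package is installed.
use_redis <- FALSE
queue <- "jobs"   # hypothetical queue name

if (use_redis) {
  library(doRedis)
  # Register doRedis as the foreach backend; tasks go into the named queue.
  registerDoRedis(queue)
  # Start two worker R processes on this machine that pull from the queue.
  startLocalWorkers(n = 2, queue = queue)
} else {
  # Fall back to foreach's built-in sequential backend.
  registerDoSEQ()
}

# The loop itself does not change with the backend: each iteration is a
# task, and .combine = c gathers the results into one numeric vector.
roots <- foreach(i = 1:10, .combine = c) %dopar% sqrt(i)

if (use_redis) {
  # Delete the queue when the work is done.
  removeQueue(queue)
}
```

The point of the sketch is that only the registration step changes; the loop body stays the same. To add capacity from another machine, a remote R session would typically call doRedis's redisWorker() with the same queue name and the Redis server's host.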

RevoScaleR and RevoConnectR (RHadoop)

Revolution Analytics is a company that provides R tools, support, and training. They have two products of note.

First up is the commercial Revolution R Enterprise. The current beta release includes RevoScaleR (RSR), which brings distributed computing to R. When you use the special XDF data format, RSR functions know to work on that data one chunk at a time, which addresses R’s memory limitations. (This is not unlike Hadoop+HDFS.) To address R’s CPU limitations, RSR includes functions to run code across several local cores, or across a cluster of machines running MS Windows HPC Server.[64]

Second, and more recently, the Revolution gang released the open-source RHadoop packages (also known as RevoConnectR) to marry Hadoop and R: rmr provides the core MapReduce functionality; rhdfs routines let you manage data in HDFS; and rhbase talks to HBase, the Hadoop-backed database. We’re especially interested in rmr, which strives to be a clean, intuitive way to access Hadoop power without leaving the R comfort zone. RHadoop is still young, but we think it has strong potential.

Both RSR and rmr fold into your typical data analysis work: you use special functions and constructs to get the essence of a larger dataset, then pass those results to standard R functions for plotting and further analysis.
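
The chunk-at-a-time pattern can be sketched in base R alone. The example below does not use RSR, XDF, or rmr; it is only an illustration of the general idea, with a temporary CSV file standing in for a large dataset and an arbitrary chunk size and set of bins chosen by us. It reads the file a few rows at a time, keeps only small running summaries, and ends with values that ordinary R functions can consume.

```r
# Write a small CSV to a temporary file to stand in for a large dataset.
set.seed(42)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = rnorm(1000)), path, row.names = FALSE)

chunk_size <- 100                          # rows per chunk (arbitrary)
breaks <- c(-Inf, -2, -1, 0, 1, 2, Inf)    # bins for a coarse histogram

con <- file(path, open = "r")
invisible(readLines(con, n = 1))           # skip the header line

n <- 0                                     # rows seen so far
total <- 0                                 # running sum of x
counts <- numeric(length(breaks) - 1)      # running count per bin

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break            # end of file
  x <- as.numeric(lines)                   # only this chunk is in memory
  n <- n + length(x)
  total <- total + sum(x)
  counts <- counts + as.vector(table(cut(x, breaks)))
}
close(con)

chunked_mean <- total / n
```

At this point chunked_mean and counts are small, ordinary R objects, so they can be handed to standard functions such as barplot(counts) for plotting or further analysis.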

cloudnumbers.com

cloudnumbers.com is a platform for on-demand, distributed, parallel computing. It provides out-of-the-box support for R as well as C/C++ and Python. We see cloudnumbers.com as a cousin of Amazon’s EC2, but specialized for scientific HPC work.

That said, cloudnumbers.com is infrastructure, not a packaged parallelism strategy. It’s up to the researcher to choose and set up their tools, perhaps some of those we cover in this book, to take advantage of the hardware. We nonetheless feel it is worth a mention because it is closely related to this book’s topic. You can find out more at http://cloudnumbers.com/.



[64] The upcoming Revolution R Enterprise 5.0 supports 64-bit Red Hat Enterprise Linux 5 in addition to various Windows flavors. For now, though, the cluster backend must run MS Windows HPC Server. A comment in a blog post states the team has eyes on Linux cluster support: http://blog.revolutionanalytics.com/2011/07/fast-logistic-regression-big-data.html.
