Other R packages for large-scale machine learning

Apart from RHadoop and SparkR, there are several other native R packages specifically built for large-scale machine learning. Here, we give a brief overview of them. Interested readers should refer to CRAN Task View: High-Performance and Parallel Computing with R (reference 10 in the References section of the chapter).

Though R is single-threaded, there exist several packages for parallel computation in R. Some of the well-known ones are Rmpi (an R interface to the popular Message Passing Interface), multicore, snow (for building R clusters), and foreach. Since R 2.14.0, a package called parallel has shipped with base R. We will discuss some of its features here.

The parallel R package

The parallel package is built on top of the multicore and snow packages. It is useful for running the same computation on multiple datasets, as in K-fold cross-validation. It can parallelize over multiple CPUs/cores on a single machine or across several machines. To parallelize across a cluster of machines, it can invoke MPI (Message Passing Interface) through the Rmpi package.

We will illustrate the use of the parallel package with the simple example of computing the squares of the numbers in the list 1:100000. This example will not work on Windows, since the mclapply function relies on process forking (inherited from the multicore package), which is not available on that platform. It can be tested on any Linux or OS X platform.

The sequential way of performing this operation is to use the lapply function as follows:

>nsquare <- function(n){return(n*n)}
>range <- c(1:100000)
>system.time(lapply(range,nsquare))

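Using the mclapply function of the parallel package, the same computation can be spread over several cores. For a function as cheap as this one, the overhead of forking can offset much of the gain; the speedup is larger when each call does more work: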

>library(parallel) #included in R core packages, no separate installation required
>numCores<-detectCores( )  #to find the number of cores in the machine
>system.time(mclapply(range,nsquare,mc.cores=numCores))
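The K-fold cross-validation use case mentioned earlier can be sketched with mclapply in the same way. The following is a minimal illustration, not a definitive implementation: the toy data, the linear model, and the names dat, fold, and cv_fold are all assumptions made for this example.

```r
library(parallel)

set.seed(42)
# Toy data (an assumption for illustration): y depends linearly on x plus noise
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n, sd = 0.5)

k <- 5
# Randomly assign each row to one of k folds
fold <- sample(rep(1:k, length.out = n))

# Fit on k-1 folds and return the mean squared error on the held-out fold
cv_fold <- function(i) {
  fit <- lm(y ~ x, data = dat[fold != i, ])
  pred <- predict(fit, newdata = dat[fold == i, ])
  mean((dat$y[fold == i] - pred)^2)
}

# Forking is unavailable on Windows, so fall back to a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L
mse <- unlist(mclapply(1:k, cv_fold, mc.cores = cores))
mean(mse)  # cross-validated estimate of the prediction error
```

Each fold is an independent task, so the folds can be evaluated on separate cores without any communication between them.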

If the dataset is so large that it needs a cluster of computers, we can use the parLapply function to run the program over a cluster. This needs the Rmpi package:

>install.packages(c("Rmpi","snow")) #one time; makeCluster with type="MPI" also relies on snow
>library(Rmpi)
>numNodes<-4 #number of workers nodes
>cl<-makeCluster(numNodes,type="MPI")
>system.time(parLapply(cl,range,nsquare))
>stopCluster(cl)
>mpi.exit( )
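When MPI is not set up, or on Windows where forking is unavailable, the same parLapply call can be tried on a socket cluster of local worker processes. This is a minimal sketch assuming two workers on the local machine; the shorter input xs and the variable res are chosen only for illustration.

```r
library(parallel)

nsquare <- function(n) n * n
xs <- 1:1000

# Start two local worker processes; the default cluster type is "PSOCK"
cl <- makeCluster(2)
res <- parLapply(cl, xs, nsquare)
stopCluster(cl)  # always release the workers when done

# Check that the parallel result matches the sequential one
identical(res, lapply(xs, nsquare))
```

Because socket workers are separate R processes, any packages or data they need must be sent to them explicitly (for example with clusterExport); the function passed to parLapply is shipped automatically.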

The foreach R package

The foreach package provides a looping construct that can be executed in parallel across multiple cores or clusters. It has two important operators: %do% for executing tasks sequentially and %dopar% for executing tasks in parallel.

For example, the squaring function we discussed in the previous section can be implemented in a single line using the foreach package:

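>install.packages("foreach") #one time
>install.packages("doParallel") #one time
>library(foreach)
>library(doParallel)
>registerDoParallel(cores=detectCores()) #register a parallel backend; without one, %dopar% runs sequentially with a warning
>system.time(foreach(i=1:100000) %do% i^2) #for executing sequentially
>system.time(foreach(i=1:100000) %dopar% i^2) #for executing in parallel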
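By default, foreach returns its results as a list. The short sketch below shows how the .combine argument and the %:% when() filter change what is returned; the variable names are chosen only for this illustration, and %do% is used so that no parallel backend is needed.

```r
library(foreach)

# Without .combine, foreach returns a list with one element per iteration
squares_list <- foreach(i = 1:5) %do% i^2

# With .combine = c, the results are concatenated into a vector
squares_vec <- foreach(i = 1:5, .combine = c) %do% i^2

# The %:% when() filter keeps only iterations that satisfy a condition
even_squares <- foreach(i = 1:10, .combine = c) %:% when(i %% 2 == 0) %do% i^2
```

Replacing %do% with %dopar% leaves the results unchanged once a backend has been registered, since .combine is applied to the results in iteration order by default.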

The following example implements quick sort using the foreach function:

>qsort<- function(x) {
  n <- length(x)
  if (n == 0) {
    x
  } else {
    p <- sample(n,1)
    smaller <- foreach(y=x[-p],.combine=c) %:% when(y <= x[p]) %do% y
    larger  <- foreach(y=x[-p],.combine=c) %:% when(y >  x[p]) %do% y
    c(qsort(smaller),x[p],qsort(larger))
  }
}
>qsort(runif(12))

These packages are still under active development. They have not yet been widely used for Bayesian modeling, although they are easy to apply to Bayesian inference tasks such as Monte Carlo simulations.
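As a rough illustration of how a Monte Carlo calculation might be parallelized, the sketch below estimates a posterior probability by drawing independent batches of samples on separate workers. The toy problem (7 successes in 10 trials with a uniform prior), the batch sizes, and the names one_batch, shape_a, and shape_b are assumptions made for this example.

```r
library(parallel)

# Assumed toy problem: 7 successes in 10 trials with a uniform Beta(1, 1)
# prior gives a Beta(8, 4) posterior for the success probability theta
shape_a <- 1 + 7
shape_b <- 1 + 3

# Each task draws posterior samples and reports the share above 0.5
one_batch <- function(n_draws, shape_a, shape_b) {
  mean(rbeta(n_draws, shape_a, shape_b) > 0.5)
}

cl <- makeCluster(2)                  # two local worker processes
clusterSetRNGStream(cl, iseed = 123)  # reproducible parallel random streams
batches <- parSapply(cl, rep(50000, 8), one_batch,
                     shape_a = shape_a, shape_b = shape_b)
stopCluster(cl)

mc_estimate <- mean(batches)              # Monte Carlo estimate of P(theta > 0.5)
exact <- 1 - pbeta(0.5, shape_a, shape_b) # closed-form value for comparison
c(monte_carlo = mc_estimate, exact = exact)
```

The batches are independent, so they can run on different workers with no coordination. Calling clusterSetRNGStream gives each worker its own random number stream, which avoids correlated draws across workers.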
