8.3.2 Parallel Package
Now that we have mastered the art of optimizing the use of RAM with the ff and ffbase packages, we move on to the next problem: the slow execution of R code, which stems primarily from the fact that conventional R programming uses only one of the multiple CPUs with which most modern computers are equipped. The parallel package, which ships as part of the core R installation, can be used to implement parallel data processing in R. We start by loading the parallel package into memory. It is then advisable to check how many CPUs the computer running the R program has.
  > library(parallel)
  > detectCores()
  [1] 4
It is good practice to create a cluster of n nodes, where n = number of CPUs - 1, leaving one CPU free for the operating system and other processes. Since the computer on which we are running our R program has 4 CPUs, we will create a 3-node cluster.
  > clust <- makeCluster(3)
  > clust
  socket cluster with 3 nodes on host ‘localhost’
  > big_mean <- clusterApply(clust, as.data.frame.ffdf(ff_ccard),
  +                          fun = mean, na.rm = TRUE)
  > big_mean
  [[1]]
  [1] 94813.86
  [[2]]
  [1] 1.166582e-15
  [[3]]
  [1] 3.11899e-16
  .......
  [[29]]
  [1] -1.230406e-16
  [[30]]
  [1] 88.34962
  [[31]]
  [1] 0.001727486
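Because clusterApply() returns a plain list with one element per column, it is often convenient to flatten the result into a named numeric vector. The following is a minimal sketch, assuming ff_ccard is the ffdf object created with the ff and ffbase packages earlier and that names() on it returns its column names:
  > # Flatten the list of means and label each value with its column name
  > col_means <- setNames(unlist(big_mean), names(ff_ccard))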
Just like the clusterApply() function used above, the parallel package provides several other apply-style functions, such as parLapply(), parSapply() and parApply(), which can be even faster in execution than clusterApply(). It is good practice to close the cluster connections once the R programs that need parallel execution have completed.
  > stopCluster(clust)
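As an illustrative sketch of the parLapply() and parSapply() variants mentioned above (toy_df here is a hypothetical toy data frame, not the book's credit card dataset):
  > clust <- makeCluster(3)
  > toy_df <- data.frame(a = runif(1e6), b = runif(1e6))  # hypothetical toy data
  > # parLapply() returns a list, one element per column of toy_df
  > toy_means <- parLapply(clust, toy_df, mean, na.rm = TRUE)
  > # parSapply() simplifies the result to a named numeric vector
  > toy_means_vec <- parSapply(clust, toy_df, mean, na.rm = TRUE)
  > stopCluster(clust)  # release the worker processes when done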