220 | Big Data Simplied
8.4 INTEGRATING HADOOP WITH R
To begin, open the R console from the Ubuntu terminal using the following command.
amit@amit-Lenovo-Z51-70:~$ R
Once the R console is open, check the current working directory using R command ‘getwd()’.
To integrate R with the Hadoop ecosystem, the RHadoop collection can be used. RHadoop is a
collection of five R packages that allow users to manage and analyse data with Hadoop. The
packages have been tested on recent releases of the Cloudera and Hortonworks Hadoop distributions.
A brief description of the five packages under RHadoop follows.
rhdfs: It provides basic connectivity to the Hadoop Distributed File System. R programmers
can browse, read, write and modify files stored in HDFS from within R.
rhbase: It provides basic connectivity to the HBase distributed database using the Thrift
server. R programmers can browse, read, write and modify tables stored in HBase from
within R.
rmr2: It allows users to perform statistical analysis in R using Hadoop MapReduce
on a Hadoop cluster.
plyrmr: It enables users to perform common data manipulation operations, as found in popular
packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr2, it
relies on Hadoop MapReduce to perform its tasks, but it provides a familiar plyr-like interface
while hiding many of the MapReduce details.
ravro: It provides the ability to read and write Avro files from the local and HDFS file systems and
adds an Avro input format for rmr2.
First, download the following packages (or their latest versions) from
https://github.com/RevolutionAnalytics/Rhadoop/wiki/Downloads.
M08 Big Data Simplified XXXX 01.indd 220 5/10/2019 10:01:18 AM
Working with Big Data inR | 221
For rhdfs package: rhdfs_1.0.8.tar.gz
For rhbase package: rhbase_1.2.1.tar.gz
For rmr2 package: rmr2_3.3.1.tar.gz
For plyrmr package: plyrmr_0.6.0.tar.gz
For ravro package: ravro_1.0.4.tar.gz
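Before moving on, it can help to confirm that all five archives were actually downloaded. The following is a minimal sketch, assuming the archive versions listed above and a download folder passed as an argument. The function name missing_tarballs is hypothetical and not part of RHadoop.

```shell
#!/bin/sh
# List which of the five RHadoop archives named in the text are absent from a directory.
# missing_tarballs is an illustrative helper name, not an RHadoop or R command.
missing_tarballs() {
    dir=$1
    for f in rhdfs_1.0.8.tar.gz rhbase_1.2.1.tar.gz rmr2_3.3.1.tar.gz \
             plyrmr_0.6.0.tar.gz ravro_1.0.4.tar.gz; do
        # -f is true only for an existing regular file
        [ -f "$dir/$f" ] || printf '%s\n' "$f"
    done
}

# Example: check the current user's Downloads folder (prints nothing if all are present).
missing_tarballs "$HOME/Downloads"
```

If you downloaded newer versions, the file names in the loop need to be changed to match.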
The les are stored in the Downloads folder (/home/<usrname>/Downloads). Before installing
each of the above packages, all the other packages on which these packages are dependent on
need to be installed. Following is a quick step-by-step guide on what to install and how.
A. Let’s start with the rmr2 package. It depends on the caTools package, so the
sequence of installation steps is as follows.
1. Install the caTools package from within the R console (or RStudio) using the following
command.
   > install.packages("caTools")
If this produces an error, try the extended version of the command.
   > install.packages("caTools", repos = "https://cran.rstudio.com",
                      dependencies = TRUE)
2. Then exit the R console to return to the Ubuntu prompt and run the installation for rmr2.
amit@amit-Lenovo-Z51-70:~$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL /home/amit/Downloads/rmr2_3.3.1.tar.gz
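The install command passes HADOOP_CMD to R, so it is worth confirming that the path really points at the hadoop launcher. The sketch below assumes the /usr/bin/hadoop location used in the text. The helper name check_hadoop_cmd is hypothetical.

```shell
#!/bin/sh
# Return success only if the given path is an executable regular file.
# check_hadoop_cmd is a hypothetical helper name, not an R or Hadoop command.
check_hadoop_cmd() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        printf 'ok: %s\n' "$1"
    else
        printf 'not an executable file: %s\n' "$1" >&2
        return 1
    fi
}

# /usr/bin/hadoop is the location used in the text; adjust it to your installation.
check_hadoop_cmd /usr/bin/hadoop || echo "fix HADOOP_CMD before running R CMD INSTALL"
```

On many installations the launcher lives elsewhere. Running `command -v hadoop` at the Ubuntu prompt shows the path to use.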
B. Next, let’s install the plyrmr package. Its dependencies are rmr2 (which is already
installed), R.methodsS3, Hmisc and rjson. The package Hmisc in turn depends on acepack,
which can be installed only if gfortran is present. Hence, we need to start with gfortran and
install it using the following set of commands from the Ubuntu prompt.
$ sudo -i
$ apt-get update
$ apt-get install gfortran
Next, install acepack using the following command from the R console (or RStudio).
> install.packages("acepack", repos = "https://cran.rstudio.com",
                   dependencies = TRUE)
Similarly, install the packages Hmisc and R.methodsS3. The rjson package can be installed the
same way if it is missing.
> install.packages("Hmisc", repos = "https://cran.rstudio.com",
                   dependencies = TRUE)
> install.packages("R.methodsS3", repos = "https://cran.rstudio.com",
                   dependencies = TRUE)
Finally, install plyrmr from the Ubuntu prompt.
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL /home/amit/Downloads/plyrmr_0.6.0.tar.gz
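The CRAN installs in this step can also be driven from the Ubuntu prompt instead of the R console. The sketch below only prints the Rscript commands, in the dependency order described above, so that they can be reviewed before anything is installed. The helper name cran_install_cmd is hypothetical, and the sketch assumes that Rscript is installed alongside R.

```shell
#!/bin/sh
# Print (not run) an Rscript command that installs one CRAN package.
# cran_install_cmd is a hypothetical helper; the repos URL is the one used in the text.
cran_install_cmd() {
    printf 'Rscript -e '\''install.packages("%s", repos="https://cran.rstudio.com", dependencies=TRUE)'\''\n' "$1"
}

# Dependency order from the text: acepack first, then Hmisc, then R.methodsS3.
for pkg in acepack Hmisc R.methodsS3; do
    cran_install_cmd "$pkg"
done
```

Printing the commands first is a deliberate choice. Once the list looks right, the output can be piped to sh to run the installs.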
C. Next, we move on to install the rhdfs package. It depends on the rJava package, so we
should first install rJava from the R console (or RStudio) using the following
command.
> install.packages("rJava")
The rJava installation often runs into problems, especially on Ubuntu. If it fails,
you may have to install it from the Ubuntu prompt using the following
commands.
$ sudo add-apt-repository ppa:marutter/c2d4u3.5
$ sudo apt-get update
$ sudo apt-get install r-cran-rjava
$ sudo R CMD javareconf
Finally, install rhdfs from the Ubuntu prompt.
$ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL /home/amit/Downloads/rhdfs_1.0.8.tar.gz
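Many rJava problems come down to R not finding a Java installation. The following sketch assumes the common convention that JAVA_HOME points at a directory containing bin/java. It is a pre-check only, not part of rJava or rhdfs, and the helper name looks_like_java_home is hypothetical.

```shell
#!/bin/sh
# Report whether a directory looks like a usable Java home (contains an executable bin/java).
# looks_like_java_home is a hypothetical helper name used only for this sketch.
looks_like_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# JAVA_HOME may be unset; ${JAVA_HOME:-} expands to an empty string in that case.
if looks_like_java_home "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks usable: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no bin/java; rJava configuration may fail"
fi
```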
D. Next, we install the rhbase package. This package depends on Thrift, so let’s first
install Thrift.
1. The following command will install tools and libraries required to build and install the
Apache Thrift compiler and C++ libraries on a Debian/Ubuntu Linux based system.
$ sudo apt-get install automake bison flex g++ git libboost-all-dev libevent-dev libssl-dev libtool make pkg-config
2. Download Thrift from http://thrift.apache.org/download. Copy the downloaded file into the
desired directory, untar it, and then build and install it using the commands given below.
$ tar -xvf /home/amit/Downloads/thrift-0.12.0.tar.gz
$ cd thrift-0.12.0/
$ sudo ./configure
$ sudo make
$ sudo make install
$ thrift -version
Thrift version 0.9.1
3. Check the location of thrift.pc
$ cd /usr/local/lib/pkgconfig/
4. Configure the thrift.pc file.
$ sudo gedit /usr/local/lib/pkgconfig/thrift.pc
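The version check in step 2 and the thrift.pc lookup in step 3 can also be scripted. The following is a minimal sketch, assuming the "Thrift version ..." output format shown above and the /usr/local/lib/pkgconfig location used in the text. The helper names thrift_version_from and has_thrift_pc are hypothetical.

```shell
#!/bin/sh
# Extract the version number from a line such as "Thrift version 0.9.1".
# thrift_version_from is a hypothetical helper name for this sketch.
thrift_version_from() {
    printf '%s\n' "$1" | sed -n 's/^Thrift version \([0-9][0-9.]*\).*$/\1/p'
}

# Report whether thrift.pc exists in the given pkg-config directory.
has_thrift_pc() {
    [ -f "$1/thrift.pc" ]
}

# Only query the compiler if it is installed, so the sketch runs harmlessly anywhere.
if command -v thrift >/dev/null 2>&1; then
    thrift_version_from "$(thrift -version)"
fi
has_thrift_pc /usr/local/lib/pkgconfig && echo "thrift.pc found" || echo "thrift.pc not found"
```

Comparing the extracted number with the release you unpacked is a quick way to notice when an older Thrift earlier on the PATH is being picked up instead of the one just built.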