RHadoop for using Hadoop from R

RHadoop is a collection of open source packages with which an R user can manage and analyze data stored in the Hadoop Distributed File System (HDFS). Behind the scenes, RHadoop translates these operations into MapReduce jobs and runs them on the Hadoop cluster.

The various packages in RHadoop and their uses are as follows:

  • rhdfs: Using this package, a user can connect to HDFS from R and perform basic actions such as reading, writing, and modifying files (a short sketch follows this list).
  • rhbase: This is the package used to connect to an HBase database from R and to read, write, and modify tables.
  • plyrmr: Using this package, an R user can perform common data manipulation tasks such as slicing and dicing datasets; it plays a role similar to that of packages such as plyr or reshape2.
  • rmr2: Using this package, a user can write MapReduce functions in R and execute them on data stored in HDFS.
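
As a small illustration of the rhdfs package, here is a minimal sketch of the basic file operations. It assumes a working Hadoop installation with the HADOOP_CMD environment variable pointing to the hadoop binary; the file names and paths are hypothetical:

    >library(rhdfs)
    >hdfs.init()                                      #connect to HDFS
    >hdfs.ls('/user')                                 #list an HDFS directory
    >hdfs.put('corpus.txt','/user/data/corpus.txt')   #copy a local file into HDFS
    >hdfs.get('/user/data/corpus.txt','copy.txt')     #copy an HDFS file to the local disk
    >hdfs.rm('/user/data/corpus.txt')                 #delete a file from HDFS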

Unlike the other packages discussed in this book, the packages associated with RHadoop are not available from CRAN. They can be downloaded from the GitHub repository at https://github.com/RevolutionAnalytics and are installed from the local drive.
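
For example, after downloading a package archive from the repository, it can be installed from the local drive with a standard install.packages() call. The file name below is only a placeholder for whichever version you download, and each package's CRAN dependencies must be installed beforehand:

    >install.packages('rmr2_3.3.1.tar.gz', repos = NULL, type = 'source')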

Here is a sample MapReduce program, written using the rmr2 package, that counts the number of words in a corpus (reference 3 in the References section of this chapter):

  1. The first step involves loading the rmr2 library and choosing the execution backend:
    >library(rmr2)
    >LOCAL <- TRUE
    >if(LOCAL) rmr.options(backend = 'local') #run rmr2 locally instead of on the cluster
  2. The second step involves writing the map function. This function takes each line in the text document and splits it into words, emitting a key-value pair for every word occurrence, with the word as the key and 1 as the value:
    >#map function
    >map.wc <- function(k, lines){
        words.list <- strsplit(lines, '\\s+')  #split each line on whitespace
        words <- unlist(words.list)
        return(keyval(words, 1))               #emit each word with a value of 1
    }
  3. The third step involves writing the reduce function. This function gathers all the values for the same key coming from the different mappers and sums them. Since each word is a key and each value is 1, the output of the reduce step is the count of each word:
    >#reduce function
    >reduce.wc <- function(word, counts){
        return(keyval(word, sum(counts)))
    }
  4. The fourth step involves writing a word count function that combines the map and reduce functions, and executing it on the input text file stored in HDFS. The variables hdfs.data and hdfs.out hold the HDFS paths of the input file and the output directory (the paths shown here are only examples):
    >#word count function
    >wordcount <- function(input, output = NULL){
        mapreduce(input = input, output = output, input.format = "text",
                  map = map.wc, reduce = reduce.wc, combine = T)
    }
    >hdfs.data <- '/user/data/corpus.txt' #example HDFS path of the input text
    >hdfs.out <- '/user/data/wc.out'      #example HDFS path for the output
    >out <- wordcount(hdfs.data, hdfs.out)
  5. The fifth step involves getting the output from HDFS into R and printing the first few rows:
    >results <- from.dfs(out)
    >results.df <- as.data.frame(results, stringsAsFactors = F)
    >colnames(results.df) <- c('word', 'count')
    >head(results.df)
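
As a quick follow-up in plain base R, the results data frame can be sorted so that the most frequent words appear first:

    >results.df <- results.df[order(results.df$count, decreasing = TRUE), ]
    >head(results.df, 5) #the five most frequent words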