R and Big Data

Before we dive deep into building a few recommender systems using real-world data sets, we'll take a short detour and spend some time thinking about Big Data. Many real-world recommender systems arise out of the analysis of massive data sets. Examples include the product recommendation engine of amazon.com and the movie recommendation engine of Netflix.

Most, if not all, of the data sets that we have looked at in this book have been relatively small in size and were chosen quite intentionally so that the reader can follow along with the examples without needing access to powerful computing resources. These days, the field of predictive analytics, along with the related fields of machine learning, data science, and data analysis in general, places heavy emphasis on handling Big Data.

Note

The term Big Data has become a buzzword in everyday conversation and, as an inevitable result, we often encounter conflicting or muddled interpretations of it. For example, Big Data is concerned not only with the volume of a data set, but also with issues such as how quickly we need to process data in real time and the diversity of the data that we need to process. Consequently, volume, velocity, and variety are often referred to as the three Vs of Big Data. To learn more about this exciting field, an excellent reference is The Big Data Revolution by Jason Kolb and Jeremy Kolb.

The base R distribution is designed to operate on data that fits into the computer's memory. Often, the data we want to analyze is so large that processing it all in the memory of a single computer isn't possible. In some cases, we can take advantage of on-demand computing resources, such as Amazon's EC2, and gain access to machines with over 100 GB of memory. Even then, processing very large data sets entirely in memory can be very time consuming in R, so we may still need ways to improve performance. Consequently, the approaches for handling Big Data in R can be roughly grouped into three broad areas.

The first approach to handling Big Data is sampling. That is, instead of using all of the data available to us to build our model, we create a representative sample of the data. Sampling is generally the least recommended approach, as we naturally expect model performance to degrade when we train on less data. It can nonetheless work quite well if the sample we are able to use remains very large in absolute terms (for example, a billion rows) as well as relative to the original data set. Great care must be taken to avoid introducing any form of bias into the sample.
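To make this concrete, the following is a minimal sketch of simple random sampling in base R. Here, big_df is a hypothetical data frame that is too large to model in full, and the 1 percent sampling fraction is an assumption chosen purely for illustration.

# A sketch of simple random sampling before modeling; big_df and the
# 1% sampling fraction are placeholders for illustration only.
set.seed(719)                                   # make the sample reproducible
n <- nrow(big_df)
keep <- sample.int(n, size = floor(0.01 * n))   # 1% simple random sample of row indices
big_sample <- big_df[keep, ]

# For classification tasks, sampling within each class (stratified sampling)
# helps avoid distorting the class distribution of the sample.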

A second approach to working with Big Data is to take advantage of distributed processing. The key idea here is to split our data across different machines working together in a cluster. Individually, the machines need not be very powerful because they will only process chunks of the data.

The Programming with Big Data in R (pbdR) project offers a number of R packages for high-performance computing that interface with parallel processing libraries. More details can be found on the project's website, http://r-pbd.org/, and a good starting point is the pbdDEMO package, which is designed for newcomers to the project; a minimal example of a pbdR script is sketched below.
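The following sketch uses the pbdMPI package from this project and assumes that an MPI library is installed on the cluster; every process runs the same script and identifies itself by its rank.

# hello_pbd.R -- a minimal pbdMPI sketch (assumes a working MPI installation)
library(pbdMPI)

init()                              # start the MPI communicator
my_rank <- comm.rank()              # this process's id: 0, 1, 2, ...
n_ranks <- comm.size()              # total number of processes in the cluster

comm.cat("Process", my_rank, "of", n_ranks, "\n", all.rank = TRUE)
finalize()                          # shut down MPI cleanly

Such a script is typically launched from the command line with something like mpiexec -np 4 Rscript hello_pbd.R, so that four R processes run it in parallel.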

Another alternative is to interface R to work directly with a distributed processing platform such as Apache Hadoop. An excellent reference for doing this is Big Data Analytics with R and Hadoop published by Packt Publishing. Finally, an exciting new alternative to working with Hadoop is the Apache Spark project. SparkR is a package that allows running jobs on a Spark cluster directly from the R shell. This package is currently available at http://amplab-extras.github.io/SparkR-pkg/.
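As a rough illustration, the following sketch is based on the RDD-style examples distributed with the SparkR-pkg release mentioned above; function names and arguments have changed in later versions of SparkR, so treat this as indicative rather than definitive.

# A sketch of running a simple job on a local Spark instance from R
# (API as documented for the amplab-extras SparkR-pkg release).
library(SparkR)

sc  <- sparkR.init(master = "local")     # connect to a local Spark instance
rdd <- parallelize(sc, 1:1000, 2)        # distribute a vector as an RDD in 2 slices
total <- reduce(rdd, "+")                # sum the elements across the cluster
print(total)                             # 500500
sparkR.stop()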

The third possible avenue for working with Big Data is to use machines, potentially provisioned on demand, that have very large amounts of memory, and to optimize performance on a single machine. One possibility is to interface with a language such as C++ and leverage advanced data structures that optimize the processing of data for a particular problem; this way, some of the processing is done outside of R.

In R, the Rcpp package provides an interface for working with C++. Another excellent package for working with large data sets, and the one we will use in this chapter when we load some real-world data sets, is data.table, which is specifically designed to take advantage of machines with a lot of memory.
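As a small illustration of the Rcpp approach, the following sketch moves a simple numeric loop into C++ via Rcpp's cppFunction, which compiles the C++ snippet and exposes it as an ordinary R function; the function name sumSquaresC is chosen here purely for illustration.

# Offloading a hot loop to C++ with Rcpp (a minimal sketch)
library(Rcpp)

cppFunction('
double sumSquaresC(NumericVector x) {
  double total = 0;
  int n = x.size();
  for (int i = 0; i < n; ++i) {
    total += x[i] * x[i];   // accumulate the sum of squares in C++
  }
  return total;
}')

x <- rnorm(1e6)
sumSquaresC(x)              # same result as sum(x ^ 2), computed in compiled code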

Loading data sets on the order of 100 GB on a 64-bit machine is a common use case for the data.table package. The package has been designed with the goal of substantially reducing the computation time of common operations performed on data frames. More specifically, it introduces the data table as a replacement for R's ubiquitous data frame. A data table is not only a more efficient structure on which to perform operations, but also offers a number of shortcuts and commands that make programming with data sets faster.
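The following brief sketch shows the flavor of this syntax. The file ratings.csv and its columns user_id and rating are hypothetical stand-ins for the real-world data sets we load later in the chapter.

# A brief sketch of data.table syntax; ratings.csv is a hypothetical file
# with columns user_id, item_id, and rating.
library(data.table)

dt <- fread("ratings.csv")      # fast file reader that returns a data table

# DT[i, j, by]: filter rows, compute on columns, and group in one expression
avg_by_user <- dt[rating > 0, .(mean_rating = mean(rating)), by = user_id]

setkey(dt, user_id)             # index the table by user_id for fast lookups and joins
dt["u100"]                      # binary-search lookup of one (hypothetical) user's rows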

A critical advantage of this package is that the data table structure is accepted by other packages anywhere a data frame is; packages that are unaware of the data table syntax can simply use data frame syntax when working with data tables. An excellent online resource for learning more about the data.table package is a course by Matt Dowle, the main creator of the package, available at https://www.datacamp.com/courses/data-analysis-the-data-table-way. Without further ado, we will start building some recommender systems, loading the data into data tables using the data.table package.
