Let us analyse each of these problems in more detail.
Problem 1: Data set size exceeding the available memory size.
In general, most personal computers in use today have 16 GB of RAM. Assuming that 20–30% is needed for system activities and other application programs, it is fair to assume that at most around 70% of RAM, i.e., about 11 GB of the available 16 GB, can be utilized by an R program. On a computer with less RAM, say 4 GB, only around 3 GB can be used by R. In conventional R programming, the data frame object is created in the R workspace, which resides in RAM. Therefore, using conventional R programming on a relatively high-end computer with 16 GB of RAM, we can only work with data sets smaller than about 11 GB; larger data sets simply will not fit.
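To get a feel for how quickly in-memory objects grow, consider a small illustrative check (the data frame below is hypothetical, and the exact size reported will vary slightly by platform and R version):
Code:
>df_demo <- data.frame(x = runif(1e7), y = runif(1e7)) # 10 million rows, 2 numeric columns
>print(object.size(df_demo), units = "Mb") # 2 columns x 8 bytes x 1e7 rows
152.6 Mb
>rm(df_demo); gc() # delete the object and return the memory to the system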
Problem 2: Slow processing speed of R.
R is an interpreted language, which makes it slow to begin with. On top of that, the R core is
single-threaded, which means code blocks are executed one by one on a single CPU core.
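To make this concrete: base R's apply functions use one core, and spreading work across cores requires an add-on such as the bundled parallel package. A minimal sketch follows (the core count reported depends on your machine):
Code:
>library(parallel)
>detectCores() # number of CPU cores visible to R
[1] 4
>cl <- makeCluster(2) # start two worker processes
>res <- parLapply(cl, 1:4, function(i) i^2) # distribute the calls across the workers
>stopCluster(cl) # shut the workers down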
So how do we solve these problems and handle large data sets with reasonably good performance?
R has a set of packages supporting Big Data processing. Let us review a few of these libraries and see how they can be used to solve the problems mentioned above.
8.3.1 ff and ffbase Packages
The ff package is quite useful for processing large data sets. Instead of the conventional approach of creating a data frame object for the data set in the R workspace, the ff package creates an ff data structure in the R workspace. The physical data set is stored on the hard drive, divided into multiple chunks; the ff object created in RAM holds just the metadata and is much smaller in size. Thus, larger data sets can be loaded into R for processing without heavy RAM requirements. Let us review this with some test code and a real data set.
We shall use a credit card fraud data set containing transactions made by European cardholders in September 2013. The data set presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. It was collected and analysed during a research collaboration on Big Data mining and fraud detection between Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles). The data set is 148 MB in size and is provided as online content for this book.
Now, let us first apply the conventional R program and check the performance.
Code:
>df_ccard <- read.table("creditcard.csv", sep = ",", header = TRUE)
>object.size(df_ccard)
69496704 bytes
Outcome: The data frame object created in the R workspace is 69.5 MB in size, and the time taken to load the data set is 42.5 seconds.
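The load time quoted above can be measured with the base R function system.time(); a minimal sketch follows (the 'elapsed' entry of the result is the wall-clock time in seconds and will vary with hardware):
Code:
>system.time(df_ccard <- read.table("creditcard.csv", sep = ",", header = TRUE))
# the value shown under 'elapsed' is the load time in seconds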
Now let us try to do the same thing using the ff package. For that, we first have to install the package from a CRAN mirror and load it.
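A minimal sketch of those steps, plus the load itself, is shown below (read.csv.ffdf() is the delimited-file reader from the ff package; its exact defaults may differ across package versions):
Code:
>install.packages("ff") # one-time download from a CRAN mirror
>library(ff)
>ff_ccard <- read.csv.ffdf(file = "creditcard.csv", header = TRUE) # data chunks go to disk
>object.size(ff_ccard) # only the metadata sits in RAM, so this is small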