Working with Big Data inR | 213
Let us analyse each of these problems in more detail.
Problem 1: Data set size exceeding the available memory size.
In general, most personal computers in current use have 16 GB of RAM. Assuming that 20–30% of it is needed for system activities and other application programs, it is fair to assume that at most about 70% of the RAM, i.e., roughly 11 GB out of the available 16 GB, can be utilized by an R program. On a computer with less RAM, say 4 GB, only around 3 GB is available to R. In conventional R programming, a data frame object is created in the R workspace, which sits in RAM. Therefore, with conventional R programming on a relatively high-end computer with 16 GB of RAM, we can only work with data sets smaller than about 11 GB; data sets larger than that cannot be loaded at all.
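As a quick sanity check, the same arithmetic can be reproduced in R itself; the 30-column, all-numeric row layout below is an illustrative assumption, not a property of any particular data set.

  > usable_ram <- 0.7 * 16 * 1024^3     # ~70% of 16 GB, in bytes
  > bytes_per_row <- 30 * 8             # 30 numeric columns x 8 bytes each
  > usable_ram / bytes_per_row          # rows of this shape that fit, roughly
  [1] 50107952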
Problem 2: Slow processing speed of R.
R is an interpreted language, which makes it slow to begin with. On top of that, the R core is
single-threaded, which means code blocks are executed one by one on a single CPU core.
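As a side note, base R does ship with the parallel package, which can spread independent tasks across CPU cores. A minimal sketch follows; the workload inside parLapply is just a placeholder.

  > library(parallel)
  > cl <- makeCluster(detectCores() - 1)   # leave one core for the OS
  > res <- parLapply(cl, 1:4, function(i) sum(rnorm(1e6)))
  > stopCluster(cl)

This helps with embarrassingly parallel workloads, but it does not by itself address the memory problem.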
So how do we solve these problems and handle large data sets with reasonably good performance?
R has a set of packages supporting Big Data processing. Let us review a few of them and see how they can be used to solve the problems mentioned above.
8.3.1 ff and ffbase Packages
The ff package is quite useful for processing large data sets. Instead of the conventional approach of creating a data frame object for the data set in the R workspace, the ff package creates an ff data structure in the workspace and stores the physical data set on the hard drive, divided into multiple chunks. The ff object held in RAM is just the metadata and is much smaller in size. Thus, larger data sets can be loaded into R for processing without a high RAM requirement. Let us review this with some test code and a real data set.
We shall use a credit card fraud data set containing transactions made by European cardholders in September 2013. The data set presents the transactions that occurred over two days, in which there are 492 frauds out of 284,807 transactions. It was collected and analysed during a research collaboration between Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on Big Data mining and fraud detection. The data set is 148 MB in size and is available as online content accompanying this book.
Now, let us first use the conventional R approach and check the performance.
Code:
  > df_ccard <- read.table("creditcard.csv", sep = ",", header = TRUE)
  > object.size(df_ccard)
  69496704 bytes
Outcome: The data frame object created in the R workspace is 69.5 MB in size and the time taken to load the data set is 42.5 seconds.
Now let us try to do the same thing using the ff package. For that, we have to first install the package from a CRAN mirror and load it.
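If ff is not already installed, the one-time installation command is as follows (assuming a standard CRAN mirror is configured):

  > install.packages("ff")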
Code:
  > library(ff)
  > ff_ccard <- read.table.ffdf(file = "creditcard.csv", sep = ",",
        VERBOSE = TRUE, header = TRUE, next.rows = 10000, colClasses = NA)
read.table.ffdf 1..10000 (10000) csv-read=0.51sec ffdf-write=1.8sec
read.table.ffdf 10001..20000 (10000) csv-read=0.44sec ffdf-write=0.11sec
read.table.ffdf 20001..30000 (10000) csv-read=0.42sec ffdf-write=0.11sec
read.table.ffdf 30001..40000 (10000) csv-read=0.45sec ffdf-write=0.11sec
read.table.ffdf 40001..50000 (10000) csv-read=0.48sec ffdf-write=0.11sec
read.table.ffdf 50001..60000 (10000) csv-read=0.46sec ffdf-write=0.11sec
read.table.ffdf 60001..70000 (10000) csv-read=0.43sec ffdf-write=0.13sec
read.table.ffdf 70001..80000 (10000) csv-read=0.44sec ffdf-write=0.1sec
read.table.ffdf 80001..90000 (10000) csv-read=0.45sec ffdf-write=0.57sec
read.table.ffdf 90001..100000 (10000) csv-read=0.46sec ffdf-write=0.11sec
read.table.ffdf 100001..110000 (10000) csv-read=0.47sec ffdf-write=0.13sec
read.table.ffdf 110001..120000 (10000) csv-read=0.47sec ffdf-write=0.11sec
read.table.ffdf 120001..130000 (10000) csv-read=0.46sec ffdf-write=0.11sec
read.table.ffdf 130001..140000 (10000) csv-read=0.49sec ffdf-write=0.2sec
read.table.ffdf 140001..150000 (10000) csv-read=0.49sec ffdf-write=0.12sec
read.table.ffdf 150001..160000 (10000) csv-read=0.47sec ffdf-write=0.11sec
read.table.ffdf 160001..170000 (10000) csv-read=0.48sec ffdf-write=0.11sec
read.table.ffdf 170001..180000 (10000) csv-read=0.49sec ffdf-write=0.22sec
read.table.ffdf 180001..190000 (10000) csv-read=0.46sec ffdf-write=0.11sec
read.table.ffdf 190001..200000 (10000) csv-read=0.47sec ffdf-write=0.11sec
read.table.ffdf 200001..210000 (10000) csv-read=0.47sec ffdf-write=0.11sec
read.table.ffdf 210001..220000 (10000) csv-read=0.47sec ffdf-write=0.11sec
read.table.ffdf 220001..230000 (10000) csv-read=0.48sec ffdf-write=0.11sec
read.table.ffdf 230001..240000 (10000) csv-read=0.47sec ffdf-write=0.14sec
read.table.ffdf 240001..250000 (10000) csv-read=0.49sec ffdf-write=0.11sec
read.table.ffdf 250001..260000 (10000) csv-read=0.48sec ffdf-write=0.09sec
read.table.ffdf 260001..270000 (10000) csv-read=0.49sec ffdf-write=0.11sec
read.table.ffdf 270001..280000 (10000) csv-read=0.48sec ffdf-write=0.13sec
read.table.ffdf 280001..284807 (4807) csv-read=0.22sec ffdf-write=0.1sec
csv-read=14.34sec ffdf-write=5.6sec TOTAL=19.94sec
  > object.size(ff_ccard)
  104336 bytes
Outcome: The ff object created in the R workspace is only 0.1 MB in size and the time taken to load the data set is 19.9 seconds.
Clearly, by using the ff package, the RAM requirement has decreased drastically. There is also a significant improvement in loading performance.
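Timings like the 42.5 and 19.9 seconds quoted above can be measured with base R's system.time(); a minimal sketch:

  > system.time(df_ccard <- read.table("creditcard.csv", sep = ",", header = TRUE))
  > system.time(ff_ccard <- read.table.ffdf(file = "creditcard.csv", sep = ",",
        header = TRUE, next.rows = 10000))

The 'elapsed' component of the result is the wall-clock time reported here.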
Note that the ‘next.rows’ parameter of the read.table.ffdf function specifies the number of rows to be read into each chunk. Since the ‘next.rows’ parameter was assigned a value of 10,000 in the above case and the total number of rows in the data set is 284,807, the total number of chunks created is 29 (28 chunks of 10,000 rows and 1 chunk of 4,807 rows).
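This arithmetic can be confirmed directly in R:

  > ceiling(284807 / 10000)
  [1] 29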
In addition to importing large data sets into R as shown above, the ff package also includes a number of other data processing functions. The ffbase package, on the other hand, extends the ff package with a number of statistical and mathematical functions. It also enables some classification and regression models to be applied on ff objects through third-party packages supporting Big Data analytics, like biglm and bigrf. Presented below are a few salient functions of the ff and ffbase packages which can be used for data processing and analytics on large data sets.
1. class(): Gives the type of an object.

   > class(ff_ccard)
   [1] "ffdf"

2. dim(): Gives the dimensions of the ffdf object.

   > dim(ff_ccard)
   [1] 284807     31

3. dimnames(): Gives the dimension names, i.e., the attribute names, of the ffdf object.

   > dimnames(ff_ccard)
   [[1]]
   NULL

   [[2]]
    [1] "Time" "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
   [14] "V13"  "V14" "V15" "V16" "V17" "V18" "V19" "V20" "V21" "V22" "V23" "V24" "V25"
   [27] "V26"  "V27" "V28" "Amount" "Class"

4. unique(): Gives the unique values of an attribute.

   > library(ffbase)
   > unique(ff_ccard$Class)
   > length(unique(ff_ccard$V1))
   [1] 275663

5. as.data.frame.ffdf(): Converts an ffdf structure to a standard data.frame object.

   > as.data.frame.ffdf(ff_ccard$V1)
6. describe(): Like the core R summary() function, gives basic descriptive statistics on the data (describe() comes from the Hmisc package).

   > library(Hmisc)
   > describe(as.data.frame.ffdf(ff_ccard$V1))
   as.data.frame.ffdf(ff_ccard$V1)
          n  missing distinct     Info      Mean     Gmd
     284807        0   275663        1 1.176e-15   1.928
        .05      .10      .25      .50      .75      .90      .95
   -2.89915 -1.89327 -0.92037  0.01811  1.31564  2.01541  2.08122

   lowest : -56.407510 -46.855047 -41.928738 -40.470142 -40.042537
   highest:   2.430507   2.439207   2.446505   2.451888   2.454930

7. subset.ffdf(): Subsets an ffdf object.

   > sub_ffcard <- subset.ffdf(ff_ccard, Class == 1)
   > dim(sub_ffcard)
   [1] 492  31

   This is the number of records having Class = 1, i.e., the fraudulent records.

   > sub_ffcard <- subset.ffdf(ff_ccard, Class == 1, select = c(Amount))
   > sum(as.data.frame.ffdf(sub_ffcard))
   [1] 60127.97

   This is the total amount of the fraudulent transactions.
8. write.table.ffdf() or write.csv.ffdf(): Exports an ff object to a TXT (or CSV) file.

   > write.table.ffdf(sub_ffcard, "Fraud transactions.txt", VERBOSE = TRUE)
   write.table.ffdf 1..492 (492, 100%) ffdf-read=0sec csv-write=0sec
   ffdf-read=0sec csv-write=0sec TOTAL=0sec

9. glm(): Fits a generalized linear model relating the target variable to the predictor variables.

   > install.packages("biglm")
   > library(biglm)
   > mod_logit <- glm(Class ~ V1 + V2 + V3, data = ff_ccard,
         family = binomial(link = "logit"), na.action = na.omit)
   > mod_logit

   Call:  glm(formula = Class ~ V1 + V2 + V3, family = binomial(link = "logit"),
       data = ff_ccard, na.action = na.omit)

   Coefficients:
   (Intercept)           V1           V2           V3
       -7.5400       0.2813       0.3138      -0.6948

   Degrees of Freedom: 284806 Total (i.e. Null);  284803 Residual
   Null Deviance:      7242
   Residual Deviance: 4762     AIC: 4770
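Note that calling glm() as above pulls the data through R's standard model-fitting code. For a truly chunk-wise fit, ffbase also provides a bigglm method for ffdf objects that works with the biglm package. A minimal sketch follows, using the same formula as above; treat the exact arguments as an assumption to verify against the ffbase documentation.

  > library(ffbase)
  > library(biglm)
  > mod_big <- bigglm(Class ~ V1 + V2 + V3, data = ff_ccard,
        family = binomial(), chunksize = 10000)
  > summary(mod_big)

Because the model is updated one chunk at a time, RAM usage stays bounded regardless of the data set size.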