R and its package ecosystem often provide several alternative ways of performing the same task. R also encourages users to create their own functions for particular tasks. When execution time is important, benchmarking is necessary to see which strategy works best. We will concentrate on speed in this recipe. The two tasks we will look at are loading data into R and joining two data objects based on a common variable. All tests were run on a Windows 7 desktop with a 2.4 GHz Intel processor and 8 GB of RAM.
We will use the ann2012full and industry data objects for our performance experiments here, along with the 2012 annual employment data CSV file for data loading. Since you already have these, you are good to go. If you don't, you will need to install the two packages, rbenchmark and microbenchmark, using the install.packages() command.
The following steps will walk us through benchmarking two different tasks in R:
The 2012.annual.singlefile.csv file has 3,556,289 lines of data and 15 columns. While we used fread in this chapter, there are many other possible ways of importing this data, which are as follows:

- Use read.csv to read the CSV file
- Unzip the 2012_annual_singlefile.zip data file on the fly and read the data using read.csv
- Save the data to an RData file the first time we load it, and load this file on subsequent occasions to import the data into R

We time each of these strategies using the system.time function, which measures the time (both elapsed and actual computing time) taken for the task to be performed:

    > system.time(fread('data/2012.annual.singlefile.csv'))
       user  system elapsed
     14.817   0.443   15.23
Note that the times you see will be different from those listed in the preceding output.
For a fair comparison of the different strategies, we use the rbenchmark package, which provides the benchmark function that allows the simultaneous comparison of several expressions:

    library(rbenchmark)
    opload <- benchmark(
      CSV = read.csv('data/2012.annual.singlefile.csv',
                     stringsAsFactors = F),
      CSVZIP = read.csv(unz('data/2012_annual_singlefile.zip',
                            '2012.annual.singlefile.csv'),
                        stringsAsFactors = F),
      LOAD = load('data/ann2012full.rda'),
      FREAD = fread('data/2012.annual.singlefile.csv'),
      order = 'relative',  # report in order from shortest to longest
      replications = 5
    )
You can refer to the following screenshot for the output of the preceding commands:
Note that the results are ordered, and the relative times are recorded under the relative column. This shows that fread is quite a bit faster than reading with read.csv. What is very interesting is that, on average, it is also about four times faster than loading the data from an RData file, which is the usual storage method for R data. It is apparently faster to load the data from the file using fread than from R's own serialized format!
For the second task, we used left_join in this chapter, but there are three other strategies we can take, which are as follows:

- The merge function available in R's standard library
- The join function from the plyr package
- The merge function from the data.table package, first transforming the data into data.table objects

We use the benchmark function to compare these strategies with left_join:

    ann2012full_dt <- data.table(ann2012full, key = 'industry_code')
    industry_dt <- data.table(industry, key = 'industry_code')
    op <- benchmark(
      DT = merge(ann2012full_dt, industry_dt,  # dispatches to data.table's merge method
                 by = 'industry_code', all.x = T),
      PLYR = plyr::join(ann2012full, industry,
                        by = 'industry_code', type = 'left'),
      DPLYR = dplyr::left_join(ann2012full, industry),
      DPLYR2 = dplyr::left_join(ann2012full_dt, industry_dt),
      MERGE = merge(ann2012full, industry,
                    by = 'industry_code', all.x = T),
      order = 'relative',
      replications = 5
    )
You can refer to the following screenshot for the output of the preceding commands:
Here, we see that the data.table method is by far the fastest strategy. Using dplyr is about 12 times slower for this particular task, plyr is about 100 times slower, and the standard merge method is about 200 times slower. There is a bit of overhead in converting the data.frame objects to data.table objects, but the advantage gained in this task more than makes up for it.
The basic workhorse of time benchmarking in R is the system.time function. This function records the time when evaluation of an expression starts, runs the expression, and then notes the time when it finishes. It then reports the difference between the two times. By default, garbage collection takes place before each evaluation so that the results are more consistent and maximal memory is freed for each evaluation.
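As a minimal base-R sketch of this behavior, system.time returns a proc_time object whose components can be inspected directly (the toy expression below is ours, not from the recipe):

```r
# Time a vectorized computation; system.time() evaluates the expression
# in the calling frame and returns a proc_time object with user, system,
# and elapsed components (in seconds).
timing <- system.time({
  x <- sqrt(seq_len(1e6))
})

elapsed <- timing[["elapsed"]]  # wall-clock time for the expression
```

Because system.time evaluates the expression in the caller's environment, x is still available afterwards; gcFirst = TRUE (the default) triggers a garbage collection before timing starts.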
The benchmark function in the rbenchmark package provides additional flexibility. It wraps the system.time function and allows several expressions to be evaluated in a single run. It also does some basic computations, such as relative times, to simplify reporting.
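The core idea can be sketched in plain base R. This is a toy illustration only, not the actual rbenchmark implementation, and the function name time_all is made up:

```r
# Toy version of what benchmark() does under the hood: run each quoted
# expression several times with system.time() and report the total
# elapsed time along with a relative-to-fastest column.
time_all <- function(exprs, replications = 5) {
  elapsed <- vapply(exprs, function(e) {
    sum(vapply(seq_len(replications),
               function(i) system.time(eval(e))[["elapsed"]],
               numeric(1)))
  }, numeric(1))
  data.frame(test = names(exprs),
             elapsed = elapsed,
             relative = elapsed / min(elapsed))
}

res <- time_all(list(SQRT  = quote(sqrt(seq_len(1e6))),
                     POWER = quote(seq_len(1e6)^0.5)),
                replications = 3)
```

The real benchmark function adds replications, user/system breakdowns, and ordering options on top of this same pattern.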
In terms of our tasks here, fread uses a powerful optimized C function to read the data, resulting in a high degree of speed. The read.csv function just reads the data file line by line and parses the fields at the comma separator. We can get some speed improvement in our experiments by specifying the column types in read.csv using the colClasses option, since determining data types consumes some execution time. The load function reads the data from RData files created using the save function, which stores binary representations of R objects. It compresses the size of the data a lot, but, as we saw, there are more efficient ways of reading data than loading an RData file.
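As a small illustration of the colClasses option, using a made-up two-column CSV written to a temporary file (the column names and values are hypothetical):

```r
# Write a tiny CSV, then read it with column types declared up front so
# read.csv() skips type inference; industry codes are read as character,
# which also preserves any leading zeros.
tmp <- tempfile(fileext = ".csv")
writeLines(c("industry_code,avg_wage",
             "10,45000",
             "102,52000"),
           tmp)

dat <- read.csv(tmp,
                colClasses = c(industry_code = "character",
                               avg_wage = "numeric"))
```

When colClasses is a named vector, the names are matched against the column names in the file's header.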
The second task we set ourselves to benchmark is a left outer join of the employment data ann2012full with the industry data object of industry codes. The former has 3,556,289 rows and 15 columns, and the latter has 2,469 rows and 2 columns. They are merged on the common variable industry_code. In a left join, all the rows of ann2012full are preserved. For this, the merge commands use the all.x=T option. The join function takes the type='left' option for a left join. For the data.table merge, we first convert the data.frame objects to data.table objects, specifying that each has the same key variable (think of an index in a database), industry_code. The data.table objects are then merged using this key variable.
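The left-join semantics described above can be seen on toy data frames (hypothetical values, not the QCEW data):

```r
# Every row of the left table survives a left join; industry codes with
# no match in the lookup table get NA for the title column.
emp <- data.frame(industry_code = c("10", "10", "999"),
                  wage = c(45000, 52000, 39000))
ind <- data.frame(industry_code = "10",
                  industry_title = "Total, all industries")

joined <- merge(emp, ind, by = "industry_code", all.x = TRUE)
```

Dropping all.x = TRUE would turn this into an inner join, silently discarding the unmatched "999" row.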
There is a bit of new code formatting in this snippet. We use plyr::join and dplyr::left_join, rather than just join and left_join. This style of coding explicitly specifies the package a particular function comes from, to avoid confusion. It is useful when two packages loaded in R contain functions with the same name.
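A classic base-R example of why this matters: the stats package also exports a function named filter (a moving-average filter), which is masked whenever dplyr is attached. The :: operator makes the intent unambiguous:

```r
# Explicitly call the moving-average filter from stats; with sides = 1,
# each output averages the current and previous input values, so the
# first element has no predecessor and comes back as NA.
y <- stats::filter(1:5, rep(1/2, 2), sides = 1)
```

Written this way, the code runs identically whether or not dplyr happens to be loaded.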
The data.table package provides very fast tools for data loading, munging, and joining. The data.table class is derived from data.frame, so many of the R functions that accept data.frame objects can also accept data.table objects. It is for this reason that the data.table object can become your default container for rectangular data.
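This inheritance can be checked directly; the snippet below guards on data.table being installed, and the column names are made up for illustration:

```r
# A data.table is also a data.frame, so base functions such as nrow()
# keep working on it unchanged.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::data.table(industry_code = c("10", "102"),
                               employment = c(100L, 250L))
  dt_classes <- class(dt)   # inherits from data.frame
  dt_rows <- nrow(dt)       # base nrow() works on a data.table
}
```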
You can also use the microbenchmark package for benchmarking purposes.
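A minimal sketch of how microbenchmark is typically used, assuming the package is installed; it repeats each expression many times and reports timings at sub-millisecond resolution:

```r
# Compare two equivalent square-root computations; microbenchmark()
# returns an object with one timing row per evaluation.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  mb <- microbenchmark::microbenchmark(
    SQRT  = sqrt(seq_len(1e4)),
    POWER = seq_len(1e4)^0.5,
    times = 20L
  )
  summary(mb)  # min, median, max, etc. per expression
}
```

Its finer timing resolution makes it better suited than rbenchmark for comparing very fast expressions.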