Benchmarking performance for some common tasks

R and its package ecosystem often provide several alternative ways of performing the same task. R also encourages users to write their own functions for particular tasks. When execution time matters, benchmarking performance is necessary to see which strategy works best. We will concentrate on speed in this recipe. The two tasks we will look at are loading data into R and joining two data objects based on a common variable. All tests are done on a Windows 7 desktop running a 2.4 GHz Intel processor with 8 GB of RAM.

Getting ready

We will use the ann2012full and industry data objects for our performance experiments here, along with the 2012 annual employment data CSV file for data loading. Since you already have these, you are good to go. If you don't, you will need to install the two packages, rbenchmark and microbenchmark, using the install.packages() command.

How to do it…

The following steps will walk us through benchmarking two different tasks in R:

  1. Our first task is to load the employment data into R. The 2012.annual.singlefile.csv file has 3,556,289 lines of data and 15 columns. While we used fread in this chapter, there are many other possible ways of importing this data, which are as follows:
    • The first and most standard way is to use read.csv to read the CSV file
    • You can also unzip the original 2012_annual_singlefile.zip data file on the fly and read the data using read.csv
    • We can save the data to an RData file the first time we load it, and then load the data from that file on subsequent occasions to import the data into R
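
    The RData strategy in the last bullet can be sketched as follows. The file names match the ones used in this recipe, but the exact paths are an assumption about your project layout:

    ```r
    # Read the CSV once (slow), then cache it as a compressed RData file
    ann2012full <- read.csv('data/2012.annual.singlefile.csv',
                            stringsAsFactors = FALSE)
    save(ann2012full, file = 'data/ann2012full.rda')

    # In subsequent sessions, restore the object instead of re-parsing the CSV;
    # load() recreates ann2012full in the workspace under its original name
    load('data/ann2012full.rda')
    ```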
  2. The most basic way to benchmark speed is using the system.time function, which measures both the elapsed (wall-clock) time and the actual CPU time taken for the task to be performed:
    > system.time(fread('data/2012.annual.singlefile.csv'))
       user  system elapsed
      14.817   0.443   15.23
    

    Note that the times you see will differ from those shown in the preceding output.

  3. However, there are packages in R that make benchmarking and comparing different functions much easier. We will introduce the rbenchmark package, which provides the benchmark function that allows the simultaneous comparison of different functions:
    library(rbenchmark)
    opload <- benchmark(
      CSV=read.csv('data/2012.annual.singlefile.csv', stringsAsFactors=F),
      CSVZIP=read.csv(unz('data/2012_annual_singlefile.zip',
       '2012.annual.singlefile.csv'), stringsAsFactors=F),
      LOAD = load('data/ann2012full.rda'),
      FREAD = fread('data/2012.annual.singlefile.csv'),
      order='relative', # Report in order from shortest to longest 
      replications=5
    )
    

    You can refer to the following screenshot for the output of the preceding commands:


    Note that the results are ordered, and the relative times are recorded under the relative column. This shows that fread is quite a bit faster than reading using read.csv. The very interesting thing is that, on average, it is 4 times faster than loading the data from an RData file, which is the usual storage format for R data. It is apparently faster to read the data from the file using fread than to load it from R's own serialized format!

  4. Our second task is to perform a left outer join of two data objects. We'll look at a task that we have already performed—a left join of the employment data with the industry codes. A left join ensures that the rows of data on the left of the operation will be preserved through the operation, and the other data will be expanded by repetition or missing data to have the same number of rows. We used left_join in this chapter, but there are three other strategies we can take, which are as follows:
    • The merge function available in R's standard library
    • The join function from the plyr package
    • The merge function from the data.table package, first transforming the data into data.table objects
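
    To see concretely what "preserving the rows on the left" means, here is a toy example with made-up values (the codes and counts here are illustrative, not real BLS data):

    ```r
    emp <- data.frame(industry_code = c('10', '10', '99'),
                      emplvl = c(5, 7, 2))
    ind <- data.frame(industry_code = '10',
                      industry_title = 'Total, all industries')

    # all.x=TRUE requests a left outer join: all three rows of emp survive,
    # the matching title is repeated for each '10' row, and the unmatched
    # code '99' gets NA for industry_title
    merge(emp, ind, by = 'industry_code', all.x = TRUE)
    ```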
  5. We will again use the benchmark function to compare these strategies with left_join:
    ann2012full_dt <- data.table(ann2012full, key='industry_code')
    industry_dt <- data.table(industry, key='industry_code')
    op <- benchmark(
      DT = merge(ann2012full_dt, industry_dt,
                 by='industry_code', all.x=T),
      PLYR = plyr::join(ann2012full, industry,
                        by='industry_code', type='left'),
      DPLYR = dplyr::left_join(ann2012full, industry),
      DPLYR2 = dplyr::left_join(ann2012full_dt, industry_dt),
      MERGE = merge(ann2012full, industry,
                    by='industry_code', all.x=T),
      order='relative',
      replications=5
    )
    

    You can refer to the following screenshot for the output of the preceding commands:


    Here, we see that the data.table method is a lot faster than any other strategy. Using dplyr is about 12 times slower for this particular task, plyr is about 100 times slower, and the standard merge method is 200 times slower. There is some overhead in converting the data.frame objects to data.table objects, but the speed advantage in this task more than makes up for it.

How it works…

The basic workhorse of time benchmarking in R is the system.time function. This function records the time when evaluation of an expression starts, runs the expression, and then notes the time when it finishes. It then reports the difference of the two times. By default, garbage collection takes place before each evaluation so that the results are more consistent and maximal memory is freed for each evaluation.

The benchmark function in the rbenchmark package provides additional flexibility. It wraps the system.time function and allows several expressions to be evaluated in a single run. It also does some basic computations, such as relative times, to simplify reporting.
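
The microbenchmark package, mentioned in the See also section, offers a similar interface with sub-millisecond timing precision. A minimal sketch, assuming the same data file is available:

```r
library(microbenchmark)
library(data.table)

# times plays the same role as replications in rbenchmark::benchmark;
# the summary reports min, median, and max across the runs
microbenchmark(
  FREAD = fread('data/2012.annual.singlefile.csv'),
  CSV   = read.csv('data/2012.annual.singlefile.csv',
                   stringsAsFactors = FALSE),
  times = 5
)
```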

In terms of our tasks here, fread uses a powerful optimized C function to read the data, resulting in a high degree of speed optimization. The read.csv function just reads the data file line by line and parses the fields by the comma separator. We can get some speed improvements in our experiments by specifying the column types in read.csv, using the colClasses option, since determining data types consumes some execution time. The load function reads the data from RData files created using the save function, which stores binary representations of R objects. It compresses the size of the data a lot, but we see that there are more efficient ways of reading data than loading the RData file.
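
For example, if the leading columns are known to be character codes and the rest numeric, we could declare them up front. The particular 6/9 split below is a hypothetical layout, not the file's actual schema; the vector must match your file's real 15 columns:

```r
# Supplying colClasses lets read.csv skip its type-guessing pass
coltypes <- c(rep('character', 6), rep('numeric', 9))  # hypothetical layout
ann2012full <- read.csv('data/2012.annual.singlefile.csv',
                        stringsAsFactors = FALSE,
                        colClasses = coltypes)
```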

The second task we set ourselves to benchmark is a left outer join of the employment data, ann2012full, with the industry data object of industry codes. The former has 3,556,289 rows and 15 columns, and the latter has 2,469 rows and 2 columns. They are merged based on the common variable, industry_code. In a left join, all the rows of ann2012full will be preserved. For this, the merge commands use the all.x=T option. The join function has the type='left' option for a left join. For the data.table merge, we first convert the data.frame objects to data.table objects, specifying that each has the same key variable (think of an index in a database), industry_code. The data.table objects are then merged using this key variable.
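
With the keys set, data.table also supports a bracket-notation join equivalent to the keyed merge used above; this is a sketch of that alternative, and the output's column order differs from merge's:

```r
library(data.table)
ann2012full_dt <- data.table(ann2012full, key = 'industry_code')
industry_dt <- data.table(industry, key = 'industry_code')

# X[Y] joins on the shared key: every row of ann2012full_dt is looked up
# in industry_dt, with NA filled in where no industry title matches,
# so all rows of the employment data are preserved (a left outer join)
industry_dt[ann2012full_dt]
```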

There is some new notation in this code snippet. We use plyr::join and dplyr::left_join, rather than just join and left_join. This style of coding explicitly specifies that we are using a particular function from a particular package, to avoid confusion. This style is useful when you have functions with the same name in two different packages that are both loaded in R.
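
As a quick illustration of why the pkg:: prefix matters, plyr and dplyr export several functions with the same names (mutate, summarise, and others), so whichever package is attached last masks the other:

```r
library(plyr)
library(dplyr)   # dplyr's mutate, summarise, etc. now mask plyr's versions

# An unqualified call uses whichever package was attached last;
# the :: prefix makes the choice explicit and unambiguous
out1 <- plyr::join(ann2012full, industry,
                   by = 'industry_code', type = 'left')
out2 <- dplyr::left_join(ann2012full, industry, by = 'industry_code')
```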

There's more…

The data.table package provides very fast tools for data loading, munging, and joining. The data.table object is a derivative of the data.frame class, and many of the functions in R that accept data.frame objects can also accept data.table objects. For this reason, the data.table object can become your default container for rectangular data.
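
Because data.table inherits from data.frame, most existing data.frame code works on it unchanged; a small demonstration:

```r
library(data.table)
dt <- data.table(x = 1:3, y = c('a', 'b', 'c'))

# A data.table is also a data.frame, so base functions accept it directly
is.data.frame(dt)   # TRUE
nrow(dt)            # 3
head(dt, 2)         # first two rows, as with any data.frame
```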

See also

  • Hadley Wickham has a very nice exposition on benchmarking that is part of his online book, available at http://adv-r.had.co.nz/Performance.html. He promotes the microbenchmark package for benchmarking purposes.