Be smart about your code

In many cases, the performance of R code can be greatly improved by simply restructuring it; this doesn't change the output of the program, just the way the computation is expressed. Restructurings of this type are often referred to as code refactoring. The refactorings that really make a difference performance-wise usually have to do with either improved allocation of memory or vectorization.

Allocation of memory

Refer all the way back to Chapter 5, Using Data to Reason About the World. Remember when we created a mock population of women's heights in the US, and we repeatedly took 10,000 samples of 40 from it to demonstrate the sampling distribution of the sample means? In a code comment, I mentioned in passing that the snippet numeric(10000) created an empty vector of 10,000 elements, but I never explained why we did that. Why didn't we just create a one-element vector, and continually tack each new sample mean onto the end of it as follows:

set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

means.of.our.samples.bad <- c(1)
# I'm increasing the number of
# samples to 30,000 to prove a point
for(i in 1:30000){
  a.sample <- sample(all.us.women, 40)
  means.of.our.samples.bad[i] <- mean(a.sample)
}

It turns out that R stores vectors in contiguous addresses in your computer's memory. This means that every time a new sample mean gets tacked on to the end of means.of.our.samples.bad, R has to make sure that the next memory block is free. If it is not, R has to find a contiguous section of memory that can fit all the elements, copy the vector over (element by element), and free the memory in the original location. In contrast, when we created an empty vector of the appropriate number of elements, R only had to find a memory location with the requisite number of free contiguous addresses once.

Let's see just what kind of difference this makes in practice. We will use the system.time function to measure the execution time of both approaches:

means.of.our.samples.bad <- c(1)
system.time(
  for(i in 1:30000){
    a.sample <- sample(all.us.women, 40)
    means.of.our.samples.bad[i] <- mean(a.sample)
  }
)

means.of.our.samples.good <- numeric(30000)
system.time(
  for(i in 1:30000){
    a.sample <- sample(all.us.women, 40)
    means.of.our.samples.good[i] <- mean(a.sample)
  }
)
-------------------------------------
   user  system elapsed 
  2.024   0.431   2.465
   user  system elapsed 
  0.678   0.004   0.684

Although an elapsed-time saving of under two seconds doesn't seem like a big deal, (a) it adds up, and (b) the difference gets more and more dramatic as the number of elements in the vector increases.

By the way, this preallocation business applies to matrices, too.
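For instance, here is a minimal sketch of the same idea with matrices; the 10,000-by-5 dimensions and the results.bad/results.good names are hypothetical, chosen just for illustration. Growing a matrix row by row with rbind forces the same reallocate-and-copy behavior we just saw with vectors, while filling in a preallocated matrix does not.

# growing a matrix with rbind forces R to reallocate
# and copy the whole matrix on every iteration
results.bad <- matrix(nrow=0, ncol=5)
system.time(
  for(i in 1:10000){
    results.bad <- rbind(results.bad, rnorm(5))
  }
)

# preallocating the matrix means R only has to find
# the memory once
results.good <- matrix(0, nrow=10000, ncol=5)
system.time(
  for(i in 1:10000){
    results.good[i, ] <- rnorm(5)
  }
)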

Vectorization

Were you wondering why R is so adamant about keeping the elements of vectors in adjoining memory locations? Well, if R didn't, then traversing a vector (like when you apply a function to each element) would require hunting around the memory space for the right elements in different locations. Having the elements all in a row gives us an enormous advantage, performance-wise.

To fully exploit this vector representation, it helps to use vectorized functions, which were first introduced in Chapter 1, RefresheR. These vectorized functions call optimized, blazingly fast C code to operate on whole vectors at once, instead of processing each element with comparatively slower R code. For example, let's say we wanted to square each height in the all.us.women vector. One way would be to use a for loop to square each element as follows:

system.time(
  for(i in 1:length(all.us.women))
    all.us.women[i] ^ 2
)
--------------------------
   user  system elapsed 
  0.003   0.000   0.003

Okay, not bad at all. Now what if we applied a lambda squaring function to each element using sapply?

system.time(
  sapply(all.us.women, function(x) x^2)
)
-----------------------
   user  system elapsed 
  0.006   0.000   0.006

Okay, that's worse. But we can use vapply, a function like sapply that allows us to specify the type of the return value in exchange for faster processing:

system.time(
  vapply(all.us.women, function(x) x^2, numeric(1))
)
-------------------------
   user  system elapsed 
  0.006   0.000   0.005

Still not great. Finally, what if we just square the entire vector?

system.time(
  all.us.women ^ 2
)
----------------------
   user  system elapsed 
      0       0       0

This was so fast that system.time didn't have the resolution to detect any processing time at all. Further, this way of writing the squaring functionality was by far the easiest to read.

The moral of the story is to use vectorized options whenever you can. All of core R's arithmetic operators (+, -, ^, and so on) and many of its mathematical functions (sqrt, log, and so on) are of this type. Additionally, using the rowSums and colSums functions on matrices is faster than apply(A_MATRIX, 1, sum) and apply(A_MATRIX, 2, sum), respectively, for much the same reason.
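As a quick illustration, here is a minimal sketch that times apply against rowSums on a hypothetical 1,000,000-by-10 matrix of random numbers (the matrix and its dimensions are made up for this example); the exact numbers will vary by machine, but rowSums should win comfortably.

A_MATRIX <- matrix(rnorm(1000000 * 10), ncol=10)

system.time(
  apply(A_MATRIX, 1, sum)    # loops over the rows at the R level
)

system.time(
  rowSums(A_MATRIX)          # hands the whole job to compiled code
)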

Speaking of matrices, before we move on, you should know that certain matrix operations are blazingly fast in R, because the routines are implemented in compiled C and/or Fortran code. If you don't believe me, try writing and testing the performance of OLS regression without using matrix multiplication.

If you have the linear algebra know-how, and have the option to rewrite a computation that you need to perform using matrix operations, you should definitely try it out.
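To make that concrete, here is a minimal sketch of OLS computed directly with matrix operations; the design matrix built from all.us.women plus an intercept column, and the made-up response y, are assumptions for illustration only. It solves the normal equations to get the usual (X'X)^-1 X'y coefficients using R's compiled matrix routines.

# hypothetical design: an intercept column and the heights
X <- cbind(1, all.us.women)
# a made-up response, for illustration only
y <- 100 + 2*all.us.women + rnorm(length(all.us.women))

# OLS coefficients via the normal equations, (X'X)^-1 X'y
betas <- solve(crossprod(X), crossprod(X, y))
betas

# these should closely match lm's coefficients
coef(lm(y ~ all.us.women))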
