Memory management in R

One of R's classic weaknesses is its difficulty in handling very large datasets, because R, by default, loads entire datasets into memory.

Using data analysis tools designed for large datasets, such as CERN's Root (available online at http://root.cern.ch), is one obvious solution to this problem. However, Root is a completely different data analysis platform, and switching is not easy if one has already built tools for another environment.

Some third-party R implementations, such as Revolution R and Renjin, have been built with memory management in mind to get around this problem. Revolution R has the disadvantage of costing money for commercial users (contact Revolution about academic use). Renjin runs on the Java virtual machine, but it has the disadvantage of not being fully compatible with all R packages. For users who routinely work with very large datasets (that is, a few gigabytes or more) and want to do it in R, it is probably best to use one of these implementations and accept either the out-of-pocket cost or the occasional compatibility problem.

For users who simply want to be able to deal with the occasional large dataset that comes their way, there are a couple of things to keep in mind. The first is that R is much more capable than it used to be: while 32-bit versions of R can only handle up to about 3 GB of memory, 64-bit versions can handle up to 8 TB of RAM. With RAM becoming relatively inexpensive, simply upgrading the amount of RAM in your computer and running a 64-bit version of R on a 64-bit operating system is an upgrade that any data analyst should consider if data size is an issue.
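
If you are not sure which build you are running, R can tell you directly. The following minimal check is added here as an illustration (it is not part of the original examples); the architecture string and the pointer size identify a 64-bit build:

> R.version$arch            # for example, "x86_64" on a 64-bit build
> .Machine$sizeof.pointer   # 8 on 64-bit R, 4 on 32-bit R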

Basic R memory commands

The amount of memory available to R can be obtained with the memory.limit command (this command is specific to Windows builds of R). It can also be used to increase the memory available by telling R to use a certain amount of virtual memory. Here, we see about 6 GB of memory available (the amount of RAM in the computer) and then tell R to use up to 8 GB, as follows:

> memory.limit()
[1] 5999
> memory.limit(8000)
[1] 8000
> memory.limit()
[1] 8000

If we close the R session and then reopen it, R will reset its memory limit to the default of the total available RAM, and will no longer have this virtual memory available to it.

Using virtual memory can significantly slow down R, so if memory use is a concern, it is best to do a couple of other things before resorting to this.

When performing large computations, the first thing to do is delete all unnecessary objects from memory with the rm() command. We can remove a particular object or all objects at once.

First, we create three reasonably large vectors and then tell R to delete one of them, as follows:

> A <- c(1:2E8)
> B <- c(1:2E8)
> C <- c(1:2E8)
> ls()
[1] "A" "B" "C"
> rm(A)
> ls()
[1] "B" "C"

If we want to clear all objects from memory, we can use the following code:

rm(list=ls()) 

Some people will point out that not all objects have really been deleted from memory, because garbage collection still has to happen. R will do this on its own, but if the user wants to force R's garbage collection to happen, this can be done with the gc() command.
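
As a quick illustration (this call is added here and is not part of the original example session), gc() both triggers the collection and prints a small table of Ncells and Vcells usage; the exact values will vary from session to session:

> gc()   # forces garbage collection and reports how much memory R is currently using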

If we want to look at how much memory is already in use, we can call R's memory.size command (also Windows-specific). Here, we recreate two of the objects and examine the amount of memory occupied, as follows:

> A <- c(1:2E8)
> B <- c(1:2E8)
> memory.size()
[1] 2333.96
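
To attribute memory use to individual objects rather than to the session as a whole, the standard object.size() function can also help; this example is an addition to the original text:

> object.size(A)                         # size of A in bytes
> format(object.size(A), units = "MB")   # the same value expressed in megabytes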

Handling R objects in memory

Regardless of dataset size, one easy mistake to make in R, one that will slow down the handling of even relatively small datasets, is constantly recreating an object. This is in effect what happens when an object is resized, and it is often what makes R's notoriously slow loops slower still.

For example, if we want to look at the NHANES data and, rather than keeping all five ordinal responses, condense them into binary responses, we can do this in a relatively straightforward manner in R. We will go through three different looping functions that do this in the following sections. The first and third functions dynamically resize the vector of interest; the second creates a vector of the appropriate size up front and then simply fills it with the appropriate values.

Quite possibly one of the worst ways to create a vector in R is as shown in the following code:

# load the NHANES physical functioning data, dropping the first column
physical.data <- read.csv('phys_func.txt')[-1]

condense.to.binary <- function(input.vector) {
  output.vector <- c()   # start with an empty vector
  a <- 0
  
  for (i in 1:length(input.vector)) {
    if (input.vector[i] == 1) {a <- 0}
    if (input.vector[i] > 1) {a <- 1}
    
    # appending forces R to copy and resize the vector on every iteration
    output.vector <- c(output.vector, a)
  }
  
  return(output.vector)
}

In the next function, we create the vector at its full length and fill it with placeholder values up front, then simply change one value of the vector with each iteration of the loop. No copying or resizing of the vector is needed, as shown in the following code:

condense.to.binary.2 <- function(input.vector) {
  # pre-allocate the output vector at its full length
  output.vector <- rep(NA, length(input.vector))
  a <- 0
  
  for (i in 1:length(input.vector)) {
    if (input.vector[i] == 1) {a <- 0}
    if (input.vector[i] > 1) {a <- 1}
    
    # overwrite an existing element; no copying or resizing is needed
    output.vector[i] <- a
  }
  
  return(output.vector)
}

In the third function, we declare the output vector at the start but leave it empty, and then use R's indexing to grow it implicitly. Since the vector still grows with each iteration, dynamic resizing is still required and the function will still run slowly, as shown in the following code:

condense.to.binary.3 <- function(input.vector) {
  output.vector <- c()   # again, start with an empty vector
  a <- 0
  
  for (i in 1:length(input.vector)) {
    if (input.vector[i] == 1) {a <- 0}
    if (input.vector[i] > 1) {a <- 1}
    
    # assigning past the current end still grows (and copies) the vector
    output.vector[i] <- a
  }
  
  return(output.vector)
}

We can use the system.time command to compare performance; the second method runs much faster than the other two. Here, we use a vector that repeats the first variable of the data frame 20 times just to exaggerate the impact of the difference in coding styles.

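A sketch of how that comparison can be run is shown below; the timing numbers themselves will depend on your machine:

> test.vector <- rep(physical.data[, 1], 20)
> system.time(condense.to.binary(test.vector))
> system.time(condense.to.binary.2(test.vector))
> system.time(condense.to.binary.3(test.vector))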

Dynamically resizing objects in R is a major bottleneck in code performance, because it requires memory reallocation of the entire vector in each iteration. Sometimes, it has to be done, but it should be done as rarely as possible.
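
As an aside that goes a step beyond the loops above, this particular recoding can also be done without any loop at all; assuming the responses take the values 1 through 5 as in the examples, a single vectorized comparison produces the same binary vector and sidesteps resizing entirely:

> output.vector <- as.numeric(test.vector > 1)   # 0 where the response is 1, 1 otherwise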
