Using Rcpp

Contrary to what I sometimes like to believe, there are computer programming languages other than R. R, along with languages like Python, Perl, and Ruby, is considered a high-level language because it offers a greater level of abstraction from computer representations and resource management than lower-level languages do. For example, in some lower-level languages, such as C, C++, and Fortran, you must specify the data type of the variables you create and manage the allocation of RAM manually.

The high level of abstraction R provides allows us to do amazing things very quickly—like import a data set, run a linear model, and plot the data and regression line in no more than 4 lines of code! On the other hand, nothing quite beats the performance of carefully crafted lower-level code. Even so, it would take hundreds of lines of code to run a linear model in a low-level language, so a language like that is inappropriate for agile analytics.

One solution is to use R abstractions when we can, and be able to get down to lower-level programming where it can really make a large difference. There are a few paths for connecting R and lower-level languages, but the easiest way by far is to combine R and C++ with Rcpp.

Note

There are differences in what is considered high-level. For this reason, you will sometimes see people and texts (mostly older texts) refer to C and C++ as high-level languages. The same people may consider R, Python, and so on to be very high-level languages. The level of a language is therefore somewhat relative.

A word of warning before we go on: This is an advanced topic, and this section will (out of necessity) gloss over some (most) of the finer details of C++ and Rcpp. If you're wondering whether a detailed reading will pay off, it's worth taking a peek at the conclusion of this section to see how many seconds it took to complete the average-distance-between-all-airports task that would have taken over 2 hours to complete unoptimized.

If you decide to continue, you must install a C++ compiler. On GNU/Linux this is usually done through the system's package manager. On Mac OS X, Xcode must be installed; it is available free in the App Store. For Windows, you must install the Rtools available at http://cran.r-project.org/bin/windows/Rtools/. Finally, all users need to install the Rcpp package. For more information, consult sections 1.2 and 1.3 of the Rcpp FAQ (http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf).

Essentially, our integration of R and C++ is going to take the form of us rewriting certain functions in C++ and calling them from R. Rcpp makes this very easy; before we discuss how to write C++ code, let's look at an example. Put the following code into a file, and name it our_cpp_functions.cpp:

#include <Rcpp.h>

// [[Rcpp::export]]
double square(double number){
    return(pow(number, 2));
}

Congratulations, you've just written a C++ program! Now, from R, we'll read the C++ file, and make the function available to R. Then, we'll test out our new function.

library(Rcpp)

sourceCpp("our_cpp_functions.cpp")

square(3)
--------------------------------
[1] 9

The first two lines with text have nothing to do with our function, per se. The first line is necessary for C++ to integrate with R. The second line (// [[Rcpp::export]]) tells R that we want the function directly below it to be available for use (exported) within R. Functions that aren't exported can only be used in the C++ file, internally.

Note

The // is a comment in C++, and it works just like # in R. C++ also has another type of comment that can span multiple lines. These multiline comments start with /* and end with */.

Throughout this section, we'll be adding functions to our_cpp_functions.cpp and re-sourcing the file from R to use the new C++ functions.

The modest square function above can teach us a lot about the differences between C++ code and R code. For example, it is roughly equivalent to the following in R:

square <- function(number){
  return(number^2)
}

The two occurrences of double denote that the return value and the argument, respectively, are both of data type double. double stands for double-precision floating-point number, which is roughly equivalent to R's more general numeric data type.

The second thing to notice is that we raise numbers to powers using the pow function, instead of using the ^ operator, like in R. This is a minor syntactical difference. The third thing to note is that each statement in C++ ends with a semicolon.
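
To see these differences in isolation, the same function can be compiled as plain C++ with no Rcpp involved. This is only a sketch: the name square_plain is ours (chosen so that it does not clash with the exported square), and std::pow comes from the standard <cmath> header. Note that C++ does have a ^ symbol, but it means bitwise exclusive-or on integers, so number^2 would not even compile for a double.

```cpp
#include <cmath>

// The return type comes first, then the function name,
// then each argument preceded by its own type.
double square_plain(double number) {
    // std::pow(base, exponent) takes the place of R's ^ operator,
    // and the statement ends with a semicolon.
    return std::pow(number, 2);
}
```

Calling square_plain(3.0) from other C++ code should give 9, just as square(3) does from R.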

Believe it or not, we now have enough knowledge to rewrite the to.radians function in C++.

/* Add this (and all other snippets that
   start with "// [[Rcpp::export]]") 
   to the C++ file, not the R code. */

// [[Rcpp::export]]
double to_radians_cpp(double degrees){
    return(degrees * 3.141593 / 180);
}

# this goes with our R code
sourceCpp("our_cpp_functions.cpp")
to_radians_cpp(10)
-------------------------
[1] 0.174533
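
The constant 3.141593 is pi rounded to six decimal places, which is plenty for this task. If more precision were wanted, one option (an assumption on our part, not something the rest of this section relies on) would be to compute pi from the standard library. The names kPi and to_radians_precise below are hypothetical, and the sketch is plain C++ without the Rcpp export line.

```cpp
#include <cmath>

// acos(-1) equals pi, so this gives pi to full double precision.
const double kPi = std::acos(-1.0);

// Same conversion as to_radians_cpp, with the more precise constant.
double to_radians_precise(double degrees) {
    return degrees * kPi / 180.0;
}
```

For 10 degrees, this agrees with the result above to the digits shown.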

Incredibly, with the help of some search-engine-fu or a good C++ reference, we can rewrite the whole haversine function in C++ as follows:

// [[Rcpp::export]]
double haversine_cpp(double lat1, double long1,
                     double lat2, double long2,
                     std::string unit="km"){
    int radius = 6378;
    double delta_phi = to_radians_cpp(lat2 - lat1);
    double delta_lambda = to_radians_cpp(long2 - long1);
    double phi1 = to_radians_cpp(lat1);
    double phi2 = to_radians_cpp(lat2);
    double term1 = pow(sin(delta_phi / 2), 2);
    double term2 = cos(phi1) * cos(phi2);
    term2 = term2 * pow(sin(delta_lambda/2), 2);
    double the_terms = term1 + term2;
    double delta_sigma = 2 * atan2(sqrt(the_terms),
                                   sqrt(1-the_terms));
    double distance = radius * delta_sigma;
 
    /* if it is anything *but* km it is miles */
    if(unit != "km"){
        return(distance*0.621371);
    }
 
    return(distance);
}

Now, let's re-source it, and test it...

sourceCpp("our_cpp_functions.cpp")
haversine(51.88, 176.65, 56.94, 154.18)
haversine_cpp(51.88, 176.65, 56.94, 154.18)
----------------------------------------------
[1] 1552.079
[1] 1552.079

Are you surprised to see that the R and C++ versions are so similar?

The only things that are unfamiliar in this new function are the following:

  • the int data type (which just holds an integer)
  • the std::string data type (which holds a string, or a character vector of length one, in R parlance)
  • the if statement (which is identical to R's)

Other than those things, this is just building upon what we've already learned with the first function.

Our last matter of business is to rewrite the single.core function in C++. To build up to that, let's first write a C++ function called sum2 that takes a numeric vector and returns the sum of all the numbers:

// [[Rcpp::export]]
double sum2(Rcpp::NumericVector a_vector){
    double running_sum = 0;
    int length = a_vector.size();
    for( int i = 0; i < length; i++ ){
        running_sum = running_sum + a_vector(i);
    }
    return(running_sum);
}

There are a few new things in this function:

  • We have to specify the data type of all the variables (including function arguments) in C++, but what's the data type of the R vector that we're to pass in to sum2? The #include directive at the top of the C++ file allows us to use the Rcpp::NumericVector data type (which does not exist in standard C++).
  • To get the length of a NumericVector (like we would in R with the length function), we use the .size() method.
  • The C++ for loop is a little different from its R counterpart. To wit, it takes three fields, separated by semicolons: the first field initializes a counter variable, the second field specifies the condition under which the for loop will continue (we'll stop iterating when our counter reaches the length of the vector), and the third is how we update the counter from iteration to iteration (i++ means add 1 to i). All in all, this for loop visits the same elements as a for loop in R that starts with for(i in 1:length), except that the counter starts at 0 rather than 1.
  • With Rcpp, we can subscript a vector using parentheses rather than brackets. (Brackets also work on an Rcpp vector, but parentheses are what we need to subscript a matrix by row and column, so we use them throughout.)

At every iteration, we use the counter as an index into the NumericVector to extract the current element, and we add that element to the running sum. When the loop ends, we return the running sum.

Please note before we go on that the first element of any vector in C++ is the 0th element, not the first. For example, the third element of a vector called victor is victor[3] in R, whereas it would be victor(2) in C++. This is why the second field of the for loop is i < length and not i <= length.
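
The zero-based loop can also be tried without Rcpp. The following is a sketch in standard C++, assuming std::vector<double> as a stand-in for Rcpp::NumericVector; the name sum_plain is ours. Unlike the Rcpp types, a std::vector is subscripted with square brackets.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ counterpart of sum2: add up every element of a vector.
double sum_plain(const std::vector<double>& a_vector) {
    double running_sum = 0;
    std::size_t length = a_vector.size();
    // Valid indices run from 0 to length - 1, hence i < length.
    for (std::size_t i = 0; i < length; i++) {
        running_sum = running_sum + a_vector[i];
    }
    return running_sum;
}
```

Had the condition been i <= length, the last iteration would read one position past the end of the vector, which is an error in C++ rather than a harmless NA as in R.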

Now, we're finally ready to rewrite the single.core function from the last section in C++!

// [[Rcpp::export]]
double single_core_cpp(Rcpp::NumericMatrix mat){
    int nrows = mat.nrow();
    int numcomps = nrows*(nrows-1)/2;
    double running_sum = 0;
    for( int i = 0; i < nrows; i++ ){
        for( int j = i+1; j < nrows; j++){
            double this_dist = haversine_cpp(mat(i,0), mat(i,1),
                                             mat(j,0), mat(j,1));

            running_sum = running_sum + this_dist;
        }
    }
    return running_sum / numcomps;
}

Nothing here should be too new. The only two new components are that we are taking a new data type, an Rcpp::NumericMatrix, as an argument, and that we are using .nrow() to get the number of rows in a matrix.
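
It may help to check that the nested loops really visit nrows*(nrows-1)/2 pairs. The sketch below is plain C++ with hypothetical names (count_pairs and pairs_formula); it counts the loop iterations directly and compares them with the closed form. It uses 64-bit integers as a precaution that single_core_cpp itself does not take.

```cpp
#include <cstdint>

// Count how many (i, j) pairs the nested loops in single_core_cpp visit.
// Because j starts at i + 1, each unordered pair is visited exactly once
// and no row is ever compared with itself.
std::int64_t count_pairs(int nrows) {
    std::int64_t visited = 0;
    for (int i = 0; i < nrows; i++) {
        for (int j = i + 1; j < nrows; j++) {
            visited++;
        }
    }
    return visited;
}

// Closed form for the same count, computed in 64-bit arithmetic.
std::int64_t pairs_formula(std::int64_t nrows) {
    return nrows * (nrows - 1) / 2;
}
```

For 4 rows, both give 6 pairs; for 13,429 rows, the formula gives 90,162,306. On platforms where int is 32 bits, the product nrows*(nrows-1) in single_core_cpp would overflow beyond roughly 46,000 rows, which is comfortably more than we need here.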

Let's try it out! When we used the R function single.core, we called it with the whole airport data.frame as an argument. But since the C++ function takes a matrix of latitude/longitude pairs, we will simply drop the first column (holding the airport name) from the airport.locs data frame, and convert what's left into a matrix.

sourceCpp("our_cpp_functions.cpp")
the.matrix <- as.matrix(airport.locs[,-1])
system.time(ave.dist <- single_core_cpp(the.matrix))
print(ave.dist)
----------------------------------------
   user  system elapsed 
  0.012   0.000   0.012 
 [1] 1667.186

Okay, the task that used to take 5.5 seconds now takes a little over one hundredth of a second (and the outputs match, to boot)! Astoundingly, we can now perform the task on all 13,429 airports quite easily:

the.matrix <- as.matrix(all.airport.locs[,-1])
system.time(ave.dist <- single_core_cpp(the.matrix))
print(ave.dist)
-------------------------------
   user  system elapsed 
 12.310   0.080  12.505 
 [1] 1869.744

Using Rcpp, it takes a mere 12.5 seconds to calculate and average 90,162,306 distances, a feat that would have taken even a 16-core server 17 minutes to complete.
