In a blog post I wrote showcasing the performance of this task under various optimization methods, I took it for granted that calculating the distances on the full dataset with the unparallelized, un-Rcpp-ed code would be a multi-hour affair. I was seriously mistaken.
Shortly after the post was published, a clever R programmer commented that they were able to slightly rework the code so that the serial, pure-R version completed in under 20 seconds on all 13,429 observations. How? Vectorization.
single.core.improved <- function(airport.locs){
  numrows <- nrow(airport.locs)
  running.sum <- 0
  for(i in 1:(numrows-1)){
    this.dist <- sum(haversine(airport.locs[i, 2],
                               airport.locs[i, 3],
                               airport.locs[(i+1):numrows, 2],
                               airport.locs[(i+1):numrows, 3]))
    running.sum <- running.sum + this.dist
  }
  return(running.sum / (numrows*(numrows-1)/2))
}

system.time(ave.dist <- single.core.improved(all.airport.locs))
print(ave.dist)
------------------------------------------------------------------
   user  system elapsed
 15.537   0.173  15.866
[1] 1869.744
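A side note on the last line of the function: because the loop only sums distances from each row to the rows below it, every unordered pair of airports is counted exactly once, so the divisor numrows*(numrows-1)/2 is simply the pair count. A quick sanity check (using the dataset's row count as a hypothetical literal):

```r
# the loop covers each unordered pair (i, j) with i < j exactly once,
# so the divisor is the number of 2-element subsets of the rows
numrows <- 13429
stopifnot(numrows * (numrows - 1) / 2 == choose(numrows, 2))
choose(numrows, 2)
# [1] 90162306
```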
Not even 16 seconds. It's worth following what this code is doing.
There is only one for loop, which makes its rounds down the rows of the airport.locs data frame. On each iteration of the for loop, it calls the haversine function just once. The first two arguments are the latitude and longitude of the row that the loop is on. The third and fourth arguments, however, are the vectors of latitudes and longitudes of all the rows below the current one. This returns a vector of all the distances from the current airport to every airport below it in the dataset. Since the haversine function can take vectors just as easily as single numbers, there is no need for a second for loop.
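To make the "already vectorized" point concrete, here is a minimal sketch of a haversine implementation in the argument order the code above uses (latitude first, then longitude); the book's own to.radians and haversine definitions, and its choice of earth radius, may differ. Because it is built entirely from R's vectorized math functions (sin, cos, atan2), passing vectors for the last two arguments yields a vector of distances in a single call:

```r
to.radians <- function(degrees) degrees * pi / 180

haversine <- function(lat1, long1, lat2, long2){
  radius <- 6371                       # mean earth radius in km (assumption)
  delta.phi    <- to.radians(lat2 - lat1)
  delta.lambda <- to.radians(long2 - long1)
  phi1 <- to.radians(lat1)
  phi2 <- to.radians(lat2)
  # every operation below is vectorized, so the whole function is, too
  a <- sin(delta.phi/2)^2 + cos(phi1) * cos(phi2) * sin(delta.lambda/2)^2
  2 * radius * atan2(sqrt(a), sqrt(1 - a))
}

# one call, two distances: JFK to LAX and JFK to ORD (hypothetical coordinates)
haversine(40.64, -73.78, c(33.94, 41.98), c(-118.41, -87.90))
# roughly 3975 km and 1187 km
```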
So the haversine function was already vectorized; I just didn't realize it. You'd think that this would be embarrassing for someone who professes to know enough about R to write a book about it. Perhaps it should be. But I found that one of the best ways to learn, especially about code optimization, is through experimentation and making mistakes.
For example, when I started learning about writing high-performance R code for both fun and profit, I made quite a few mistakes. One of my first blunders, or failed experiments, was with this very task: when I first learned about Rcpp, I used it to translate only the to.radians and haversine functions. Leaving the loop in R gave only a slight performance edge, nothing compared to the 12.5 seconds we achieved together. Now I know that the bulk of the performance degradation was due to the millions of function calls to haversine, not the actual computation inside the haversine function. You'll learn that lesson, and others like it, most effectively by continuing to try and mess up on your own.
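That lesson about call overhead is easy to see for yourself with a hypothetical micro-benchmark. Using sin() as a stand-in for any vectorized function, the same million evaluations are done once as a single vectorized call and once as a million scalar calls inside a loop; the arithmetic is identical, only the number of R function calls differs:

```r
x <- runif(1e6)

# one vectorized call: a single trip into compiled code
vec.time <- system.time(s1 <- sum(sin(x)))["elapsed"]

# one call per element: a million trips through R's function-call machinery
loop.time <- system.time({
  s2 <- 0
  for (xi in x) s2 <- s2 + sin(xi)
})["elapsed"]

all.equal(s1, s2)   # same answer (up to floating-point accumulation order)
c(vectorized = vec.time, looped = loop.time)
```

On any machine I've tried, the looped version is slower by a wide margin, and essentially all of that gap is per-call and per-iteration overhead rather than the trigonometry itself.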
The moral of the story: when you think you've vectorized your code enough, find someone smarter than you to tell you that you're wrong.