CHAPTER 18


High-Performance Computing

In this chapter, we introduce high-performance computing. Broadly, computing may be slow because we have larger datasets and/or because we are doing more computations. We will talk a little about ways to deal with both in R. When people say high-performance computing or big data, they can mean very different things. In this chapter we will not discuss processing terabytes of data or analyses that require large clusters. Instead, we only assume a fairly standard desktop or laptop with at least two cores. Note that some of the code examples in this chapter may take some time to run. That is deliberate, so it is both more realistic and starts to convey the “feel” of larger data. Making a typo is far more painful when it takes minutes or hours to get your result, only to find it is wrong and you have to run everything again!

18.1 Data

First, we will discuss working with larger datasets. The R package nycflights13 (Wickham, 2014) has some datasets with a few hundred thousand observations.

> install.packages("nycflights13")
> install.packages("iterators")
> library(nycflights13)
> library(iterators)
> head(flights)
  year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
1 2013     1   1      517         2      830        11      UA  N14228   1545
2 2013     1   1      533         4      850        20      UA  N24211   1714
3 2013     1   1      542         2      923        33      AA  N619AA   1141
4 2013     1   1      544        -1     1004       -18      B6  N804JB    725
5 2013     1   1      554        -6      812       -25      DL  N668DN    461
6 2013     1   1      554        -4      740        12      UA  N39463   1696
  origin dest air_time distance hour minute
1    EWR  IAH      227     1400    5     17
2    LGA  IAH      227     1416    5     33
3    JFK  MIA      160     1089    5     42
4    JFK  BQN      183     1576    5     44
5    LGA  ATL      116      762    5     54
6    EWR  ORD      150      719    5     54

Now suppose that we wanted to create a new variable in the data that was the standard deviation of arrival time delays by destination airport. We could use the ave() function. It performs an operation (such as calculating the standard deviation) by an index. For example, suppose we had some data from three different groups.

Group 1   Group 2   Group 3
      1         4         7
      2         5         8
      3         6         9

If these data were reshaped into “long” form, they would look like the table that follows, where all the values are in one column and another column indicates which group each value belongs to.

Value   Group
    1       1
    2       1
    3       1
    4       2
    5       2
    6       2
    7       3
    8       3
    9       3

Now, what if we wanted to calculate something by each group? For example, we could calculate the means per group, which are 2, 5, and 8 for groups 1, 2, and 3, respectively. However, if we want the mean per group to be a new variable in the dataset, we need to fill in the mean of group 1 for every row belonging to group 1, the mean of group 2 for every row belonging to group 2, and so on. In other words, the mean (or whatever we calculate) needs to be repeated as many times as there are rows in the group it came from. That is exactly what ave() does, and a simple example follows. The first argument is the data to use for the calculation, the second argument is the index variable (in our previous example, group), and the third argument is the function to apply. You can see in the output that the means, 2, 5, and 8, have each been repeated the same number of times as their group appears in the index variable.

> ave(1:9, c(1, 1, 1, 2, 2, 2, 3, 3, 3), FUN = mean)
[1] 2 2 2 5 5 5 8 8 8

Following is one way we may try to accomplish our goal in the flights data. We will make use of the system.time() function throughout this chapter to examine how long it takes to run different pieces of code, noting that these results are not intended to indicate how long it will take on your machine or your data but to compare different coding approaches. The results are in seconds, and we will focus on the elapsed time.

system.time(
+ flights <- within(flights, {
+   ArrDelaySD <- ave(arr_delay, dest, FUN = function(x) sd(x, na.rm = TRUE))
+ })
+ )
   user  system elapsed
   0.05    0.01    0.06

What about just finding the mean arrival delay for flights departing in the first half of the year?

system.time(
+   mean(subset(flights, month < 7)$arr_delay)
+ )
   user  system elapsed
   0.22    0.00    0.22

It does not take long in this case, but it is starting to be noticeable, and we are still dealing with fairly small data. Often, great performance improvements can be had just by using optimized code. The data.table package provides an alternative to data frames that makes fewer copies and can be much faster. We are using version 1.9.5 of the data.table package in this chapter, as there have been some recent advances. As of this writing, 1.9.5 is the development version, which is only available from GitHub; thanks to the devtools package and its install_github() function, it is easy to install the latest development version of any R package hosted on GitHub. Although it is generally safer to use packages from CRAN, as they are more stable, as you start to push the limits of R and come closer to the cutting edge, it is helpful to be able to install from GitHub, where many R package developers host the development source code of their packages. If you are on Windows or Mac and have not already done so, now is a good time to get the development tools. For Windows, you can download them from https://cran.r-project.org/bin/windows/Rtools/. For Mac, all you need to do is install Xcode.

> install.packages("devtools")
> library(devtools)
> install_github("Rdatatable/data.table")
Downloading github repo Rdatatable/data.table@master
Installing data.table
## Much more output here about the package being compiled
mv data.table.dll datatable.dll
installing to c:/usr/R/R-3.2.2/library/data.table/libs/x64
** R
** inst
** tests
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (data.table)

Now, we can load the data.table package and convert the flights data into a data.table object, and then try out the same calculations listed previously.

> library(data.table)
data.table 1.9.5  For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.

> flights2 <- as.data.table(flights)

> system.time(
+ flights2[, ArrDelaySD := sd(arr_delay, na.rm = TRUE), by = dest]
+ )
   user  system elapsed
   0.01    0.00    0.02
> all.equal(flights2$ArrDelaySD, flights$ArrDelaySD)
[1] TRUE

> system.time(
+ mean(flights2[month < 7]$arr_delay)
+ )
   user  system elapsed
   0.03    0.02    0.04

Using data.table gives identical results, and the code is much faster. Although the data.table package does not use multiple cores, it is highly optimized for speed and also tries to reduce memory usage. In regular R data frames, many common operations result in copies being made. In data.table, objects are often modified in place, meaning that, where needed, the data are changed directly in memory rather than a whole new copy being made. The end result is that basic data manipulation operations may work several times faster using data.table than using a regular data frame. Next, we will explore how to use data.table in more detail, as its syntax differs from that of data frames in some important ways.
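
Before that, as a rough illustration of in-place modification (a sketch that is not part of the original analysis; the objects df and dt are made up for this purpose), tracemem() reports whenever R copies an object. Adding a column to a data frame typically triggers a copy, whereas := on a data.table does not.

> ## a sketch: compare copying behavior of a data frame and a data.table
> df <- data.frame(x = rnorm(1e6))
> tracemem(df)
> df$y <- df$x * 2   ## tracemem() prints a message: the data frame was copied
> dt <- data.table(x = rnorm(1e6))
> tracemem(dt)
> dt[, y := x * 2]   ## no message: the column is added in place
> untracemem(df)
> untracemem(dt)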

To start, we can see what happens if we just type the data.table object in R.

> flights2
        year month day dep_time dep_delay arr_time arr_delay carrier tailnum
     1: 2013     1   1      517         2      830        11      UA  N14228
     2: 2013     1   1      533         4      850        20      UA  N24211
     3: 2013     1   1      542         2      923        33      AA  N619AA
     4: 2013     1   1      544        -1     1004       -18      B6  N804JB
     5: 2013     1   1      554        -6      812       -25      DL  N668DN
    ---
336772: 2013     9  30       NA        NA       NA        NA      9E
336773: 2013     9  30       NA        NA       NA        NA      9E
336774: 2013     9  30       NA        NA       NA        NA      MQ  N535MQ
336775: 2013     9  30       NA        NA       NA        NA      MQ  N511MQ
336776: 2013     9  30       NA        NA       NA        NA      MQ  N839MQ
        flight origin dest air_time distance hour minute ArrDelaySD
     1:   1545    EWR  IAH      227     1400    5     17   41.00647
     2:   1714    LGA  IAH      227     1416    5     33   41.00647
     3:   1141    JFK  MIA      160     1089    5     42   41.29391
     4:    725    JFK  BQN      183     1576    5     44   34.45790
     5:    461    LGA  ATL      116      762    5     54   46.96864
    ---
336772:   3393    JFK  DCA       NA      213   NA     NA   39.91506
336773:   3525    LGA  SYR       NA      198   NA     NA   41.84991
336774:   3461    LGA  BNA       NA      764   NA     NA   48.34005
336775:   3572    LGA  CLE       NA      419   NA     NA   45.83643
336776:   3531    LGA  RDU       NA      431   NA     NA   42.26542

The data.table object has printing methods that automatically show the first few and last few rows, rather than printing everything. Within the brackets of a data.table, names are assumed to refer to variables in the dataset, so you often do not need to reference the dataset explicitly or use quotes. For example, to select all rows of the data where the carrier was Delta, we type the following:

> flights2[carrier == "DL"]
## output omitted

This is compared with how we would do the same thing in base R:

> head(flights[flights$carrier == "DL", ])
> head(subset(flights, carrier == "DL"))
## output omitted

We can select columns in a similar way. For example, to tabulate the destinations of Delta flights, we can type the following:

> table(flights2[carrier == "DL", dest])

  ATL   AUS   BNA   BOS   BUF   CVG   DCA   DEN   DTW   EYW   FLL   IND   JAC
10571   357     1   972     3     4     2  1043  3875    17  2903     2     2
  JAX   LAS   LAX   MCI   MCO   MEM   MIA   MSP   MSY   OMA   PBI   PDX   PHL
    1  1673  2501    82  3663   432  2929  2864  1129     1  1466   458     2
  PHX   PIT   PWM   RSW   SAN   SAT   SEA   SFO   SJU   SLC   SRQ   STL   STT
  469   250   235   426   575   303  1213  1858  1301  2102   265     1    30
  TPA
 2129

To create a new variable, we can use the := syntax. For example, we can create a new variable encoding the difference between departing and arrival delays (a similar approach could be used to create other scores, such as the sum of correct answers on a test).

> flights2[, NewVariable := dep_delay - arr_delay]
> colnames(flights2)
 [1] "year"        "month"       "day"         "dep_time"    "dep_delay"
 [6] "arr_time"    "arr_delay"   "carrier"     "tailnum"     "flight"
[11] "origin"      "dest"        "air_time"    "distance"    "hour"
[16] "minute"      "ArrDelaySD"  "NewVariable"

We can also make a new variable by recoding an existing variable. Here we will overwrite NewVariable. Suppose we consider delays greater than two hours to be true delays and anything else just variation around normal (i.e., no delay). We can overwrite the variable using ifelse() to encode delays greater than 120 minutes as “Delayed” and everything else as “No Delay”. We can then count the delays and non-delays using .N, which returns the number of rows (within each group when used with by); doing this by NewVariable effectively counts, much like table().

> flights2[, NewVariable := ifelse(arr_delay > 120, "Delayed", "No Delay")]
> flights2[, .N, by = NewVariable]
   NewVariable      N
1:    No Delay 317312
2:     Delayed  10034
3:          NA   9430

Here we can see that data.table also listed the number of missing values. Another useful way to use .N is as an index. This can be particularly powerful combined with order(). For example, suppose we wanted to see the least and most delayed flight arrivals by whether they meet our definition of delayed (> 120 minutes) or not. The example that follows orders by our NewVariable and then by arrival delay, and then gets the first two and last two rows for arrival delay by NewVariable. Note that 1:0 expands into c(1, 0), so .N - 1:0 gives the last two rows; we could write .N - 4:0 if we wanted the last five, making it easy to get however many first or last values we want.

> flights2[order(NewVariable, arr_delay), arr_delay[c(1:2, .N - 1:0)], by = NewVariable]
    NewVariable   V1
 1:     Delayed  121
 2:     Delayed  121
 3:     Delayed 1127
 4:     Delayed 1272
 5:    No Delay  -86
 6:    No Delay  -79
 7:    No Delay  120
 8:    No Delay  120
 9:          NA   NA
10:          NA   NA
11:          NA   NA
12:          NA   NA

Another common operation is dropping a variable. To remove a variable, simply set it to NULL. In data.table this is a very fast operation as no copy of the dataset is made, unlike in base R where the data are essentially copied without that variable.

> flights2[, NewVariable := NULL]
> colnames(flights2)
 [1] "year"       "month"      "day"        "dep_time"   "dep_delay"
 [6] "arr_time"   "arr_delay"  "carrier"    "tailnum"    "flight"
[11] "origin"     "dest"       "air_time"   "distance"   "hour"
[16] "minute"     "ArrDelaySD"

If only certain rows of the data are selected when a variable is created, the rest will be missing.

> flights2[carrier == "DL", NewVariable := "Test"]
> table(is.na(flights2[carrier == "DL", NewVariable]))

FALSE
48110
> table(is.na(flights2[carrier != "DL", NewVariable]))

  TRUE
288666
> flights2[, NewVariable := NULL]

data.table also has a very powerful and flexible way of performing operations by a variable in the dataset (e.g., getting the mean delay by month of the year).

> flights2[, mean(arr_delay, na.rm=TRUE), by = month]
    month         V1
 1:     1  6.1299720
 2:    10 -0.1670627
 3:    11  0.4613474
 4:    12 14.8703553
 5:     2  5.6130194
 6:     3  5.8075765
 7:     4 11.1760630
 8:     5  3.5215088
 9:     6 16.4813296
10:     7 16.7113067
11:     8  6.0406524
12:     9 -4.0183636

It is even easy to make multiple summary variables by another variable.

> flights2[, .(M = mean(arr_delay, na.rm=TRUE),
+              SD = sd(arr_delay, na.rm=TRUE)),
+          by = month]
    month          M       SD
 1:     1  6.1299720 40.42390
 2:    10 -0.1670627 32.64986
 3:    11  0.4613474 31.38741
 4:    12 14.8703553 46.13311
 5:     2  5.6130194 39.52862
 6:     3  5.8075765 44.11919
 7:     4 11.1760630 47.49115
 8:     5  3.5215088 44.23761
 9:     6 16.4813296 56.13087
10:     7 16.7113067 57.11709
11:     8  6.0406524 42.59514
12:     9 -4.0183636 39.71031

Or to do so by multiple variables, such as by month and by destination.

> flights2[, .(M = mean(arr_delay, na.rm=TRUE),
+              SD = sd(arr_delay, na.rm=TRUE)),
+          by = .(month, dest)]
      month dest           M       SD
   1:     1  IAH   4.1627907 33.74079
   2:     1  MIA  -2.1506148 32.42194
   3:     1  BQN   2.6451613 30.01545
   4:     1  ATL   4.1520468 34.17429
   5:     1  ORD   7.2876936 47.88168
  ---
1109:     9  TYS -14.0425532 30.62605
1110:     9  BHM  -0.2727273 49.71172
1111:     9  ALB -11.3684211 17.26657
1112:     9  CHO  10.2105263 40.62098
1113:     9  ILM  -7.9000000 26.79140

Notice that data.table also automatically includes the grouping variables (here month and dest) in the results, so we know which group each mean and standard deviation applies to. Sometimes what we want to group by is not a variable in the dataset directly. For example, we could compare mean flight delays between the colder months (September through March) and the warmer months. Operations can be done directly within the by statement.

> flights2[, .(M = mean(arr_delay, na.rm=TRUE),
+              SD = sd(arr_delay, na.rm=TRUE)),
+          by = .(Winter = month %in% c(9:12, 1:3))]
   Winter         M       SD
1:   TRUE  4.038362 39.83271
2:  FALSE 10.727385 50.10366

To include our summary variables in the original dataset rather than in a new summarized dataset, we use the := operator again. Here we also show how to create multiple new variables at once, rather than having to create one new variable at a time. Note that the values are recycled to fill every row of the group they belong to (in the code that follows, we can see the mean is the same for all rows of month 1).

> flights2[, c("MonthDelayM", "MonthDelaySD") := .(
+   mean(arr_delay, na.rm=TRUE),
+   sd(arr_delay, na.rm=TRUE)), by = month]
> ## view results
> flights2[, .(month, MonthDelayM, MonthDelaySD)]
        month MonthDelayM MonthDelaySD
     1:     1    6.129972     40.42390
     2:     1    6.129972     40.42390
     3:     1    6.129972     40.42390
     4:     1    6.129972     40.42390
     5:     1    6.129972     40.42390
    ---
336772:     9   -4.018364     39.71031
336773:     9   -4.018364     39.71031
336774:     9   -4.018364     39.71031
336775:     9   -4.018364     39.71031
336776:     9   -4.018364     39.71031

If there is a key, such as an ID or some other variable we will often be summarizing others by, we can use the setkey() function which sorts and indexes the data, making operations involving the key variable much faster. For example, we can set month as the key.

> setkey(flights2, month)

Once the key is set, we can refer to it using the J() operator. For example, to get months 3 to 7, we can type the following:

> system.time(flights2[J(3:7)])
   user  system elapsed
   0.01    0.00    0.01

which is much faster than the equivalent in base R.

> system.time(subset(flights, month %in% 3:7))
   user  system elapsed
   0.16    0.05    0.20

Here we can see that data.table has a tremendous speed advantage (admittedly, it takes a bit of time to set the key in the first place, but for repeated use, that is a one-time cost). It might seem difficult to use a whole new type of data structure, but because data.table inherits from data frame, most functions that work on a data frame will work on a data.table object. If they are designed for it, they may be much faster. If not, at least they will still work as well as for a regular data frame. For example, in a regular linear regression:

> summary(lm(arr_delay ~ dep_time, data = flights2))

Call:
lm(formula = arr_delay ~ dep_time, data = flights2)

Residuals:
    Min      1Q  Median      3Q     Max
-100.67  -23.21   -8.76    9.07 1280.13

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.174e+01  2.229e-01  -97.55   <2e-16 ***
dep_time     2.123e-02  1.554e-04  136.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.41 on 327344 degrees of freedom
  (9430 observations deleted due to missingness)
Multiple R-squared:  0.05397,   Adjusted R-squared:  0.05396
F-statistic: 1.867e+04 on 1 and 327344 DF,  p-value: < 2.2e-16

Operations can also be paired with subsetting. For example, earlier we saw how to use the .N convenience function. Now we count how many flights in each month were delayed by more than 12 hours. Here there are only ten rows because some months had zero flights delayed by more than 12 hours. April and June appear to have been particularly bad months.

> flights2[arr_delay > 60*12, .N, by = month]
    month N
 1:     1 3
 2:     2 4
 3:     3 2
 4:     4 5
 5:     5 2
 6:     6 5
 7:     7 2
 8:     9 1
 9:    11 1
10:    12 4

Another dataset, airlines, has the full names of each carrier. We can merge it with the flights dataset to obtain detailed carrier names (first converting the airlines data into a data.table object). This merge (or join) is fast because the data are already ordered by the key following the call to setkey(). Note that it is also possible to do nested joins (e.g., Dataset1[Dataset2[Dataset3]]); as long as the datasets all use the same keys, they will be evaluated from the innermost outward. For joins, the nomatch argument controls what happens when no match can be found: unmatched rows are either filled with missing values (NA), which is the default, or dropped if nomatch = 0 is specified. The documentation in ?data.table has more details.

> airlines2 <- as.data.table(airlines)
> setkey(airlines2, carrier)
> setkey(flights2, carrier)

> ## join the data.tables by their key
> flights3 <- flights2[airlines2]

> ## view just three variables
> flights3[, .(year, carrier, name)]
        year carrier               name
     1: 2013      9E  Endeavor Air Inc.
     2: 2013      9E  Endeavor Air Inc.
     3: 2013      9E  Endeavor Air Inc.
     4: 2013      9E  Endeavor Air Inc.
     5: 2013      9E  Endeavor Air Inc.
    ---
336772: 2013      YV Mesa Airlines Inc.
336773: 2013      YV Mesa Airlines Inc.
336774: 2013      YV Mesa Airlines Inc.
336775: 2013      YV Mesa Airlines Inc.
336776: 2013      YV Mesa Airlines Inc.
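
As a brief illustration of the nomatch argument just described (a sketch; in these data every carrier matches, so both calls return the same rows and the output is omitted):

> ## drop rows with no match instead of filling them with NA
> flights2[airlines2, nomatch = 0]
> ## the default keeps unmatched rows and fills them with NA
> flights2[airlines2, nomatch = NA]
> ## output omitted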

Joins can also be done by multiple keys. For example, another dataset, weather, has weather data by month, day, and airport. To join this with the flights data, we will set month, day, and origin airport as keys. Of course, we have to convert the weather data frame to a data.table object first.

> weather2 <- as.data.table(weather)
> weather2
      origin year month day hour  temp  dewp humid wind_dir wind_speed
   1:    EWR 2013    NA  NA   NA 44.96 17.96 33.55       20    3.45234
   2:    EWR 2013     1   1    0 37.04 21.92 53.97      230   10.35702
   3:    EWR 2013     1   1    1 37.04 21.92 53.97      230   13.80936
   4:    EWR 2013     1   1    2 37.94 21.92 52.09      230   12.65858
   5:    EWR 2013     1   1    3 37.94 23.00 54.51      230   13.80936
  ---
8715:    EWR 2013    12  30   19 37.04 21.02 51.95      320   17.26170
8716:    EWR 2013    12  30   20 35.06 17.96 49.30      340   17.26170
8717:    EWR 2013    12  30   21 33.08 15.98 48.98      320   14.96014
8718:    EWR 2013    12  30   22 30.92 12.92 46.74      340   16.11092
8719:    EWR 2013    12  30   23 28.94 12.02 48.69      330   14.96014
      wind_gust precip pressure visib
   1:  3.972884      0   1025.9    10
   2: 11.918651      0   1013.9    10
   3: 15.891535      0   1013.0    10
   4: 14.567241      0   1012.6    10
   5: 15.891535      0   1012.7    10
  ---
8715: 19.864419      0   1017.6    10
8716: 19.864419      0   1019.1    10
8717: 17.215830      0   1019.8    10
8718: 18.540125      0   1020.5    10
8719: 17.215830      0   1021.1    10

> setkey(flights2, month, day, origin)
> setkey(weather2, month, day, origin)

Because the weather data are hourly, we need to collapse them somehow before we can join; we will take the daily mean. One way to do this is simply to write out each column we care about.

> weather2b <- weather2[, .(temp = mean(temp, na.rm=TRUE),
+                          precip = mean(precip, na.rm=TRUE),
+                          visib = mean(visib, na.rm=TRUE)),
+                          by = .(month, day, origin)]
> weather2b
     month day origin    temp     precip     visib
  1:    NA  NA    EWR 44.9600 0.00000000 10.000000
  2:     1   1    EWR 38.4800 0.00000000 10.000000
  3:     1   2    EWR 28.8350 0.00000000 10.000000
  4:     1   3    EWR 29.4575 0.00000000 10.000000
  5:     1   4    EWR 33.4775 0.00000000 10.000000
 ---
369:    12  26    EWR 31.0475 0.00000000  9.541667
370:    12  27    EWR 34.2425 0.00000000 10.000000
371:    12  28    EWR 39.1550 0.00000000 10.000000
372:    12  29    EWR 43.0475 0.03291667  7.947917
373:    12  30    EWR 38.9000 0.00000000 10.000000

However, writing out each column or variable name becomes time-consuming when there are many columns. Fortunately, there is a way around this: .SD, another special symbol that refers to the subset of data for each group (all of the columns except those being grouped by). We can use it to take the mean of every remaining column.

> weather2c <- weather2[, lapply(.SD, mean, na.rm=TRUE),
+                          by = .(month, day, origin)]
> weather2c
     month day origin year     hour    temp     dewp    humid wind_dir
  1:    NA  NA    EWR 2013      NaN 44.9600 17.96000 33.55000  20.0000
  2:     1   1    EWR 2013 11.78261 38.4800 25.05043 58.38609 263.0435
  3:     1   2    EWR 2013 11.50000 28.8350 11.38250 47.78625 307.9167
  4:     1   3    EWR 2013 11.50000 29.4575 14.78000 54.39583 276.9565
  5:     1   4    EWR 2013 11.50000 33.4775 19.20500 55.88042 242.9167
 ---
369:    12  26    EWR 2013 11.50000 31.0475 19.04750 60.90417 153.7500
370:    12  27    EWR 2013 11.50000 34.2425 19.87250 56.68750 253.7500
371:    12  28    EWR 2013 11.50000 39.1550 23.00750 54.89750 222.9167
372:    12  29    EWR 2013 11.50000 43.0475 32.33000 67.60208 166.5217
373:    12  30    EWR 2013 11.50000 38.9000 30.71750 73.83875 280.8333
     wind_speed wind_gust     precip pressure     visib
  1:   3.452340  3.972884 0.00000000 1025.900 10.000000
  2:  12.758648 14.682397 0.00000000 1012.443 10.000000
  3:  12.514732 14.401704 0.00000000 1017.337 10.000000
  4:   7.863663  9.049346 0.00000000 1021.058 10.000000
  5:  13.857309 15.946714 0.00000000 1017.533 10.000000
 ---
369:   5.801849  6.676652 0.00000000 1027.129  9.541667
370:   8.343155  9.601136 0.00000000 1026.475 10.000000
371:   8.822647 10.152925 0.00000000 1023.117 10.000000
372:   8.103409  9.325241 0.03291667 1014.595  7.947917
373:  12.035241 13.849914 0.00000000 1012.541 10.000000
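
If we only want the means of particular columns, the .SDcols argument restricts which columns .SD contains. A brief sketch (output omitted) that reproduces the three-column summary from before:

> weather2[, lapply(.SD, mean, na.rm=TRUE),
+          by = .(month, day, origin),
+          .SDcols = c("temp", "precip", "visib")]
> ## output omitted; matches weather2b from before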

Now we are ready to join the datasets. We can see that we end up with the same number of rows but have now added additional columns for the weather data.

> flights4 <- weather2c[flights2]

> dim(flights2)
[1] 336776     17
> dim(flights4)
[1] 336776     28

Finally, in data.table, almost any operation can be done within the middle argument, even functions that are called for their side effects, not what they return. In the code that follows we calculate the regression of arrival delay on visibility and we do this by carrier (airline).

> flights4[, as.list(coef(lm(arr_delay ~ visib))), by = carrier]
    carrier (Intercept)     visib
 1:      AA    46.58181 -4.859830
 2:      AS    51.46830 -6.645643
 3:      B6    54.26659 -4.758500
 4:      DL    49.36366 -4.505132
 5:      EV    78.28173 -6.600792
 6:      MQ    52.29302 -3.868518
 7:      UA    35.47410 -3.463089
 8:      US    38.34697 -4.007031
 9:      WN    65.21767 -5.847156
10:      9E    45.61693 -4.690411
11:      HA   -29.45361  2.268041
12:      VX    25.30893 -2.789938
13:      F9    12.50000        NA
14:      FL    97.11111        NA
15:      YV     4.00000        NA
16:      OO   -88.79863 11.939863

We can even make plots. To display multiple plots on one device, we use the par() function. Of course, as we saw in Chapter 17, it would be easy to do this in ggplot2; this is just an example of the different operations you can perform in a data.table object by other variables. Because the plot() function creates a plot but returns no data, we end up with an empty data.table object.

> par(mfrow = c(4, 3))
> flights4[, plot(density(na.omit(arr_delay)), main = "Arrival Delay", xlab = "", ylab = "Density"), by = month]
Empty data.table (0 rows) of 1 col: month

(Figure: density plots of arrival delay, one panel per month.)

So far we have explored the data.table package, which allows for faster and somewhat more memory-efficient data management in R. However, it still requires the data to be loaded into memory. Many larger datasets will be too big for memory. For data that cannot be loaded into memory, there are a few options. The ff package uses flat files stored on disk and links them to R. The dplyr package also supports linking to databases, including SQLite, MySQL, PostgreSQL, and BigQuery. Although the data ultimately have to be read into R, you rarely need all the data at once. For example, from a large database, you may only need to select certain observations and read in two variables to examine their relationship. Even if the full dataset cannot fit in memory, by linking to a database and only pulling in what you need when you need it, you can use the memory available to go a lot farther. The package homepage for dplyr (https://cran.r-project.org/web/packages/dplyr/) has several introductions to these topics.
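
As a minimal sketch of that workflow (assuming the dplyr, dbplyr, and RSQLite packages are installed; the file name flights.sqlite is made up), we could store the flights data in a SQLite database on disk, let the database do the filtering and aggregation, and pull only the small summarized result into R with collect().

> library(dplyr)
> ## connect to (or create) a SQLite database file on disk
> con <- DBI::dbConnect(RSQLite::SQLite(), "flights.sqlite")
> ## write the flights data into the database once
> copy_to(con, flights, "flights", temporary = FALSE)
> ## the filter and summary run in the database; only one row comes back
> tbl(con, "flights") %>%
+   filter(month < 7) %>%
+   summarise(MeanDelay = mean(arr_delay)) %>%
+   collect()
> DBI::dbDisconnect(con)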

18.2 Parallel Processing

The other easy way to increase performance for some operations is through parallel processing. Again we are going to restrict ourselves to using multiple cores on one machine, not distributed computing. To see many of the packages available to help with this, a good place to start is the High-Performance Computing CRAN task view (https://cran.r-project.org/web/views/HighPerformanceComputing.html). One thing worth noting is that many parallel processing functions work better, or only work, on Linux or Mac (which are Unix-based). To keep this chapter general, we will focus on methods that apply across operating systems.

The parallel package is built into R now and provides a few facilities for parallel processing. A little later, we’ll examine other options. To start, we load the package. For Linux machines, there are some multicore functions we could use straight away, but to make this generic, we will create a local cluster that takes advantage of multiple cores and works on Linux, Mac, and Windows. We will make one assuming four cores are available. If you have two cores, you would just change the 4 to a 2. If you have more cores, you could increase the number. If you don’t know what your computer has, you can use the detectCores() function, which should tell you (note that this does not distinguish physical and logical cores, so, for example, two physical cores with hyperthreading will count as four).

> library(parallel)
> cl <- makeCluster(4)
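
Alternatively, rather than hard-coding the 4, we could size the cluster from the detected core count (a sketch that would replace the call above, leaving one core free for the rest of the system):

> ## a sketch: use one fewer worker than the number of detected cores
> cl <- makeCluster(max(1, detectCores() - 1))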

Because there is some overhead in sending commands to and getting results back from the cluster, for trivial operations, such as addition, it may actually be slower to use the parallel version. We have used the lapply() function before to loop through an index and perform operations. Here we will use parLapply(), a parallel version of it and the main workhorse.

> system.time(lapply(1:1000, function(i) i + 1))
   user  system elapsed
      0       0       0
> system.time(parLapply(cl, 1:1000, function(i) i + 1))
   user  system elapsed
      0       0       0

We can notice the real-time difference as the task becomes more computationally demanding.

> time1 <- system.time(lapply(1:1000, function(i) mean(rnorm(4e4))))
> time2 <- system.time(parLapply(cl, 1:1000, function(i) mean(rnorm(4e4))))

The nonparallel version here took about four times as long (the code that follows shows how to get the ratio of elapsed time), which is what we would expect for this sort of easily parallelized example (easy because no data need to be transferred, results are simple, each task is about equally computationally demanding, and no operations depend on a previous operation).

> time1["elapsed"] / time2["elapsed"]
 elapsed
4.063063

To give a practical example of the benefits of parallelization, we can go back to our bootstrapping example from Chapter 16.

> library(boot)
> library(VGAM)
> library(foreign)

> gss2012 <- read.spss("GSS2012merged_R5.sav", to.data.frame = TRUE)
> gssr <- gss2012[, c("age", "sex", "marital", "educ", "income06", "satfin", "happy", "health")]
> gssr <- na.omit(gssr)
> gssr <- within(gssr, {
+   age <- as.numeric(age)
+   Agec <- (gssr$age - 18) / 10
+   educ <- as.numeric(educ)
+   # recode income categories to numeric
+   cincome <- as.numeric(income06)
+   satfin <- factor(satfin,
+                    levels = c("NOT AT ALL SAT", "MORE OR LESS", "SATISFIED"),
+                    ordered = TRUE)
+ })

> m <- vglm(satfin ~ Agec + cincome * educ,
+           family = cumulative(link = "logit", parallel = TRUE, reverse = TRUE),
+           data = gssr)

> ## write function to pass to boot()
> model_coef_predictions <- function(d, i) {
+
+   m.tmp <- vglm(satfin ~ Agec + cincome * educ,
+                 family = cumulative(link = "logit", parallel = TRUE, reverse = TRUE),
+                 data = d[i, ])
+   newdat <- expand.grid(
+     Agec = seq(from = 0, to = (89 - 18)/10, length.out = 50),
+     cincome = mean(d$cincome),
+     educ = c(12, 16, 20))
+   bs <- coef(m.tmp)
+   predicted.probs <- predict(m.tmp, newdata = newdat,
+                              type = "response")
+   out <- c(bs, predicted.probs[, 1], predicted.probs[, 2], predicted.probs[, 3])
+   return(out)
+ }

In order to use the bootstrap on the cluster, we need the cluster to have everything set up; for example, we need to load the relevant packages on it. The boot() function is somewhat unusual in R in that it is designed to be parallelized and accepts arguments specifying a cluster to use. As we will see, most functions are not like that and rely on us being able to break the task down into smaller chunks and distribute them to the cluster ourselves. We can evaluate commands on the cluster using the clusterEvalQ() function, which returns the results from each of the nodes (here four).

> clusterEvalQ(cl, {
+   library(VGAM)
+ })
[[1]]
 [1] "VGAM"      "splines"   "stats4"    "methods"   "stats"     "graphics"
 [7] "grDevices" "utils"     "datasets"  "base"

[[2]]
 [1] "VGAM"      "splines"   "stats4"    "methods"   "stats"     "graphics"
 [7] "grDevices" "utils"     "datasets"  "base"

[[3]]
 [1] "VGAM"      "splines"   "stats4"    "methods"   "stats"     "graphics"
 [7] "grDevices" "utils"     "datasets"  "base"

[[4]]
 [1] "VGAM"      "splines"   "stats4"    "methods"   "stats"     "graphics"
 [7] "grDevices" "utils"     "datasets"  "base"

> clusterSetRNGStream(cl, iseed = 1234)
> boot.res <- boot(
+   data = gssr,
+   statistic = model_coef_predictions,
+   R = 5000,
+   parallel = "snow",
+   ncpus = 4,
+   cl = cl)

Next we calculate the 95% bias-corrected and accelerated bootstrap confidence intervals. The boot.ci() function is not designed to be parallel, so we can use the parLapply() function again. Because we are now calling boot.ci() on the cluster, we need to load the boot package on the cluster (earlier, boot() itself was called in our current R instance, not on the cluster, so we did not need to load the boot package there). We will also find that although our local R instance has the boot.res object, the cluster does not; we need to export it from our local R instance to the cluster. Note that if we were using Linux (or Mac), we could use mclapply(), a multicore version that relies on forking, which allows processes to share memory and reduces the need to export data explicitly as we must for the local cluster we created (a sketch of that approach appears after the results below). However, the cluster approach works across operating systems, while the mclapply() approach does not.

> clusterEvalQ(cl, {
+   library(boot)
+ })
> ## output omitted
> clusterExport(cl, varlist = "boot.res")

> boot.res2 <- parLapply(cl, 1:6, function(i) {
+   cis <- boot.ci(boot.res, index = i, type = "bca")
+   data.frame(Estimate = boot.res$t0[i],
+              LL = cis$bca[1, 4],
+              UL = cis$bca[1, 5])
+ })

Even parallelized, this code takes quite a bit of time to run. Here we just show it for the six coefficients, rather than all the predicted probabilities. We could easily change this by indexing over all of boot.res$t0 rather than only 1:6. It is substantially faster than the naive single-core version, and the actual code required to make it parallel is fairly easy.

> ## combine row-wise
> boot.res2 <- do.call(rbind, boot.res2)
> round(boot.res2, 3)
              Estimate     LL     UL
(Intercept):1    0.965  0.008  1.834
(Intercept):2   -1.189 -2.128 -0.300
Agec             0.169  0.128  0.215
cincome         -0.061 -0.112 -0.007
educ            -0.183 -0.251 -0.108
cincome:educ     0.012  0.008  0.016
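
As noted earlier, on Linux or Mac the boot.ci() step could instead be written with mclapply(), which forks the current R process so that boot.res does not need to be exported (a sketch only, not run here; the four in mc.cores assumes four available cores):

> ## Linux/Mac only: forked workers share memory with the current session
> boot.res2 <- mclapply(1:6, function(i) {
+   cis <- boot.ci(boot.res, index = i, type = "bca")
+   data.frame(Estimate = boot.res$t0[i],
+              LL = cis$bca[1, 4],
+              UL = cis$bca[1, 5])
+ }, mc.cores = 4)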

18.2.1 Other Parallel Processing Approaches

Another approach to parallel processing is the foreach package, which is essentially a consistent front end for parallelizing for loops. What is nice about it is that it can use a variety of parallel back ends, including multiple cores and clusters. This means that the appropriate back end for a specific system can be chosen and registered, and the rest of the code will work the same. For this example, we will continue using the cluster we created, but first we install the necessary packages. In addition to loading the foreach package, we need to load the doSNOW package and then register the cluster we created. On Linux or Mac, we could instead load the doMC package and call registerDoMC(), specifying the number of cores, to achieve a similar result using forking rather than a local cluster (a sketch follows the registration code below).

> install.packages("foreach")
> install.packages("doSNOW")

> library(foreach)
> library(doSNOW)
> registerDoSNOW(cl)
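
On Linux or Mac, the forked back end just mentioned would look something like this (a sketch; not run here):

> ## install.packages("doMC")   ## Linux/Mac only
> library(doMC)
> registerDoMC(cores = 4)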

From here, we can use the foreach() function to iterate over a variable and do something, here just taking the mean of some random data as we examined before using parLapply(). To make foreach() parallel, we use %dopar% instead of %do%. Another nice feature is that if no parallel back end is registered, %dopar% will still work, but it will run sequentially instead of in parallel. However, it will still run, which can be helpful for ensuring that code works (even if slowly) on many different machines and configurations. Finally, notice that we specify the function used to combine results. Since, for each run, we will get a single numeric mean, we can combine these into a vector using the c() function. For more complex examples, we could use different functions to combine the results. In the code that follows we show both approaches, along with timing and the histograms (Figure 18-1) to show that the results are comparable (differences are due to random variation).

> system.time(
+   res1 <- foreach(i = 1:1000, .combine = 'c') %do% mean(rnorm(4e4))
+ )
   user  system elapsed
   5.24    0.00    5.23
> system.time(
+   res2 <- foreach(i = 1:1000, .combine = 'c') %dopar% mean(rnorm(4e4))
+ )
   user  system elapsed
   0.53    0.02    1.94
> par(mfrow = c(1, 2))
> hist(res1)
> hist(res2)


Figure 18-1. Histograms of results from sequential and parallel processing

The foreach package also makes use of the iterators package (Revolution Analytics, 2014), which has some special iterators to make life easier. For example, suppose we wanted to calculate the ratio of the mean to the variance for each variable in a dataset. We can do this using the iter() function on the mtcars dataset, specifying that we want to iterate over the dataset by columns and that the variable passed should be called x. Here we choose to combine the results using rbind() to put them into a one-column matrix.

> foreach(x=iter(mtcars, by='col'), .combine = rbind) %dopar% (mean(x) / var(x))
                 [,1]
result.1   0.55309350
result.2   1.93994943
result.3   0.01502017
result.4   0.03120435
result.5  12.58061252
result.6   3.36047700
result.7   5.58967159
result.8   1.72222222
result.9   1.63157895
result.10  6.77407407
result.11  1.07805255

Now let’s consider a more advanced example. Suppose we wanted to regress every continuous variable in the diamonds dataset on cut, color, and clarity. More generally, it is not uncommon to have a fixed set of predictors and a variety of outcomes, or the reverse, a fixed outcome but a variety of potential predictors. To test which sets of predictors or outcomes are related, we may want to iterate over the predictors (or outcomes), running multiple independent regressions. To present the results nicely, we write a short function that shows the estimate and 95% confidence interval. We can then use the foreach() function to iterate over the continuous variables in the diamonds data and use cbind() to combine the results column-wise.

> prettyout <- function(object) {
+   cis <- confint(object)
+   bs <- coef(object)
+   out <- sprintf("%0.2f [%0.2f, %0.2f]", bs, cis[, 1], cis[, 2])
+   names(out) <- names(bs)
+   return(out)
+ }

> continuous.vars <- sapply(diamonds, is.numeric)
> results <- foreach(dv=iter(diamonds[, continuous.vars], by='col'), .combine = cbind) %dopar%
+   prettyout(lm(dv ~ cut + color + clarity, data = diamonds))

Now we can print the results. To remove all the quotation marks, we explicitly call print() and use the quote = FALSE argument. To avoid too much output, we only examine the first two columns.

> print(results[, 1:2], quote = FALSE)
            result.1             result.2
(Intercept) 0.85 [0.85, 0.86]    62.24 [62.22, 62.27]
cut.L       -0.09 [-0.10, -0.07] -1.77 [-1.81, -1.72]
cut.Q       0.01 [-0.00, 0.02]   1.11 [1.07, 1.15]
cut.C       -0.08 [-0.09, -0.07] -0.01 [-0.05, 0.02]
cut^4       -0.02 [-0.03, -0.01] 0.26 [0.24, 0.29]
color.L     0.45 [0.44, 0.46]    0.19 [0.15, 0.23]
color.Q     0.07 [0.06, 0.08]    -0.01 [-0.04, 0.03]
color.C     -0.01 [-0.02, 0.00]  -0.07 [-0.10, -0.03]
color^4     0.01 [-0.00, 0.02]   0.03 [0.00, 0.06]
color^5     -0.02 [-0.03, -0.01] 0.02 [-0.01, 0.05]
color^6     0.01 [-0.00, 0.01]   -0.02 [-0.04, 0.01]
clarity.L   -0.64 [-0.66, -0.62] -0.41 [-0.47, -0.34]
clarity.Q   0.14 [0.12, 0.16]    0.10 [0.04, 0.17]
clarity.C   -0.03 [-0.04, -0.01] -0.16 [-0.22, -0.10]
clarity^4   0.01 [-0.00, 0.03]   0.08 [0.03, 0.12]
clarity^5   0.07 [0.06, 0.08]    -0.14 [-0.17, -0.10]
clarity^6   -0.02 [-0.03, -0.01] 0.07 [0.04, 0.10]
clarity^7   0.01 [0.00, 0.02]    -0.01 [-0.03, 0.02]

Next we examine another case where parallel processing can be helpful. Cross-validation is a commonly used technique in machine learning and other exploratory modeling. The idea is that models tend to overfit the data, even if only slightly, so evaluating the performance of a model on in-sample predictions will be overly optimistic. Instead, it is better to evaluate the performance of a model on out-of-sample predictions (i.e., on data not used to fit the model). K-fold cross-validation separates the data into k groups, systematically leaves one group out, trains the model on the remaining k - 1 groups, predicts from the model on the hold-out data, and iterates until every part of the data has served both for training and as the hold-out test set. A common choice for k is 10, which requires ten identical models to be run on different subsets of the data, a perfect case that is easy to parallelize.

First we will create a list of row indices to be dropped from a given training model and used as the hold-out test data, and then use foreach to iterate through. In the example code that follows, we want to calculate a cross-validated R2, then combine the results into a vector, and finally calculate the mean. We can specify all of that in the call to foreach(). Then the results are compared to a linear regression model on the full data.

> ## cross validated R squared
> drop.index <- tapply(1:nrow(gssr),
+   rep(1:10, each = ceiling(nrow(gssr)/10))[1:nrow(gssr)],
+   function(x) x)

> CV <- foreach(i = drop.index, .combine = 'c', .final = mean) %dopar% {
+   m <- lm(cincome ~ educ, data = gssr[-i, ])
+   cor(gssr[i, "cincome"], predict(m, newdata = gssr[i, ]))^2
+ }

> summary(lm(cincome ~ educ, data = gssr))

Call:
lm(formula = cincome ~ educ, data = gssr)

Residuals:
     Min       1Q   Median       3Q      Max
-21.2006  -2.8900   0.9035   3.6270  14.0418

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.64760    0.45438   12.43   <2e-16 ***
educ         0.82765    0.03224   25.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.183 on 2833 degrees of freedom
Multiple R-squared:  0.1887,    Adjusted R-squared:  0.1884
F-statistic: 658.9 on 1 and 2833 DF,  p-value: < 2.2e-16

> CV
[1] 0.1829171

Although the results here are not too different, it illustrates how cross-validation can be performed, in parallel, to obtain more realistic estimates of model performance. Of course in this case, we have the adjusted R2, but for performance statistics where we do not know how to adjust them, cross-validation provides a computationally intense but conceptually straightforward approach.

The foreach package can also handle nested parallel loops. For example, we could combine cross-validation with screening a number of predictor variables to identify which predictors explain the largest amount of variance in income bins in the GSS data. This can be done by chaining foreach() calls together using the %:% operator. The result looks like the following; at the end, we add the column names back and show the variance accounted for across all ten cross-validation folds.

> CV2 <-
+   foreach(x = iter(gssr[, 1:4], by = "col"), .combine = "cbind") %:%
+     foreach(i = drop.index, .combine = 'rbind') %dopar% {
+       x2 <- x[-i]
+       m <- lm(gssr$cincome[-i] ~ x2)
+       cor(gssr$cincome[i], predict(m, newdata = data.frame(x2 = x[i])))^2
+ }

> colnames(CV2) <- colnames(gssr)[1:4]
> round(CV2, 2)
           age  sex marital educ
result.1  0.00 0.02    0.19 0.19
result.2  0.00 0.00    0.19 0.23
result.3  0.01 0.01    0.27 0.19
result.4  0.01 0.01    0.18 0.15
result.5  0.01 0.02    0.22 0.23
result.6  0.01 0.02    0.22 0.13
result.7  0.04 0.00    0.14 0.16
result.8  0.00 0.02    0.19 0.24
result.9  0.00 0.06    0.27 0.21
result.10 0.00 0.00    0.25 0.09

We will close this section by showing how to do conditional evaluation. Recall that in our earlier diamonds example, we preselected the numeric columns and iterated over them. Using the same operator we used for nested loops, %:%, we can add when() statements, so that the functions are only executed when certain conditions are met. For example, we could use a when() statement to include only numeric variables, rather than preselecting the columns we knew were numeric. This can also be useful for resampling statistics like the bootstrap. For example, if one of the predictors were a rare event, it is possible that in a particular bootstrap sample none of the sampled cases would have the event, in which case the predictor would have no variability. We might prefer not to evaluate these rather than get results with missing or infinite coefficients.

> results <- foreach(dv=iter(diamonds, by='col'), .combine = cbind) %:%
+   when(is.numeric(dv)) %dopar%
+   prettyout(lm(dv ~ cut + color + clarity, data = diamonds))

> print(results[, 1:2], quote = FALSE)
            result.1             result.2
(Intercept) 0.85 [0.85, 0.86]    62.24 [62.22, 62.27]
cut.L       -0.09 [-0.10, -0.07] -1.77 [-1.81, -1.72]
cut.Q       0.01 [-0.00, 0.02]   1.11 [1.07, 1.15]
cut.C       -0.08 [-0.09, -0.07] -0.01 [-0.05, 0.02]
cut^4       -0.02 [-0.03, -0.01] 0.26 [0.24, 0.29]
color.L     0.45 [0.44, 0.46]    0.19 [0.15, 0.23]
color.Q     0.07 [0.06, 0.08]    -0.01 [-0.04, 0.03]
color.C     -0.01 [-0.02, 0.00]  -0.07 [-0.10, -0.03]
color^4     0.01 [-0.00, 0.02]   0.03 [0.00, 0.06]
color^5     -0.02 [-0.03, -0.01] 0.02 [-0.01, 0.05]
color^6     0.01 [-0.00, 0.01]   -0.02 [-0.04, 0.01]
clarity.L   -0.64 [-0.66, -0.62] -0.41 [-0.47, -0.34]
clarity.Q   0.14 [0.12, 0.16]    0.10 [0.04, 0.17]
clarity.C   -0.03 [-0.04, -0.01] -0.16 [-0.22, -0.10]
clarity^4   0.01 [-0.00, 0.03]   0.08 [0.03, 0.12]
clarity^5   0.07 [0.06, 0.08]    -0.14 [-0.17, -0.10]
clarity^6   -0.02 [-0.03, -0.01] 0.07 [0.04, 0.10]
clarity^7   0.01 [0.00, 0.02]    -0.01 [-0.03, 0.02]

Finally, once we are done using a cluster, we need to shut it down, which can be done using the stopCluster() command. Otherwise, even if we no longer use the cluster or workers, they will sit there and use system resources like memory.

> stopCluster(cl)

In closing, we have seen how using data management packages designed for large data and speed can have a dramatic impact on the time it takes to do data manipulation and management on larger datasets, although the data must still fit within memory. We have also seen how, for some problems, it is easy to gain great speed advantages through parallel processing when multiple cores are available. All the parallel processing examples in this chapter were explicit parallelization; that is, we processed variables or for loops in parallel. None of these approaches would help if you had a single regression model that was very slow to complete, or a matrix multiplication or decomposition that took a long time. Currently, not many R functions are designed with implicit parallelization. One way such implicit parallelization can be achieved is by linking R to parallel linear algebra libraries such as ATLAS or GotoBLAS2, although this is a decidedly nonbeginner topic and outside the scope of this book. One easy way to get going with this is to use a modified version of R provided by Revolution Analytics (www.revolutionanalytics.com/revolution-r-open), which uses the Intel Math Kernel Library for parallel processing of a variety of math operations, speeding up tasks such as matrix multiplication and principal component analysis.

References

Revolution Analytics. iterators: Iterator construct for R. R package version 1.0.7, 2014. http://CRAN.R-project.org/package=iterators.

Wickham, H. nycflights13: Data about flights departing NYC in 2013. R package version 0.1, 2014. http://CRAN.R-project.org/package=nycflights13.
