Let's suppose we want to calculate the mean of the actual length of flights:
> mean(hflights$ActualElapsedTime)
[1] NA
The result is NA, of course: as identified previously, this variable contains missing values, and almost every R operation involving NA results in NA. So let's overcome this issue as follows:
> mean(hflights$ActualElapsedTime, na.rm = TRUE)
[1] 129.3237
> mean(na.omit(hflights$ActualElapsedTime))
[1] 129.3237
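To see the propagation of NA in isolation, here is a minimal sketch on a toy vector. The vector x and its values are made up for illustration and are not taken from hflights:

```r
# Hypothetical toy vector standing in for a column with missing values
x <- c(120, NA, 95, 143)

sum(x)                  # NA: arithmetic involving a missing value is missing
mean(x)                 # NA for the same reason
mean(x, na.rm = TRUE)   # removes the NA before averaging
mean(na.omit(x))        # same value, computed on a cleaned copy of x

# na.omit() also records which positions were dropped
attr(na.omit(x), "na.action")
```

Both approaches return the same mean; the difference is that na.rm leaves x untouched, while na.omit() produces a new object without the missing values.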
Any performance issues there? Or other means of deciding which method to use?
> library(microbenchmark)
> NA.RM   <- function()
+     mean(hflights$ActualElapsedTime, na.rm = TRUE)
> NA.OMIT <- function()
+     mean(na.omit(hflights$ActualElapsedTime))
> microbenchmark(NA.RM(), NA.OMIT())
Unit: milliseconds
      expr       min        lq    median        uq       max neval
   NA.RM()  7.105485  7.231737  7.500382  8.002941  9.850411   100
 NA.OMIT() 12.268637 12.471294 12.905777 13.376717 16.008637   100
A first glance at the performance of these options, computed with the help of the microbenchmark package (please see the Loading text files of reasonable size section in Chapter 1, Hello Data for more details), suggests that using na.rm is the better solution for a single function call.
On the other hand, if we want to reuse the data at some later phase in the analysis, it is more practical and efficient to omit the missing values, and the observations that contain them, from the dataset only once, instead of specifying na.rm = TRUE in every call.
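The idea of omitting once and reusing the result can be sketched as follows. The small data frame flights and its values are invented for illustration; the same pattern would apply to hflights:

```r
# Hypothetical miniature of a flights table (made-up values)
flights <- data.frame(
  ActualElapsedTime = c(60, NA, 75, 130),
  Distance          = c(224, 224, NA, 944)
)

# Drop every row that has a missing value in any column, once
flights_complete <- na.omit(flights)

# Later calls no longer need na.rm = TRUE
mean(flights_complete$ActualElapsedTime)
mean(flights_complete$Distance)

# Number of observations removed by the cleaning step
nrow(flights) - nrow(flights_complete)
```

Note that na.omit() on a data frame drops a row if any of its columns is missing, so it may discard more observations than na.rm = TRUE would when applied to a single column.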