Filtering missing data before or during the actual analysis

Let's suppose we want to calculate the mean of the actual length of flights:

> mean(hflights$ActualElapsedTime)
[1] NA

The result is of course NA because, as identified previously, this variable contains missing values, and almost any R operation on NA yields NA. So let's overcome this issue as follows:

> mean(hflights$ActualElapsedTime, na.rm = TRUE)
[1] 129.3237
> mean(na.omit(hflights$ActualElapsedTime))
[1] 129.3237

Are there any performance differences between these two approaches, or other criteria for deciding which one to use?

> library(microbenchmark)
> NA.RM   <- function()
+              mean(hflights$ActualElapsedTime, na.rm = TRUE)
> NA.OMIT <- function()
+              mean(na.omit(hflights$ActualElapsedTime))
> microbenchmark(NA.RM(), NA.OMIT())
Unit: milliseconds
      expr       min        lq    median        uq       max neval
   NA.RM()  7.105485  7.231737  7.500382  8.002941  9.850411   100
 NA.OMIT() 12.268637 12.471294 12.905777 13.376717 16.008637   100

A first glance at the performance of these options, measured with the help of the microbenchmark package (please see the Loading text files of reasonable size section in Chapter 1, Hello Data for more details), suggests that using na.rm is the faster solution in the case of a single function call.

On the other hand, if we want to reuse the data at a later stage of the analysis, it is more viable and efficient to omit the missing values and observations from the dataset only once, instead of specifying na.rm = TRUE in every call.
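
For example, a minimal sketch of this approach could look as follows (the actual_time object name is just an illustrative choice):

> actual_time <- na.omit(hflights$ActualElapsedTime)
> mean(actual_time)
[1] 129.3237

Any further summaries, such as median(actual_time) or sd(actual_time), can then be computed on the cleaned vector directly, without repeating na.rm = TRUE in each call.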
