Bypassing missing values

So it seems that missing data occurs relatively frequently in the time-related variables, while there are no missing values among the flight identifiers and dates. On the other hand, if one value is missing for a flight, the chances are rather high that some other variables are missing as well. Out of the overall number of 3,622 cases with at least one missing value:

> mean(cor(apply(hflights, 2, function(x)
+    as.numeric(is.na(x)))), na.rm = TRUE)
[1] 0.9589153
Warning message:
In cor(apply(hflights, 2, function(x) as.numeric(is.na(x)))) :
  the standard deviation is zero

Okay, let's see what we have done here! First, we called the apply function to transform the values of the data.frame to 0 or 1, where 0 stands for an observed and 1 for a missing value. Then we computed the correlation coefficients of this newly created matrix, which of course returned a lot of missing values, as some columns had only one unique value without any variability, as shown in the warning message. For this reason, we had to set the na.rm parameter to TRUE, so that the mean function would return a real value instead of NA by removing the missing values among the correlation coefficients returned by cor.
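The same technique is easy to reproduce on a small, made-up data frame, where we can see exactly which entries of the correlation matrix turn into NA. Please note that the df object and its values below are purely hypothetical and not part of the hflights dataset:

```r
## A hypothetical example: id is complete, while dep and arr
## have missing values in (mostly) the same rows
df <- data.frame(
  id  = 1:6,                          # complete identifier column
  dep = c(10, NA, 30, NA, 50, 60),    # missing in rows 2 and 4
  arr = c(15, NA, 35, NA, 55, NA))    # missing in rows 2, 4 and 6

## Transform the data frame into a 0/1 indicator matrix of missingness
miss <- apply(df, 2, function(x) as.numeric(is.na(x)))

## The constant id column has zero standard deviation, so cor
## returns NA for all coefficients involving it, hence na.rm = TRUE
mean(cor(miss), na.rm = TRUE)  # ~0.854 for this toy data
```

The high mean correlation reflects that dep and arr tend to be missing together, just like the time-related variables of hflights.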

So one option is the heavy use of the na.rm argument, which is supported by most functions that are sensitive to missing data—to name a few from the base and stats packages: mean, median, sum, max and min.
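As a quick illustration of how these functions behave with and without the argument:

```r
x <- c(1, 2, NA, 4)
mean(x)                 # returns NA, as the vector contains a missing value
mean(x, na.rm = TRUE)   # 2.333333, the mean of the observed values
sum(x, na.rm = TRUE)    # 7
range(x, na.rm = TRUE)  # 1 4
```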

To compile the complete list of functions that have the na.rm argument in the base package, we can follow the steps described in a very interesting SO answer located at http://stackoverflow.com/a/17423072/564164. I found this answer motivating because I truly believe in the power of analyzing the tools we use for analysis, or in other words, spending some time on understanding how R works in the background.

First, let's make a list of all the functions found in baseenv (the environment of the base package) along with the complete function arguments and body:

> Funs <- Filter(is.function, sapply(ls(baseenv()), get, baseenv()))

Then we can Filter the returned list for those functions that have na.rm among their formal arguments via the following:

> names(Filter(function(x)
+    any(names(formals(args(x))) %in% 'na.rm'), Funs))
 [1] "all"                     "any"                    
 [3] "colMeans"                "colSums"                
 [5] "is.unsorted"             "max"                    
 [7] "mean.default"            "min"                    
 [9] "pmax"                    "pmax.int"               
[11] "pmin"                    "pmin.int"               
[13] "prod"                    "range"                  
[15] "range.default"           "rowMeans"               
[17] "rowsum.data.frame"       "rowsum.default"         
[19] "rowSums"                 "sum"                    
[21] "Summary.data.frame"      "Summary.Date"           
[23] "Summary.difftime"        "Summary.factor"         
[25] "Summary.numeric_version" "Summary.ordered"        
[27] "Summary.POSIXct"         "Summary.POSIXlt"  

This can easily be applied to any R package by changing the environment name, for example to 'package:stats' in the case of the stats package:

> names(Filter(function(x)
+   any(names(formals(args(x))) %in% 'na.rm'),
+     Filter(is.function,
+       sapply(ls('package:stats'), get, 'package:stats'))))
 [1] "density.default" "fivenum"         "heatmap"        
 [4] "IQR"             "mad"             "median"         
 [7] "median.default"  "medpolish"       "sd"             
[10] "var"                

So these are the functions that have the na.rm argument in the base and the stats packages, where we have seen that the fastest and easiest way of ignoring missing values in single function calls (without actually removing the NA values from the dataset) is setting na.rm to TRUE. But why doesn't na.rm default to TRUE?

Overriding the default arguments of a function

If you are annoyed by the fact that most functions return NA if your R object includes missing values, then you can override the defaults with some custom wrapper functions, such as:

> myMean <- function(...) mean(..., na.rm = TRUE)
> mean(c(1:5, NA))
[1] NA
> myMean(c(1:5, NA))
[1] 3
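Instead of writing a separate wrapper for every function, the same idea can be generalized with a small function factory. The withNaRm helper and the derived function names below are hypothetical, shown only as a sketch of this approach:

```r
## A hypothetical factory that wraps any function accepting na.rm,
## so that na.rm defaults to TRUE in the returned wrapper
withNaRm <- function(FUN) function(...) FUN(..., na.rm = TRUE)

mySum <- withNaRm(sum)
myMax <- withNaRm(max)

mySum(c(1:5, NA))  # 15
myMax(c(1:5, NA))  # 5
```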

Another option might be to write a custom package that overrides the factory defaults of the base and stats functions, like the rapportools package does, which includes miscellaneous helper functions with sane defaults for reporting:

> library(rapportools)
Loading required package: reshape

Attaching package: 'rapportools'

The following objects are masked from 'package:stats':

    IQR, median, sd, var

The following objects are masked from 'package:base':

    max, mean, min, range, sum

> mean(c(1:5, NA))
[1] 3

The problem with this approach is that you've just masked the listed functions for the rest of your R session, so you'll need to restart R or detach the rapportools package to restore the standard behavior, for example:

> detach('package:rapportools')
> mean(c(1:5, NA))
[1] NA
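Note that even while a masking package such as rapportools is attached, the original functions remain reachable through the :: operator, so the masking can be bypassed on a per-call basis:

```r
## The :: operator always resolves to the named package,
## regardless of what is masking it on the search path
base::mean(c(1:5, NA))                # NA, the factory default behavior
base::mean(c(1:5, NA), na.rm = TRUE)  # 3
```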

A more general solution for overriding the default arguments of a function is to rely on some nifty features of the Defaults package, which, although no longer under active maintenance, still does the job:

> library(Defaults)
> setDefaults(mean.default, na.rm = TRUE)
> mean(c(1:5, NA))
[1] 3

Please note that here we had to update the default argument value of mean.default instead of simply trying to tweak mean, as the latter would only result in a warning:

> setDefaults(mean, na.rm = TRUE)
Warning message:
In setDefaults(mean, na.rm = TRUE) :
  'na.rm' was not set, possibly not a formal arg for 'mean'

This is due to the fact that mean is an S3 generic whose formal arguments do not include na.rm:

> mean
function (x, ...) 
{
    if (exists(".importDefaults")) 
        .importDefaults(calling.fun = "mean")
    UseMethod("mean")
}
<environment: namespace:base>
> formals(mean)
$x

$...
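
The na.rm argument lives in the default method instead, which is why setDefaults had to target mean.default:

```r
## The default method is where na.rm is actually defined
names(formals(mean.default))
## [1] "x"     "trim"  "na.rm" "..."
```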

Whichever method you prefer, you can call those functions automatically when R starts by adding a few lines of code to your Rprofile file.

Note

You can customize the R environment via a global or user-specific Rprofile file. This is a normal R script, usually placed in the user's home directory with a leading dot in the file name, which is run every time a new R session is started. There you can define the .First and .Last functions, which are run at the start and at the end of the R session respectively. Useful additions might be loading some R packages, printing custom greetings or KPI metrics from a database, or, for example, installing the most recent versions of all R packages.
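A minimal sketch of such a file might look like the following; the greeting messages are, of course, arbitrary placeholders:

```r
## Example content for a ~/.Rprofile file
.First <- function() {
  message('Welcome back! R session started at ', date())
}
.Last <- function() {
  message('Goodbye, session ended at ', date())
}
```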

But it's probably better not to tweak your R environment in such a non-standard way, as you might soon experience some esoteric and unexpected errors or silent malfunctions in your analysis.

For example, I've got used to working in a temporary directory at all times by specifying setwd('/tmp') in my Rprofile, which is very useful if you start R sessions frequently for some quick jobs. On the other hand, it's really frustrating to spend 15 minutes of your life debugging why some random R function does not seem to do its job, and why it's returning some file not found error messages instead.

So please be warned: if you update the factory default arguments of R functions, do not even think of ranting about new bugs you have found in major functions of base R on the R mailing lists before trying to reproduce those errors in a vanilla R session, started with the --vanilla command-line option.
