So it seems that missing data occurs relatively frequently in the time-related variables, while there are no missing values among the flight identifiers and dates. On the other hand, if one value is missing for a flight, the chances are rather high that some other variables are missing as well. Overall, 3,622 cases have at least one missing value, and the missing-value indicators of the variables are strongly correlated on average:
> mean(cor(apply(hflights, 2, function(x)
+     as.numeric(is.na(x)))), na.rm = TRUE)
[1] 0.9589153
Warning message:
In cor(apply(hflights, 2, function(x) as.numeric(is.na(x)))) :
  the standard deviation is zero
Okay, let's see what we have done here! First, we called the apply function to transform the values of the data.frame to 0 or 1, where 0 stands for an observed value and 1 for a missing one. Then we computed the correlation coefficients of this newly created matrix, which of course returned a lot of missing values due to the fact that some columns had only one unique value without any variability, as shown in the warning message. This is why we had to set the na.rm parameter to TRUE, so that the mean function would return a real value instead of NA by removing the missing values among the correlation coefficients returned by the cor function.
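The same steps can be followed on a much smaller example. The following is only a sketch on a made-up data frame (the flights object and its values are hypothetical and not taken from hflights), showing the 0/1 indicator matrix, the NA correlations caused by a column without any missing values, and the effect of na.rm:

```r
# Toy data frame with hypothetical values (not taken from hflights)
flights <- data.frame(
  id       = 1:6,
  dep_time = c(900, NA, 1130, NA, 1545, 1800),
  arr_time = c(1010, NA, 1250, NA, 1700, NA)
)

# is.na() on a data frame returns a logical matrix; adding 0 turns it into 0/1
miss <- is.na(flights) + 0

# Share of missing values per column
colMeans(miss)

# Correlations between the indicators; the constant 'id' indicator has zero
# standard deviation, so its correlations are NA (warning suppressed here)
cors <- suppressWarnings(cor(miss))

# Without na.rm = TRUE the average is NA; with it, the NA cells are skipped
mean(cors)
mean(cors, na.rm = TRUE)
```

Note that the average computed this way also includes the diagonal of the correlation matrix, just like in the original call.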
So one option is the heavy use of the na.rm argument, which is supported by most functions that are sensitive to missing data. To name a few from the base and stats packages: mean, median, sum, max and min.
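As a quick illustration, the following sketch calls each of these functions on a made-up vector (x and res are names chosen only for this example), once with the defaults and once with na.rm = TRUE:

```r
# Hypothetical vector with one missing value
x <- c(4, 8, NA, 15, 16)

# Call each function with the default arguments and with na.rm = TRUE
res <- sapply(
  list(mean = mean, median = median, sum = sum, max = max, min = min),
  function(f) c(default = f(x), na.rm = f(x, na.rm = TRUE))
)
res
```

The first row of the resulting matrix holds only NA values, while the second row shows the statistics computed on the four observed values.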
To compile the complete list of functions that have the na.rm argument in the base package, we can follow the steps described in a very interesting SO answer located at http://stackoverflow.com/a/17423072/564164. I found this answer motivating because I truly believe in the power of analyzing the tools we use for analysis, or in other words, spending some time on understanding how R works in the background.
First, let's make a list of all the functions found in baseenv (the environment of the base package) along with the complete function arguments and body:
> Funs <- Filter(is.function, sapply(ls(baseenv()), get, baseenv()))
Then we can Filter the returned list for all those functions that have na.rm among their formal arguments:
> names(Filter(function(x)
+     any(names(formals(args(x))) %in% 'na.rm'), Funs))
 [1] "all"                     "any"
 [3] "colMeans"                "colSums"
 [5] "is.unsorted"             "max"
 [7] "mean.default"            "min"
 [9] "pmax"                    "pmax.int"
[11] "pmin"                    "pmin.int"
[13] "prod"                    "range"
[15] "range.default"           "rowMeans"
[17] "rowsum.data.frame"       "rowsum.default"
[19] "rowSums"                 "sum"
[21] "Summary.data.frame"      "Summary.Date"
[23] "Summary.difftime"        "Summary.factor"
[25] "Summary.numeric_version" "Summary.ordered"
[27] "Summary.POSIXct"         "Summary.POSIXlt"
This can easily be applied to any R package by changing the environment, for example, to 'package:stats' in the case of the stats package:
> names(Filter(function(x)
+     any(names(formals(args(x))) %in% 'na.rm'),
+     Filter(is.function,
+         sapply(ls('package:stats'), get, 'package:stats'))))
 [1] "density.default" "fivenum"         "heatmap"
 [4] "IQR"             "mad"             "median"
 [7] "median.default"  "medpolish"       "sd"
[10] "var"
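The two lookups above differ only in the environment, so they can be wrapped in a small helper. This is merely a sketch: has_na_rm and funs_with_na_rm are hypothetical names introduced here, and the exact list returned depends on the R version in use:

```r
# Does a function accept an na.rm argument? (hypothetical helper)
has_na_rm <- function(f) {
  a <- args(f)  # args() also covers primitives such as sum()
  is.function(a) && "na.rm" %in% names(formals(a))
}

# Names of all functions in an environment that have na.rm (hypothetical helper)
funs_with_na_rm <- function(env) {
  objs <- sapply(ls(env), get, envir = env, simplify = FALSE)
  names(Filter(has_na_rm, Filter(is.function, objs)))
}

# Base package
head(funs_with_na_rm(baseenv()))

# Any attached package, for example stats
funs_with_na_rm(as.environment("package:stats"))
```

The is.function check on the result of args guards against primitives for which args returns NULL.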
So these are the functions that have the na.rm argument in the base and stats packages. As we have seen, the fastest and easiest way of ignoring missing values in single function calls (without actually removing the NA values from the dataset) is setting na.rm to TRUE. But why doesn't na.rm default to TRUE?
If you are annoyed by the fact that most functions return NA if your R object includes missing values, then you can override this behavior with some custom wrapper functions, such as:
> myMean <- function(...) mean(..., na.rm = TRUE)
> mean(c(1:5, NA))
[1] NA
> myMean(c(1:5, NA))
[1] 3
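If several such wrappers are needed, they can be generated instead of being written one by one. The following is a minimal sketch of this idea, where na_rm, myMedian and mySd are hypothetical names used only for illustration:

```r
# Function factory: returns a wrapper that always passes na.rm = TRUE
na_rm <- function(f) {
  force(f)  # evaluate f now, so that later changes to the argument are ignored
  function(...) f(..., na.rm = TRUE)
}

myMedian <- na_rm(median)
mySd     <- na_rm(sd)

median(c(1:5, NA))    # NA
myMedian(c(1:5, NA))  # 3
mySd(c(1:5, NA))      # same as sd(1:5)
```

One limitation of this design is that the wrappers no longer accept an explicit na.rm argument, as that would be matched twice in the inner call.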
Another option might be to write a custom package that overrides the factory defaults of the base and stats functions, as the rapportools package does, which includes miscellaneous helper functions with sane defaults for reporting:
> library(rapportools)
Loading required package: reshape
Attaching package: 'rapportools'
The following objects are masked from 'package:stats':
    IQR, median, sd, var
The following objects are masked from 'package:base':
    max, mean, min, range, sum
> mean(c(1:5, NA))
[1] 3
The problem with this approach is that you've just overridden the listed functions for the rest of the session, so you'll need to restart your R session or detach the rapportools package to reset to the standard arguments, like:
> detach('package:rapportools')
> mean(c(1:5, NA))
[1] NA
A more general solution to override the default arguments of a function is to rely on some nifty features of the Defaults package, which is not under active maintenance, but does the job:
> library(Defaults)
> setDefaults(mean.default, na.rm = TRUE)
> mean(c(1:5, NA))
[1] 3
Please note that here we had to update the default argument value of mean.default instead of simply trying to tweak mean, as the latter would only result in a warning and leave the default unchanged:
> setDefaults(mean, na.rm = TRUE)
Warning message:
In setDefaults(mean, na.rm = TRUE) :
  'na.rm' was not set, possibly not a formal arg for 'mean'
This is due to the fact that mean is an S3 generic function, whose only formal arguments are x and ..., so na.rm is not among them:
> mean
function (x, ...)
{
    if (exists(".importDefaults"))
        .importDefaults(calling.fun = "mean")
    UseMethod("mean")
}
<environment: namespace:base>
> formals(mean)
$x


$...


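The difference between the generic and its default method can also be checked programmatically in a plain R session. A short sketch using only base functions (in_generic and in_method are names chosen for this example):

```r
# The generic only has x and ... among its formal arguments
names(formals(mean))

# na.rm is defined by the default method instead
in_generic <- "na.rm" %in% names(formals(mean))
in_method  <- "na.rm" %in% names(formals(mean.default))
c(generic = in_generic, method = in_method)

# na.rm still works when calling the generic, as it is passed on via ...
mean(c(1:5, NA), na.rm = TRUE)
```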
Whichever method you prefer, you can automatically call those functions when R starts by adding a few lines of code to your Rprofile file.
You can customize the R environment via a global or user-specific Rprofile file. This is a normal R script that is run every time a new R session is started, and it is usually placed in the user's home directory with a leading dot in the file name. There you can call any R functions wrapped in the .First or .Last functions to be run at the start or at the end of the R session. Useful additions might be loading some R packages, printing custom greetings or KPI metrics from a database, or, for example, installing the most recent versions of all R packages.
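A user-level Rprofile along these lines might look like the following sketch. The messages are purely illustrative, and the interactive() checks are an assumption added here to keep non-interactive scripts silent:

```r
# Sketch of a user-level ~/.Rprofile; the messages are illustrative only

# .First is called at the start of the session
.First <- function() {
  if (interactive()) {
    message("R session started at ", format(Sys.time(), "%H:%M:%S"))
  }
}

# .Last is called when the session is closed normally, for example via q()
.Last <- function() {
  if (interactive()) {
    message("R session closed at ", format(Sys.time(), "%H:%M:%S"))
  }
}
```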
But it's probably better not to tweak your R environment in such a non-standard way, as you might soon experience some esoteric and unexpected errors or silent malfunctions in your analysis.
For example, I've got used to working in a temporary directory at all times by specifying setwd('/tmp') in my Rprofile, which is very useful if you start R sessions frequently for some quick jobs. On the other hand, it's really frustrating to spend 15 minutes of your life debugging why some random R function does not seem to do its job and returns file-not-found error messages instead.
So please be warned: if you update the factory default arguments of R functions, do not ever think of ranting on the R mailing lists about some new bugs you have found in major functions of base R before trying to reproduce those errors in a vanilla R session, started with the --vanilla command line option.