One of the biggest problems in real-world data is missing data. In carefully planned experiments on inanimate chemicals, small samples of rats, or highly mechanized factories, missing data may not be such a problem. However, whenever a dataset gets large enough, or starts to involve humans, missing data is almost a certainty. Let's begin by pointing out that if you have missing data, then you have a missing data problem, and you have to do something with that missing data; the question is, what? The answer lies in what kind of bias you are dealing with as a result of missing data.
Before we delve into the statistical aspects of missing data, we need to review the computational ones. R represents absent data in at least two different ways: NA and NULL. NA is a missing value, but there are multiple types of NA, and R will automatically coerce a missing value to what it thinks is the appropriate type. For example, in the cleaned.pumpkins data frame, we would expect NA to be of numeric type, because the column was coerced to a numeric vector type when we created it. Let's have a look at the following code:
> cleaned.pumpkins
  weight location
1    2.3   europe
2    2.4   europe
3    3.1       US
4    2.7       US
5     NA       US
> cleaned.pumpkins[5,1]
[1] NA
> typeof(cleaned.pumpkins[5,1])
[1] "double"
We can see here that R keeps in mind that the NA value in the first column is still of the double data type. Let's compare this with a nonexistent value. Let's select the first row and third column (notice that the cleaned.pumpkins data frame has no third column) and assign it to a variable, as follows:
> b <- cleaned.pumpkins[1,3]
> b
NULL
> typeof(b)
[1] "NULL"
We can see here that b does not contain an NA but rather a NULL value, and this NULL value has no type other than NULL itself. R functions that are designed to handle missing data with imputation operate on NA values rather than NULL values.
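The distinction between NA and NULL is easy to verify interactively. The following sketch uses only base R; note in particular that a bare NA starts out as a logical value and is coerced by its context:

```r
# NA starts out as a logical value, but coerces to match its context
typeof(NA)            # "logical"
typeof(c(1.5, NA))    # "double" -- the NA is coerced along with the vector
typeof(c("a", NA))    # "character"

# NULL, by contrast, is the absence of a value entirely
length(NA)            # 1 -- NA occupies a slot in a vector
length(NULL)          # 0 -- NULL occupies nothing

is.na(c(1.5, NA))     # FALSE TRUE -- this is what imputation code looks for
is.null(NULL)         # TRUE
```

This is why functions such as is.na, complete.cases, and the imputation routines discussed later all key off NA: a NULL cannot even occupy a cell of a data frame column.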
We speak of a particular variable in a dataset being missing, and there can be three kinds of missingness in that variable: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). These are described as follows:
Data that is MCAR introduces no bias. For example, let's say you are carrying a stack of papers that describe the survey responses in your study and you drop them, and the wind blows half of them away. Which half? It is completely random, so we would expect the half that remains to resemble the half that blew away; there is no bias introduced by the mechanism of missingness.
Data that is MAR will have bias. For example, let's say that we have a study of age and income, and that younger people simply ignore the survey in the mail. Then we have missing income data, and this data is missing at random (a terrible name), because the income data is missing based on age and not on income itself, once age is controlled for. As you can imagine, this may introduce bias.
In our age and income study, if those with higher incomes opted not to report their incomes, then the data would be MNAR. In this case, the missing income data is an effect of the income value itself. This will surely introduce bias: in general, income estimates will be lower because of the missing data.
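The bias that an MNAR mechanism introduces can be illustrated with a quick simulation; this is a sketch using made-up lognormal incomes, not data from any real study. If high earners are the ones who decline to report, the mean computed from the observed values will systematically underestimate the true mean:

```r
set.seed(42)

# simulate "true" incomes for 10,000 people
income <- rlnorm(10000, meanlog = 10.5, sdlog = 0.6)

# MNAR mechanism: the higher the income, the more likely it goes unreported
p.missing <- plogis((income - median(income)) / sd(income))
observed  <- ifelse(runif(10000) < p.missing, NA, income)

mean(income)                   # the true mean
mean(observed, na.rm = TRUE)   # noticeably lower -- deletion is biased here
```

Because the probability of missingness increases with the value itself, no amount of conditioning on other observed variables can remove this bias, which is what distinguishes MNAR from MAR.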
We can try to deal with missing data in one of two ways: we can ignore (delete) the observations with missing values, or we can fill in (impute) the missing values with educated guesses.
If our data is MCAR, then the first option is the better option because ignoring missing data only gives us a smaller sample with no bias. Unfortunately, data is rarely MCAR, and choosing the first option may introduce bias. Believe it or not, if the data is MAR or even MNAR, then it is better to make up some values than to simply ignore missing values. However, imputation itself can introduce its own biases, so sometimes deletion is the only viable fallback option.
If we are going to rely on ignoring missing values, then there are a number of different approaches to deletion that we can use. These include listwise and pairwise deletion.
The simplest and most common approach to handling missing data in R is listwise deletion, which is often the fallback method. Listwise deletion simply leaves out all individuals who do not have every relevant data element available. As stated earlier, this can introduce bias, and it may decrease the sample size significantly if a large number of individuals are each missing only a single data element.
Listwise deletion can be accomplished in R with the complete.cases function, as follows:
> library(mice)
> data(nhanes2)
> complete.cases(nhanes2)
 [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
[13]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
[25]  TRUE
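To actually perform the deletion, the logical vector returned by complete.cases can be used to subset the data frame; the base R shortcut na.omit does the same thing in one call. A minimal sketch with a small toy data frame (the idea is identical for nhanes2):

```r
# a small data frame with one incomplete row
df <- data.frame(weight   = c(2.3, 2.4, NA, 2.7),
                 location = c("europe", "europe", "US", "US"))

# keep only the rows with no missing values
complete <- df[complete.cases(df), ]
nrow(complete)                     # 3 -- the row containing NA is dropped

# na.omit() is an equivalent one-step shortcut
nrow(na.omit(df))                  # 3
```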
Pairwise deletion ignores missing data like listwise deletion does, but rather than outright excluding members who lack complete data on all variables, it makes use of whatever data is observed for each member. For example, many statistical procedures (such as regression methods) work from a covariance matrix, each element of which is the covariance of two of the variables. With pairwise deletion, all available data for each pair of variables is used to compute their covariance. This helps to preserve sample size, but it has other problems. It can introduce bias just as listwise deletion can, and because the sample used to compute each element can be slightly different, it can lead to problems such as covariance matrices that are not positive definite. Many R functions (or functions available in packages) offer a choice between listwise and pairwise deletion.
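In base R, the cov function's use argument selects between these two deletion strategies: "complete.obs" performs listwise deletion, while "pairwise.complete.obs" performs pairwise deletion. A small sketch with made-up data whose missing values fall on different rows:

```r
# two variables whose NAs occur in different rows
x <- c(1, 2, 3, NA, 5, 6)
y <- c(2, NA, 6, 8, 10, 12)
m <- cbind(x, y)

# listwise: only rows 1, 3, 5, and 6 (complete on BOTH variables) are used
cov(m, use = "complete.obs")

# pairwise: each entry uses every row available for that pair of variables,
# so the variances of x and y are each computed from 5 rows, not 4
cov(m, use = "pairwise.complete.obs")
```

Note that the diagonal of the pairwise matrix matches var(x, na.rm = TRUE) and var(y, na.rm = TRUE), while the listwise matrix generally does not, since it discards extra rows.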
When is it appropriate to use deletion as a method of dealing with missing data?
While deletion is usually not the preferred method, in some cases it is. If less than five percent of your data is missing, deletion is probably acceptable (some might even say 10 percent). Likewise, if more than half of your data is missing, you may just want to acknowledge the bias created by ignoring missing values and do a complete case analysis, so as to avoid analyzing a dataset that is mostly synthetic data.
Let's start by loading the nhanes2 dataset from the mice package and summarizing it.
This tells us that we certainly have missing values (denoted by NA), but what if we want to know a little more about the patterns of missingness? For example, what if we want to know what proportion of the sample has complete data, which combinations of variables tend to be missing together, and so on? The VIM package has convenient tools for this, such as an aggregation plot, as follows:
library(VIM)
aggr(nhanes2, numbers = TRUE, col = c('black', 'gray'))
The result is shown in the following plot:
The preceding plot gives the proportion of missingness in each variable on the left, and the combinations of missing values on the right. In this plot, black cells indicate observed data, and gray cells indicate missing data. The bottom row tells us that 52 percent of the sample has no missing data. The second-to-last row tells us that 28 percent of the sample has a recorded age (shown in black) but is missing the other three variables (shown in gray). The remaining 20 percent of the sample has the missingness patterns given in the rows above those. Thus, if we were to do a complete case analysis, we would lose 48 percent of our (already small) sample.
Multiple imputation is a method of creating new data to fill in missing values, the alternative to case deletion that we alluded to earlier. Here we will give a big-picture discussion of imputation before delving into how it can be done in R.
The alternative to ignoring missing data is to fill in the missing data with some educated guesses. As indicated previously, this is quite frankly making up data, a fact important to keep in mind. The success of imputation relies on our ability to make up data that informs the data analyst, rather than leading them astray. The general approach to imputation is to assume that there is some other information available in the data that can tell us what a missing value would have been if it were not missing.
The simplest imputation method is to fill in each missing value with a single replacement value, a method called single imputation. With the advent of fast personal computers and statistical advances over the past few decades, this has largely been replaced by multiple imputation, in which each missing value is replaced by multiple substitutes, effectively yielding multiple datasets. The problem with single imputation is that it replaces an unknown value with a single certain value. Of course, we don't actually know the value of the missing data element, so it is a bit unfair to pretend that we know it with certainty, and as such, single imputation will tend to understate the variability of the real data. Multiple imputation replaces the single unknown value with a distribution of plausible values, retaining the appropriate uncertainty, or variance, in the dataset. (At least this is how the theory goes.)
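The variance-shrinking effect of single imputation is easy to demonstrate with mean imputation, the crudest single-imputation scheme. This is a sketch on simulated data, shown to illustrate the problem rather than as a recommended practice:

```r
set.seed(1)

# a numeric variable with about 30% of its values missing completely at random
x <- rnorm(1000, mean = 50, sd = 10)
x.obs <- x
x.obs[sample(1000, 300)] <- NA

# single (mean) imputation: every NA becomes the mean of the observed values
x.imp <- ifelse(is.na(x.obs), mean(x.obs, na.rm = TRUE), x.obs)

sd(x)                     # true variability
sd(x.obs, na.rm = TRUE)   # close to the truth
sd(x.imp)                 # clearly smaller -- imputed values add no spread
```

The imputed values sit exactly at the mean, so they contribute nothing to the sum of squared deviations while inflating the sample size, which is precisely the understatement of variance that multiple imputation is designed to avoid.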
Imputation is an active area of statistical research in which new methods are constantly being developed and older ones improved upon. There are no universal rules for imputation, and the type of imputation chosen may depend on the type of analysis one wishes to do.
Broadly speaking, the two main approaches to imputation are the multivariate normal approach and the chained equations approach. We will first go over the Amelia package, which assumes multivariate normality, and then the mice package, which is based on chained equations and offers imputation methods designed for non-normal data.
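As a preview of where we are headed, the basic mice workflow (assuming the mice package is installed) generates several completed datasets in one call. The function names here are real mice functions; the settings shown (five imputations, an arbitrary seed) are just illustrative:

```r
library(mice)

data(nhanes2)

# generate five imputed datasets; printFlag = FALSE suppresses the iteration log
imp <- mice(nhanes2, m = 5, seed = 123, printFlag = FALSE)

# extract the first completed dataset -- it contains no NA values
filled <- complete(imp, 1)
sum(is.na(filled))   # 0
```

Each of the five completed datasets is analyzed separately and the results are then pooled, which is how multiple imputation propagates the uncertainty about the missing values into the final estimates.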