Missing data

One of the biggest problems in real-world data analysis is missing data. In carefully planned experiments on inanimate chemicals, small samples of rats, or highly mechanized factories, missing data may not be much of a problem. However, whenever a dataset gets large enough, or starts to involve humans, missing data is almost a certainty. Let's begin by pointing out that if you have missing data, then you have a missing data problem, and you have to do something about it; the question is what. The answer depends on what kind of bias results from the missing data.

Computational aspects of missing data in R

Before we delve into the statistical aspects of missing data, we need to review the computational ones. There are at least two different kinds of missing values in R: NA and NULL. NA is a missing value, but there are multiple types of NA, and R will automatically coerce a missing value to what it thinks is the appropriate type. For example, in the cleaned.pumpkins data frame, we would expect NA to be of numeric type, because the column was coerced to a numeric vector when we created it. Let's have a look at the following code:

> cleaned.pumpkins
  weight location
1    2.3   europe
2    2.4   europe
3    3.1       US
4    2.7       US
5     NA       US
> cleaned.pumpkins[5,1]
[1] NA
> typeof(cleaned.pumpkins[5,1])
[1] "double"

We can see here that R remembers that the NA value in the first column is still of the double data type. Let's compare this with a nonexistent value. Let's select the first row and third column (notice that the cleaned.pumpkins data frame has no third column), and let's assign it to a variable, as follows:

> b <- cleaned.pumpkins[1,3]
> b
NULL
> typeof(b)
[1] "NULL"

We can see here that b does not contain an NA but rather a NULL value, and this NULL value has no type other than NULL itself. R functions that are designed to handle missing data with imputation operate on NA values rather than NULL values.
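The distinction matters in practice. Here is a quick base R sketch (using made-up values in the spirit of the pumpkin weights above) showing how NA and NULL behave differently:

```r
# NA is a typed missing value that stays part of a vector;
# NULL is the absence of a value entirely
x <- c(2.3, 2.4, NA)

is.na(x)                 # FALSE FALSE TRUE -- NA is detectable per element
typeof(x[3])             # "double" -- NA inherits the vector's type

length(c(1, NULL, 2))    # 2 -- NULL simply vanishes when combined

mean(x)                  # NA -- most functions propagate NA by default
mean(x, na.rm = TRUE)    # 2.35 -- unless told to drop NA values
```

Note that na.rm = TRUE is exactly a tiny act of deletion: the function drops the missing elements before computing.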

Statistical considerations of missing data

We speak of a particular variable in a dataset being missing, and there can be three kinds of missingness in that variable, as follows:

  • Missing completely at random (MCAR): This missingness has nothing to do with any variable relevant to the entire dataset.
  • Missing at random (MAR): This does not mean what it sounds like. The missingness is not due to the particular variable, but may be due to other variables in the dataset.
  • Missing not at random (MNAR): The missingness is due to the variable itself.

Data that is MCAR has no bias. For example, let's say you are carrying a stack of papers that describe survey responses in your study and drop them, then the wind blows half of them away. Which half? It is completely random—we would expect the half that remains to resemble the half that blew away, so there is no bias introduced by the mechanism of missingness.

Data that is MAR will have bias. For example, let's say that we have a study of age and income, and that younger people simply ignore the survey in the mail. Then we have missing income data, and it is missing at random (a terrible name), because the income data is missing based on age and not on income, once age is controlled for. As you can imagine, this may introduce bias.

In our age and income study, if those with higher incomes opted not to report their incomes, then the data would be MNAR. In this case, the missing income data is an effect of the income value itself. This will surely introduce bias, because, in general, income estimates will be lower because of the missing data.

We can try to deal with missing data in one of two ways, as mentioned below:

  • Leave it out of our study and pretend it never existed (called deletion)
  • Make up some values that we think would be present if the data were not missing (called imputation)

If our data is MCAR, then the first option is better, because ignoring the missing data just gives us a smaller sample with no bias. Unfortunately, data is rarely MCAR, and choosing the first option may introduce bias. Believe it or not, if the data is MAR or even MNAR, it is better to make up some values than to simply ignore missing values. However, imputation itself can introduce its own biases, so sometimes deletion is the only viable fallback option.

Deletion methods

If we are going to rely on ignoring missing values, then there are a number of different approaches to deletion that we can use. These include listwise and pairwise deletion.

Listwise deletion or complete case analysis

The simplest, and probably the most common, approach to handling missing data is listwise deletion, and this is often the fallback method. Listwise deletion simply leaves out all individuals who do not have all relevant data elements available. As stated earlier, this can bias the data, and it may decrease the sample size significantly if a large number of individuals are each missing only a single data element.

Listwise deletion can be accomplished in R with the complete.cases command as follows:

> library(mice)
> data(nhanes2)
> complete.cases(nhanes2)
 [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
[13]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
[25]  TRUE
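To actually perform the listwise deletion, the logical vector returned by complete.cases can be used as a row index; a minimal sketch (assuming the mice package is installed):

```r
library(mice)   # provides the nhanes2 dataset
data(nhanes2)

# complete.cases returns TRUE for rows with no NA values;
# indexing with it keeps only the complete rows
complete.rows <- nhanes2[complete.cases(nhanes2), ]

nrow(nhanes2)        # 25 rows in the original data
nrow(complete.rows)  # 13 rows survive listwise deletion
```

The same result can be obtained with na.omit(nhanes2), which also records which rows were dropped in an attribute.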

Pairwise deletion

Pairwise deletion, like listwise deletion, ignores missing data, but rather than outright excluding members who lack complete data on all variables, it makes use of whatever data is observed for each member. For example, many statistical analyses (regression methods, for instance) have the computer build a covariance matrix, in which each element is the covariance of two of the variables. With pairwise deletion, all available data for each pair of variables is used to compute their covariance. This helps to preserve the sample size, but it has other problems. It may create bias, just as listwise deletion can, and because the sample used to compute each element can be slightly different, the resulting covariance matrix may not be positive definite. Many R functions (or functions available in packages) offer a choice of listwise or pairwise deletion.
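Base R's cov function exposes exactly this choice through its use argument; a sketch on the two numeric columns of nhanes2 (assuming the mice package is installed):

```r
library(mice)   # provides the nhanes2 dataset
data(nhanes2)

numeric.cols <- nhanes2[, c("bmi", "chl")]

# Listwise: only rows where both bmi and chl are observed
cov(numeric.cols, use = "complete.obs")

# Pairwise: each entry uses all rows where that particular pair is observed.
# With just two variables the off-diagonal entries match the listwise ones,
# but the variances on the diagonal differ, because pairwise deletion uses
# every observed value of each variable individually.
cov(numeric.cols, use = "pairwise.complete.obs")
```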

Tip

When is it appropriate to use deletion as a method of dealing with missing data?

While deletion is usually not the preferred method, in some cases it is. If less than five percent of your data is missing, this is probably an acceptable method (some might even say 10 percent). Likewise, if more than half of your data is missing, you may just want to acknowledge the bias created by ignoring missing values and do a complete case analysis so as to avoid analyzing a dataset that is mostly synthetic data.

Visualizing missing data

Let's start by loading the nhanes2 dataset from the mice package and summarizing the dataset.
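That step looks like the following (assuming the mice package is installed; summary reports an NA count at the bottom of each column that has missing values):

```r
library(mice)    # the mice package ships the nhanes2 dataset
data(nhanes2)
summary(nhanes2) # columns with missing values show an "NA's" count
```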

This tells us that we certainly have missing values (denoted by NA), but what if we want to know a little more about the patterns of missingness? For example, what if we want to know what proportion of the sample has complete data, what proportion is missing particular variables, and so on? The VIM package has convenient tools for this, such as an aggregation plot, as follows:

library(VIM)
aggr(nhanes2, numbers = TRUE, col = c('black', 'gray'))

The result is shown in the following plot:

[Plot: aggregation of missing data in nhanes2, produced by VIM's aggr function]

The preceding plot gives the proportion of missingness in each variable on the left, and the combinations of missing values on the right. In this plot, black cells indicate complete data, and gray cells indicate missing data. The bottom row tells us that 52 percent of the sample has no missing data. The second-to-last row tells us that 28 percent of the sample has a recorded age (as noted in black), but is missing the other three variables (as indicated in gray). The remaining 20 percent of the sample has one of the rarer missingness patterns shown in the rows above. Thus, if we were to do a complete case analysis, we would lose 48 percent of our (already small) sample.

Tip

The mi package contains the mp.plot function for visualizing missing data. The mice package contains the md.pattern function, which returns a table with the same information as the right panel of the plot produced by VIM's aggr function.
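A quick sketch of md.pattern (each row of the returned matrix is one missingness pattern, with the number of individuals showing that pattern in the left margin):

```r
library(mice)
data(nhanes2)

# 1 = observed, 0 = missing; the rightmost column counts
# how many variables are missing in each pattern
md.pattern(nhanes2)
```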

An overview of multiple imputation

Multiple imputation is a method of creating new data to fill in missing values, which we alluded to earlier as an alternative to case deletion. Here we will give a big-picture discussion of imputation before delving into how it can be done in R.

Imputation basic principles

The alternative to ignoring missing data is to fill in the missing data with some educated guesses. As indicated previously, this is quite frankly making up data, a fact important to keep in mind. The success of imputation relies on our ability to make up data that informs the data analyst, rather than leading them astray. The general approach to imputation is to assume that there is some other information available in the data that can tell us what a missing value would have been if it were not missing.

The simplest imputation method is to fill in each missing data value with a single replacement value, a method called single imputation. With the advent of fast personal computers and statistical advances over the past few decades, this has largely been replaced by multiple imputation, in which each missing value is replaced by multiple substitutes, effectively yielding multiple datasets. The problem with single imputation is that it replaces an unknown value with a single certain value. Of course, we don't actually know the value of the missing data element, so it is a bit unfair to pretend that we know it with certainty, and as such, single imputation will tend to understate the variability of the real data. Multiple imputation replaces each single unknown value with a distribution of plausible values, retaining the appropriate uncertainty, or variance, in the dataset. (At least this is how the theory goes.)
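To make the variance point concrete, here is a small base R sketch of single (mean) imputation on made-up values; note how the imputed vector's variance shrinks relative to the observed values:

```r
# Single (mean) imputation: every missing value gets the same replacement
x <- c(2.3, 2.4, 3.1, 2.7, NA)

x.imputed <- x
x.imputed[is.na(x.imputed)] <- mean(x, na.rm = TRUE)
x.imputed                # 2.300 2.400 3.100 2.700 2.625

# The cost: the imputed value sits exactly at the mean and contributes
# no spread, so the variance is artificially reduced
var(x, na.rm = TRUE)     # variance of the observed values
var(x.imputed)           # smaller -- the certainty is artificial
```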

Approaches to imputation

Imputation is an active area of statistical research in which new methods are constantly being developed and older methods improved upon. There are no universal rules for imputation, and the type of imputation chosen may depend on the type of analysis one wishes to do.

Broadly speaking, the two main approaches to imputation that have been used are a multivariate normally distributed approach and an imputation by chained equations approach. We will first go over the Amelia package, which assumes multivariate normality, and then the mice package based on chained equations with an imputation method designed for non-normal data.
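As a preview of the chained equations approach, a minimal mice run (assuming the package is installed; m = 5 imputed datasets, with a fixed seed for reproducibility) might look like this:

```r
library(mice)
data(nhanes2)

# Generate five completed datasets via chained equations
imp <- mice(nhanes2, m = 5, seed = 1, printFlag = FALSE)

# complete() extracts one filled-in dataset; the imputed values
# vary from copy to copy, which is what preserves the uncertainty
head(complete(imp, 1))
```

The typical workflow is to run the downstream analysis on each of the five datasets and then pool the results, which we will return to later.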
