As mentioned in the last chapter, analyzing data in the real world often requires some know-how outside of the typical introductory data analysis curriculum. For example, rarely do we get a neatly formatted, tidy dataset with no errors, junk, or missing values. Rather, we often get messy, unwieldy datasets.
What makes a dataset messy? Different people in different roles have different ideas about what constitutes messiness. Some regard any data that invalidates the assumptions of the parametric model as messy. Others see messiness in datasets with a grievously imbalanced number of observations in each category for a categorical variable. Some examples of things that I would consider messy are erroneous and junk values, inconsistently formatted entries, and missing data.
Since there are an infinite number of ways that data can be messy, there's simply no chance of enumerating every example and its respective solution. Instead, we are going to talk about two tools that help combat the bulk of the messiness issues that I cited just now.
Missing data is another one of those topics that are largely ignored in most introductory texts. Part of the reason is probably that many myths about analysis with missing data still abound, and some of the research into cutting-edge techniques is still relatively new. A more legitimate reason for its absence from introductory texts is that most of the more principled methodologies are fairly complicated, mathematically speaking. Nevertheless, problems related to missing data are so ubiquitous in real-life data analysis that the subject demands some broaching. This section serves as a gentle introduction to the subject and to one of the more effective techniques for dealing with it.
A common refrain on the subject is that the best way to deal with missing data is not to have any. It's true that missing data is a messy subject, and there are a lot of ways to handle it incorrectly. It's important not to take this advice to the extreme, though. In order to bypass missing data problems, some have, for example, barred survey participants from continuing unless they answer every question on a form. You could also coerce the participants of a longitudinal study into not dropping out. Don't do this. Not only is it unethical, it is also prodigiously counterproductive; there are treatments for missing data, but there are no treatments for bad data.
The standard treatment to the problem of missing data is to replace the missing data with non-missing values. This process is called imputation. In most cases, the goal of imputation is not to recreate the lost completed dataset but to allow valid statistical estimates or inferences to be drawn from incomplete data. Because of this, the effectiveness of different imputation techniques can't be evaluated by their ability to most accurately recreate the data from a simulated missing dataset; they must, instead, be judged by their ability to support the same statistical inferences as would be drawn from the analysis on the complete data. In this way, filling in the missing data is only a step towards the real goal—the analysis. The imputed dataset is rarely considered the final goal of imputation.
There are many different ways that missing data is dealt with in practice—some are good, some are not so good. Some are okay under certain circumstances, but not okay in others. Some involve missing data deletion, while some involve imputation. We will briefly touch on some of the more common methods. The ultimate goal of this chapter, though, is to get you started on what is often described as the gold-standard of imputation techniques: multiple imputation.
In order to demonstrate how to visualize patterns of missing data, we first have to create some missing data. This will also be the same dataset that we perform an analysis on later in the chapter. To showcase how to use multiple imputation in a semi-realistic scenario, we are going to create a version of the mtcars dataset with a few missing values:
Okay, let's set the seed (for deterministic randomness), and create a variable to hold our new marred dataset.
set.seed(2)
miss_mtcars <- mtcars
First, we are going to create seven missing values in drat (about 20 percent), five missing values in the mpg column (about 15 percent), five missing values in the cyl column, three missing values in wt (about 10 percent), and three missing values in vs:
some_rows <- sample(1:nrow(miss_mtcars), 7)
miss_mtcars$drat[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$mpg[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 5)
miss_mtcars$cyl[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$wt[some_rows] <- NA

some_rows <- sample(1:nrow(miss_mtcars), 3)
miss_mtcars$vs[some_rows] <- NA
Now, we are going to create four missing values in qsec, but only for automatic cars:
only_automatic <- which(miss_mtcars$am==0)
some_rows <- sample(only_automatic, 4)
miss_mtcars$qsec[some_rows] <- NA
Now, let's take a look at the dataset:
> miss_mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85    NA 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110   NA 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175   NA 3.440 17.02  0  0    3    2
Valiant             18.1  NA 225.0 105   NA 3.460    NA  1  0    3    1
Great, now let's visualize the missingness.
The first way we are going to visualize the pattern of missing data is by using the md.pattern function from the mice package (which is also the package that we are ultimately going to use for imputing our missing data). If you don't have the package already, install it.
> library(mice)
> md.pattern(miss_mtcars)
   disp hp am gear carb wt vs qsec mpg cyl drat
12    1  1  1    1    1  1  1    1   1   1    1  0
 4    1  1  1    1    1  1  1    1   0   1    1  1
 2    1  1  1    1    1  1  1    1   1   0    1  1
 3    1  1  1    1    1  1  1    1   1   1    0  1
 3    1  1  1    1    1  0  1    1   1   1    1  1
 2    1  1  1    1    1  1  1    0   1   1    1  1
 1    1  1  1    1    1  1  1    1   0   1    0  2
 1    1  1  1    1    1  1  1    0   1   0    1  2
 1    1  1  1    1    1  1  0    1   1   0    1  2
 2    1  1  1    1    1  1  0    1   1   1    0  2
 1    1  1  1    1    1  1  1    0   1   0    0  3
      0  0  0    0    0  3  3    4   5   5    7 27
A row-wise missing data pattern refers to the set of columns that are missing for a given row. The md.pattern function aggregates the rows and counts how many rows share each missing data pattern. It outputs a binary (0 and 1) matrix: cells with a 1 represent non-missing data, and 0s represent missing data. Since the rows are sorted in order of increasing missingness, the first row always refers to the missing data pattern containing the least amount of missing data.
In this case, the missing data pattern with the least amount of missing data is the pattern containing no missing data at all. Because of this, the first row has all 1s in the columns that are named after the columns of the miss_mtcars dataset. The left-most column is a count of the number of rows that display each missing data pattern, and the right-most column is a count of the number of missing data points in that pattern. The last row contains a count of the number of missing data points in each column.
As you can see, 12 of the rows contain no missing data. The next most common missing data pattern is the one missing only mpg; four rows fit this pattern. There are only six rows that contain more than one missing value, and only one of these rows contains more than two missing values (as shown in the second-to-last row).
As far as datasets with missing data go, this particular one doesn't contain much. It is not uncommon for some datasets to have more than 30 percent of their data missing; this dataset is missing less than 8 percent of its values.
Now let's visualize the missing data patterns graphically using the VIM package. You will probably have to install this, too.
library(VIM)
aggr(miss_mtcars, numbers=TRUE)
At a glance, this representation shows us, effortlessly, that the drat column accounts for the highest proportion of missingness, column-wise, followed by mpg, cyl, qsec, vs, and wt. The graphic on the right shows us information similar to that of the output of md.pattern. This representation, though, makes it easier to tell if there is some systematic pattern of missingness. The blue cells represent non-missing data, and the red cells represent missing data. The numbers on the right of the graphic represent the proportion of rows displaying that missing data pattern; 37.5 percent of the rows contain no missing data whatsoever.
The VIM package allowed us to visualize the missing data patterns. A related term, the missing data mechanism, describes the process that determines each data point's likelihood of being missing. There are three main categories of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Distinguishing between these mechanisms is crucial, since the mechanism informs our options for handling the missingness.
The first mechanism, MCAR, occurs when the data's missingness is unrelated to the data. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor's survey forms off into the horizon. The mechanism that governs the missingness of drat, mpg, cyl, wt, and vs is MCAR, because we randomly selected the elements that went missing. This mechanism, while being the easiest to work with, is seldom tenable in practice.
MNAR, on the other hand, occurs when a variable's missingness is related to the variable itself. For example, suppose the scale that weighed each car had a capacity of only 3,700 pounds, and because of this, the eight cars that weighed more than that were recorded as NA. This is a classic example of the MNAR mechanism: it is the weight of the observation itself that is the cause of its being missing. Another example: suppose that during the course of a trial of an anti-depressant drug, participants who were not being helped by the drug became too depressed to continue with the trial. At the end of the trial, when every participant's level of depression is assessed and recorded, there would be missing values for participants whose reason for absence is related to their level of depression.
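To make this concrete, here is a minimal sketch (not part of the original walkthrough) of how that hypothetical scale-capacity scenario could be simulated in R; the mnar_mtcars name is just an illustrative choice and is not used elsewhere in the chapter:

# hypothetical MNAR mechanism: any car heavier than 3,700 pounds
# (wt > 3.7, since wt is in units of a thousand pounds) could not be
# weighed, so its weight is recorded as NA
mnar_mtcars <- mtcars
mnar_mtcars$wt[mnar_mtcars$wt > 3.7] <- NA

# the missingness of wt now depends on the value of wt itself
sum(is.na(mnar_mtcars$wt))   # the eight heaviest cars are now missing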
The last mechanism, missing at random, is somewhat unfortunately named. Contrary to what it may sound like, it means that there is a systematic relationship between the missingness of an outcome variable and other observed variables, but not the outcome variable itself. This is probably best explained by the following example.
Suppose that in a survey, there is a question about income level that, in its wording, uses a particular colloquialism. Due to this, a large number of the participants in the survey whose native language is not English couldn't interpret the question, and left it blank. If the survey collected just the name, gender, and income, the missing data mechanism of the income question would be MNAR. If, however, the questionnaire included a question that asked whether the participant spoke English as a first language, then the mechanism would be MAR. The inclusion of the Is English your first language? variable means that the missingness of the income question can be completely accounted for. The reason for the moniker missing at random is that once you control for the observed variable(s) related to the missingness (here, Is English your first language?), the data are missing at random.
As another example, there is a systematic relationship between the am and qsec variables in our simulated missing dataset: qsec values were missing only for automatic cars. But within the group of automatic cars, the qsec variable is missing at random. Therefore, qsec's mechanism is MAR; controlling for transmission type, qsec is missing at random. Bear in mind, though, that if we removed am from our simulated dataset, qsec would become MNAR.
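As a quick, illustrative check of this relationship in our simulated data (a small snippet added for clarity, not part of the original walkthrough), we can cross-tabulate the missingness of qsec against the transmission type; every missing qsec should fall in the am == 0 (automatic) group:

# cross-tabulate missingness of qsec against transmission type
table(missing_qsec = is.na(miss_mtcars$qsec), am = miss_mtcars$am)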
As mentioned earlier, MCAR is the easiest type to work with because of the complete absence of a systematic relationship in the data's missingness. Many unsophisticated techniques for handling missing data rest on the assumption that the data are MCAR. On the other hand, MNAR data is the hardest to work with, since the properties of the missing data that caused its missingness have to be understood quantitatively and included in the imputation model. Though multiple imputation can handle MNAR mechanisms, the procedures involved become more complicated and are far beyond the scope of this text. The MCAR and MAR mechanisms allow us not to worry about the properties and parameters of the missing data. For this reason, you may sometimes find MCAR or MAR missingness referred to as ignorable missingness.
MAR data is not as hard to work with as MNAR data, but it is not as forgiving as MCAR. For this reason, though our simulated dataset contains both MCAR and MAR components, the mechanism that describes the whole dataset is MAR; just one MAR mechanism makes the whole dataset MAR.
You may have noticed that a particular dataset's place in the missing data mechanism taxonomy depends on the variables that the dataset includes. For example, we know that the mechanism behind qsec is MAR, but if the dataset did not include am, it would be MNAR. Since we are the ones who created the data, we know the procedure that resulted in qsec's missing values. If we weren't the ones creating the data, as happens in the real world, and the dataset did not contain the am column, we would just see a bunch of arbitrarily missing qsec values. This might lead us to believe that the data are MCAR. They aren't, though; just because the variable to which another variable's missingness is systematically related is unobserved doesn't mean that it doesn't exist.
This raises a critical question: can we ever be sure that our data is not MNAR? The unfortunate answer is no. Since the data that we need to prove or disprove MNAR is ipso facto missing, the MNAR assumption can never be conclusively disconfirmed. It's our job, as critically thinking data analysts, to ask whether there is likely an MNAR mechanism or not.
Here we are going to look at various methods for dealing with missing data.
Complete case analysis, also called list-wise deletion, is a straightforward procedure that simply removes all rows or elements containing missing values prior to the analysis. In the univariate case—taking the mean of the drat column, for example—all elements of drat that are missing would simply be removed:
> mean(miss_mtcars$drat)
[1] NA
> mean(miss_mtcars$drat, na.rm=TRUE)
[1] 3.63
In a multivariate procedure—for example, linear regression predicting mpg from am, wt, and qsec—all rows that have a missing value in any of the columns included in the regression are removed:
listwise_model <- lm(mpg ~ am + wt + qsec,
                     data=miss_mtcars,
                     na.action = na.omit)

## OR

# complete.cases returns a boolean vector
comp <- complete.cases(cbind(miss_mtcars$mpg,
                             miss_mtcars$am,
                             miss_mtcars$wt,
                             miss_mtcars$qsec))
comp_mtcars <- mtcars[comp,]
listwise_model <- lm(mpg ~ am + wt + qsec, data=comp_mtcars)
Under an MCAR mechanism, a complete case analysis produces unbiased estimates of the mean, variance/standard deviation, and regression coefficients, which means that the estimates don't systematically differ from the true values on average, because the retained data elements are just a random sample of the recorded data elements. However, inference-wise, since we lost a number of our samples, we are going to lose statistical power and generate standard errors and confidence intervals that are bigger than they need to be. Additionally, in the multivariate regression case, note that our sample size depends on the variables that we include in the regression; the more variables we include, the more missing data we expose ourselves to, and the more rows we are liable to lose. This makes comparing results across different models slightly hairy.
Under an MAR or MNAR mechanism, list-wise deletion will produce biased estimates of the mean and variance. For example, if am were highly correlated with qsec, the fact that we are missing qsec only for automatic cars would significantly shift our estimate of the mean of qsec. Surprisingly, list-wise deletion produces unbiased estimates of the regression coefficients even if the data are MAR or MNAR, as long as the relevant variables are included in the regression equation. For this reason, if there are relatively few missing values in a dataset that is to be used in regression analysis, list-wise deletion could be an acceptable alternative to more principled approaches.
Pairwise deletion, also called available-case analysis, is a technique that is (somewhat unfortunately) common when estimating covariance or correlation matrices. For each pair of variables, it uses only the cases that are non-missing for both. This often means that the number of elements used will vary from cell to cell of the covariance/correlation matrix. This can result in absurd correlation coefficients that are above 1, making the resulting matrices largely useless to methodologies that depend on them.
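To make the idea concrete, here is a small sketch of pairwise deletion using base R's cor function (an illustrative snippet, not code from the original walkthrough):

# correlation matrix using pairwise deletion: each pairwise correlation
# is computed from only the rows in which both variables are non-missing,
# so different cells of the matrix are based on different numbers of rows
cor(miss_mtcars[, c("mpg", "wt", "qsec", "drat")],
    use="pairwise.complete.obs")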
Mean substitution, as the name suggests, replaces all the missing values with the mean of the available cases. For example:
mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec, na.rm=TRUE)
# etc...
Although this seemingly solves the problem of the loss of sample size in the list-wise deletion procedure, mean substitution has some very unsavory properties of its own. While mean substitution produces unbiased estimates of the mean of a column, it produces biased estimates of the variance, since it removes the natural variability that would have occurred in the missing values had they not been missing. The variance estimates from mean substitution will therefore be systematically too small. Additionally, it's not hard to see that mean substitution will result in biased estimates if the data are MAR or MNAR. For these reasons, mean substitution is not recommended under virtually any circumstance.
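As a quick illustrative check of that variance shrinkage (again, a snippet added for clarity rather than taken from the original text), compare the standard deviation of qsec before and after mean substitution:

# the standard deviation after mean substitution is systematically smaller
# than the standard deviation computed from the available cases
sd(miss_mtcars$qsec, na.rm=TRUE)   # available-case standard deviation
sd(mean_sub$qsec)                  # smaller after mean substitution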
Hot deck imputation is an intuitively elegant approach that fills in the missing data with donor values from other rows in the dataset. In its least sophisticated formulation, a randomly chosen non-missing element from the same dataset is used to fill in a missing value. In more sophisticated hot deck approaches, the donor value comes from a row that is similar to the row with the missing data. The multiple imputation technique that we will be using in a later section of this chapter borrows this idea for one of its imputation methods.
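Here is a minimal sketch of the least sophisticated, purely random hot deck approach, filling in only qsec; the hot_deck name and the seed are illustrative choices, not something used later in the chapter:

set.seed(4)   # arbitrary seed, just to make this illustration reproducible
hot_deck <- miss_mtcars

# indices of the missing qsec values and the pool of observed donor values
missing_idx   <- which(is.na(hot_deck$qsec))
observed_vals <- hot_deck$qsec[!is.na(hot_deck$qsec)]

# fill each missing value with a randomly sampled observed (donor) value
hot_deck$qsec[missing_idx] <- sample(observed_vals, length(missing_idx),
                                     replace=TRUE)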
Regression imputation attempts to fill in the missing data in a column using regression to predict likely values of the missing elements, using other columns as predictors. For example, using regression imputation on the drat column would employ a linear regression predicting drat from all the other columns in miss_mtcars. The process would be repeated for all columns containing missing data, until the dataset is complete.
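As a rough sketch of the idea (using only the complete columns disp and hp as predictors so that the example stays short and runnable; this is not the exact procedure described above), regression imputation for drat might look like this:

# fit a linear model predicting drat from two complete columns, using only
# the rows in which drat is observed (lm drops the NA rows by default)
drat_fit <- lm(drat ~ disp + hp, data=miss_mtcars)

# predict the missing drat values and plug the predictions in
reg_imp <- miss_mtcars
missing_drat <- is.na(reg_imp$drat)
reg_imp$drat[missing_drat] <- predict(drat_fit,
                                      newdata=miss_mtcars[missing_drat, ])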
This procedure is intuitively appealing, because it integrates knowledge of the other variables and patterns of the dataset. This creates a set of more informed imputations. As a result, it produces unbiased estimates of the mean and regression coefficients under MCAR and MAR (so long as the relevant variables are included in the regression model).
However, this approach is not without its problems. The predicted values of the missing data lie right on the regression line but, as we know, very few real data points lie right on the regression line—there is usually a normally distributed residual (error) term. Because of this, regression imputation underestimates the variability of the missing values, and therefore yields biased estimates of the variance and the covariance between different columns. However, we're on the right track.
As far as unsophisticated approaches go, stochastic regression is fairly evolved. This approach solves some of the issues of regression imputation, and produces unbiased estimates of the mean, variance, covariance, and regression coefficients under MCAR and MAR. It does this by adding a random (stochastic) value to the predictions of regression imputation. This random added value is sampled from the residual (error) distribution of the linear regression—which, if you remember, is assumed to be a normal distribution. This restores the variability in the missing values (that we lost in regression imputation) that those values would have had if they weren't missing.
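Continuing the hypothetical drat sketch from the previous section (and reusing its drat_fit model; again, an illustration rather than the chapter's actual method), stochastic regression imputation adds normally distributed noise to each prediction:

set.seed(5)   # arbitrary seed, just to make this illustration reproducible
stoch_imp <- miss_mtcars
missing_drat <- is.na(stoch_imp$drat)

# predictions from the same regression model as before...
preds <- predict(drat_fit, newdata=miss_mtcars[missing_drat, ])

# ...plus random draws from the model's residual (error) distribution
noise <- rnorm(sum(missing_drat), mean=0, sd=summary(drat_fit)$sigma)
stoch_imp$drat[missing_drat] <- preds + noise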
However, as far as subsequent analysis and inference on the imputed dataset goes, stochastic regression results in standard errors and confidence intervals that are smaller than they should be. Since it produces only one imputed dataset, it does not capture the extent to which we are uncertain about the residuals and our coefficient estimates. Nevertheless, stochastic regression forms the basis of still more sophisticated imputation methods.
There are two sophisticated, well-founded, and recommended methods of dealing with missing data. One is called the Expectation Maximization (EM) method, which we do not cover here. The second is called Multiple Imputation, and because it is widely considered the most effective method, it is the one we explore in this chapter.
The big idea behind multiple imputation is that instead of generating one set of imputed data with our best estimation of the missing data, we generate multiple versions of the imputed data where the imputed values are drawn from a distribution. The uncertainty about what the imputed values should be is reflected in the variation between the multiply imputed datasets.
We perform our intended analysis separately on each of these m completed datasets. These analyses will then yield m different parameter estimates (such as regression coefficients, and so on). The critical point is that these parameter estimates differ solely because of the variability in the imputed missing values, and hence reflect our uncertainty about what the imputed values should be. This is how multiple imputation incorporates uncertainty and outperforms more limited imputation methods that produce one imputed dataset, conferring an unwarranted sense of confidence in the filled-in data during our analysis. The following diagram illustrates this idea:
Let's focus on the univariate case—where only one column contains missing data and we use all the other (completed) columns to impute the missing values—before generalizing to a multivariate case.
mice actually has a few different imputation methods up its sleeve, each best suited for a particular use case. mice will often choose sensible defaults based on the data type (continuous, binary, non-binary categorical, and so on).
The most important method is what the package calls the norm method. This method is very much like stochastic regression. Each of the m imputations is created by adding a normal "noise" term to the output of a linear regression predicting the missing variable. What makes this slightly different from just repeating stochastic regression m times is that the norm method also integrates uncertainty about the regression coefficients used in the predictive linear model.
Recall that the regression coefficients in a linear regression are just estimates of the population coefficients from a random sample (that's why each regression coefficient has a standard error and confidence interval). Another sample from the population would have yielded slightly different coefficient estimates. If through all our imputations, we just added a normal residual term from a linear regression equation with the same coefficients, we would be systematically understating our uncertainty regarding what the imputed values should be.
To combat this, in multiple imputation, each imputation of the data contains two steps. The first step performs stochastic linear regression imputation using coefficients for each predictor estimated from the data. The second step chooses slightly different estimates of these regression coefficients, and proceeds to the next imputation. The first step of the next imputation uses the slightly different coefficient estimates to perform stochastic linear regression imputation again. After that, in the second step of the second imputation, still other coefficient estimates are generated to be used in the third imputation. This cycle goes on until we have m multiply imputed datasets.
How do we choose these different coefficient estimates at the second step of each imputation? Traditionally, the approach is Bayesian in nature; the new coefficients are drawn from each coefficient's posterior distribution, which describes credible values of the estimate given the observed data and uninformative priors. This is the approach that norm uses. There is an alternate method that chooses the new coefficient estimates from a sampling distribution created by taking repeated samples of the data (with replacement) and estimating the regression coefficients of each of these samples. mice calls this method norm.boot.
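If you want to request one of these methods explicitly rather than relying on the defaults, the mice function's method argument accepts a single method name that is applied to every incomplete column; the calls below are illustrative only and aren't used later in the chapter:

library(mice)

# Bayesian linear regression ("norm") for every column with missing data...
imp_norm <- mice(miss_mtcars, method="norm", seed=3, printFlag=FALSE)

# ...or the bootstrap variant ("norm.boot")
imp_boot <- mice(miss_mtcars, method="norm.boot", seed=3, printFlag=FALSE)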
The multivariate case is a little more hairy, since the imputation for one column depends on the other columns, which may contain missing data of their own.
For this reason, we make several passes over all the columns that need imputing, until the imputations of the missing data in a particular column are informed by increasingly refined estimates of the missing data in the predictor columns. These passes over all the columns are called iterations.
So that you really understand how this iteration works, let's say we are performing multiple imputation on a subset of miss_mtcars containing only mpg, wt, and drat. First, all the missing data in all the columns are set to a placeholder value, such as the mean or a randomly sampled non-missing value from the column. Then, we visit mpg, where the placeholder values are turned back into missing values. These missing values are predicted using the two-step procedure described in the univariate case. Then we move on to wt; its placeholder values are turned back into missing values, whose new values are imputed with the two-step univariate procedure using mpg and drat as predictors. Then this is repeated for drat. This is one iteration. On the next iteration, it is not the placeholder values that get turned back into missing values and imputed but the imputed values from the previous iteration. As this repeats, we shift away from the starting values and the imputed values begin to stabilize. This usually happens within just a few iterations. The dataset at the completion of the last iteration is the first multiply imputed dataset. Each of the m multiply imputed datasets starts this iteration process all over again.
The default in mice is five iterations. Of course, you can increase this number if you have reason to believe that you need to. We'll discuss how to tell if this is necessary later in the section.
The method of imputation that we described for the univariate case, norm, works best for imputed values that follow an unconstrained normal distribution, but it could lead to some nonsensical imputations otherwise. For example, since the weights in wt are so close to 0 (because wt is in units of a thousand pounds), it is possible for the norm method to impute a negative weight. Though this will no doubt balance out over the other m-1 multiply imputed datasets, we can combat this situation by using another method of imputation called predictive mean matching.
Predictive mean matching (mice calls this pmm) works a lot like norm. The difference is that the norm imputation is then used to find the d closest values to the imputed value among the non-missing data in the column. Then, one of these d values is chosen as the final imputed value; d=3 is the default in mice.
This method has a few great properties. For one, the possibility of imputing a negative value for wt is categorically off the table; at the low end, the imputed value would have to be chosen from the set {1.513, 1.615, 1.835}, since these are the three lowest observed weights. More generally, any natural constraint in the data (lower or upper bounds, integer count data, numbers rounded to the nearest one-half, and so on) is respected with predictive mean matching, because the imputed values all appear among the actual non-missing observations. In this way, predictive mean matching is like hot deck imputation. Predictive mean matching is the default imputation method in mice for numerical data, though it may be inferior to norm for small datasets and/or datasets with a lot of missing values.
Many of the other imputation methods in mice are specially suited to one particular data type. For example, binary categorical variables use logreg by default; this is like norm but uses logistic regression to impute a binary outcome. Similarly, non-binary categorical data uses multinomial regression by default; mice calls this method polyreg.
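The method argument also accepts a vector of method names, one per column, with empty strings for the complete columns. A hypothetical example for miss_mtcars, assuming cyl and vs have already been converted to factors as we do shortly, might look like this:

# one method per column of miss_mtcars, in column order:
# mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
my_methods <- c("pmm", "polyreg", "", "", "pmm",
                "pmm", "pmm", "logreg", "", "", "")
imp_explicit <- mice(miss_mtcars, method=my_methods, seed=3, printFlag=FALSE)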
There are a few steps to follow and decisions to make when using this powerful imputation technique. One of them is to make sure that any categorical variables are encoded as factors; otherwise, mice will not be able to tell that the variable is supposed to be categorical.
Before we get down to it, let's call the mice function on our data frame with missing data, and use its default arguments, just to see what we shouldn't do and why:
# we are going to set the seed and printFlag to FALSE, but
# everything else will be the default arguments
imp <- mice(miss_mtcars, seed=3, printFlag=FALSE)
print(imp)
------------------------------
Multiply imputed data set
Call:
mice(data = miss_mtcars, printFlag = FALSE, seed = 3)
Number of multiple imputations:  5
Missing cells per column:
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
   5    5    0    0    7    3    4    3    0    0    0
Imputation methods:
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
"pmm" "pmm"    ""    "" "pmm" "pmm" "pmm" "pmm"    ""    ""    ""
VisitSequence:
 mpg  cyl drat   wt qsec   vs
   1    2    5    6    7    8
PredictorMatrix:
     mpg cyl disp hp drat wt qsec vs am gear carb
mpg    0   1    1  1    1  1    1  1  1    1    1
cyl    1   0    1  1    1  1    1  1  1    1    1
disp   0   0    0  0    0  0    0  0  0    0    0
...
Random generator seed value:  3
The first thing we notice (on line four of the output) is that mice chose to create five multiply imputed datasets by default. As we discussed, this isn't a bad default, but more imputations can only improve our statistical power (if only marginally); when we impute this dataset in earnest, we will use m=20.
The second thing we notice (on lines 8-10 of the output) is that it used predictive mean matching as the imputation method for all the columns with missing data. If you recall, predictive mean matching is the default imputation method for numeric columns. However, vs and cyl are binary categorical and non-binary categorical variables, respectively. Because we didn't convert them to factors, mice thinks these are just regular numeric columns. We'll have to fix this.
The last thing we should notice here is the predictor matrix (starting on line 14). Each row and column of the predictor matrix refers to a column in the dataset to impute. If a cell contains a 1, it means that the variable referred to in the column is used as a predictor for the variable in the row. The first row indicates that all available attributes are used to help predict mpg, with the exception of mpg itself. All the values on the diagonal are 0, because mice won't use an attribute to predict itself. Note that the disp, hp, am, gear, and carb rows contain all 0s; this is because these variables are complete and don't need to use any predictors.
Since we thought carefully about whether there were any attributes that should be removed before we perform the imputation, we can use mice's default predictor matrix for this dataset. If there were any non-predictive attributes (unique identifiers, redundant variables, and so on), we would have either had to remove them (the easiest option) or instruct mice not to use them as predictors (slightly harder), as sketched below.
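As a sketch of the latter option (not something this dataset needs), you can grab mice's default predictor matrix from a zero-iteration run, zero out the column for the offending variable, and pass the modified matrix back in; here carb merely stands in for a hypothetical non-predictive attribute:

# run mice with maxit=0 just to obtain the default predictor matrix
ini  <- mice(miss_mtcars, maxit=0, printFlag=FALSE)
pred <- ini$predictorMatrix

# hypothetically exclude carb from being used as a predictor anywhere
pred[, "carb"] <- 0

# pass the customized predictor matrix to the real imputation run
imp_custom <- mice(miss_mtcars, predictorMatrix=pred, seed=3,
                   printFlag=FALSE)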
Let's now correct the issues that we've discussed.
# convert categorical variables into factors
miss_mtcars$vs <- factor(miss_mtcars$vs)
miss_mtcars$cyl <- factor(miss_mtcars$cyl)

imp <- mice(miss_mtcars, m=20, seed=3, printFlag=FALSE)
imp$method
-------------------------------------
      mpg       cyl      disp        hp      drat
    "pmm" "polyreg"        ""        ""     "pmm"
       wt      qsec        vs        am      gear
    "pmm"     "pmm"  "logreg"        ""        ""
     carb
       ""
Now mice has corrected the imputation methods of cyl and vs to their correct defaults. In truth, cyl is a kind of discrete numeric variable called an ordinal variable, which means that yet another imputation method may be optimal for that attribute, but, for the sake of simplicity, we'll treat it as a categorical variable.
Before we get to use the imputations in an analysis, we have to check the output. The first thing we need to check is the convergence of the iterations. Recall that for imputing data with missing values in multiple columns, multiple imputation requires iteration over all these columns a few times. At each iteration, mice produces imputations—and samples new parameter estimates from the parameters' posterior distributions—for all columns that need to be imputed. The final imputations, for each multiply imputed dataset m, are the imputed values from the final iteration.
In contrast to when we used MCMC in Chapter 7, Bayesian Methods, convergence in mice is much faster; it usually occurs in just a few iterations. However, as in Chapter 7, Bayesian Methods, visually checking for convergence is highly recommended. We even check for it similarly: when we call the plot function on the object that we assigned the mice output to, it displays trace plots of the mean and standard deviation of all the variables involved in the imputations. Each line in each plot is one of the m imputations.
plot(imp)
As you can see from the preceding trace plots of imp, there are no clear trends and the lines for the different imputations all overlap from one iteration to the next. Put another way, the variance within a chain (there are m chains) should be about equal to the variance between the chains. This indicates that convergence was achieved.
If convergence was not achieved, you can increase the number of iterations that mice employs by explicitly specifying the maxit parameter in the call to the mice function.
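For example, a call along the following lines (the values here are purely illustrative; our trace plots gave us no reason to need this) would raise the number of iterations from the default of 5 to 20:

# re-run the imputation with more iterations, then re-inspect the trace plots
imp_more_iter <- mice(miss_mtcars, m=20, maxit=20, seed=3, printFlag=FALSE)
plot(imp_more_iter)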
To see an example of non-convergence, take a look at Figures 7 and 8 in the paper that describes this package, written by the package's authors. It is available at http://www.jstatsoft.org/article/view/v045i03.
The next step is to make sure the imputed values are reasonable. In general, whenever we quickly review the results of something to see if they make sense, it is called a sanity test or sanity check. With the following line, we're going to display the imputed values for the five missing mpg values in the first six imputations:
imp$imp$mpg[,1:6]
------------------------------------
                      1    2    3    4    5    6
Duster 360         19.2 16.4 17.3 15.5 15.0 19.2
Cadillac Fleetwood 15.2 13.3 15.0 13.3 10.4 17.3
Chrysler Imperial  10.4 15.0 15.0 16.4 10.4 10.4
Porsche 914-2      27.3 22.8 21.4 22.8 21.4 15.5
Ferrari Dino       19.2 21.4 19.2 15.2 18.1 19.2
These sure look reasonable. A better method for sanity checking is to call densityplot on the object that we assigned the mice output to:
densityplot(imp)
This displays, for every imputed attribute, a density plot of the actual non-missing values (the thick line) and the imputed values (the thin lines). We are looking to see that the distributions are similar. Note that the density curves of the imputed values extend much higher than the observed values' density curve in this case. This is partly because we imputed so few values that there weren't enough data points to properly smooth the density approximation. Height and non-smoothness notwithstanding, these density plots indicate no outlandish behavior among the imputed values.
We are now ready for the analysis phase. We are going to perform linear regression on each imputed dataset, modeling mpg as a function of am, wt, and qsec. Instead of repeating the analysis on each dataset manually, we can apply an expression to all the datasets at one time with the with function, as follows:
imp_models <- with(imp, lm(mpg ~ am + wt + qsec))
We can take a peek at the estimated coefficients from each dataset by using lapply on the analyses attribute of the returned object:
lapply(imp_models$analyses, coef)
---------------------------------
[[1]]
(Intercept)          am          wt        qsec
 18.1534095   2.0284014  -4.4054825   0.8637856

[[2]]
(Intercept)          am          wt        qsec
   8.375455    3.336896   -3.520882    1.219775

[[3]]
(Intercept)          am          wt        qsec
   5.254578    3.277198   -3.233096    1.337469

.........
Finally, let's pool the results of the analyses (with the pool function), and call summary on it:
pooled_model <- pool(imp_models)
summary(pooled_model)
----------------------------------
                   est        se         t       df    Pr(>|t|)
(Intercept)   7.049781 9.2254581  0.764166 17.63319 0.454873254
am            3.182049 1.7445444  1.824000 21.36600 0.082171407
wt           -3.413534 0.9983207 -3.419276 14.99816 0.003804876
qsec          1.270712 0.3660131  3.471765 19.93296 0.002416595
                  lo 95     hi 95 nmis       fmi    lambda
(Intercept) -12.3611281 26.460690   NA 0.3459197 0.2757138
am           -0.4421495  6.806247    0 0.2290359 0.1600952
wt           -5.5414268 -1.285641    3 0.4324828 0.3615349
qsec          0.5070570  2.034366    4 0.2736026 0.2042003
Though we could have performed the pooling ourselves using the equations that Donald Rubin outlined in his 1987 classic Multiple Imputation for Nonresponse in Surveys, it is less of a hassle and less error-prone to have the pool function do it for us. Readers who are interested in the pooling rules are encouraged to consult the aforementioned text.
As you can see, for each parameter, pool has combined the coefficient estimate and standard errors, and calculated the appropriate degrees of freedom. These allow us to t-test each coefficient against the null hypothesis that the coefficient is equal to 0, produce p-values for the t-test, and construct confidence intervals.
The standard errors and confidence intervals are wider than those that would have resulted from linear regression on a single imputed dataset, but that's because it is appropriately taking into account our uncertainty regarding what the missing values would have been.
There are, at the present time, a limited number of analyses that can be automatically pooled by mice, the most important being lm/glm. If you recall, though, the generalized linear model is extremely flexible, and it can be used to express a wide array of different analyses. By extension, we can use multiple imputation not only for linear regression but also for logistic regression, Poisson regression, t-tests, ANOVA, ANCOVA, and more.
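As an illustration of pooling a generalized linear model (a hypothetical analysis, not one pursued in this chapter), we could fit a logistic regression predicting transmission type from weight and rear axle ratio on each imputed dataset and pool the results in exactly the same way:

# fit a logistic regression (a glm with a binomial family) on each of the
# m imputed datasets, then pool the coefficients and standard errors
logit_models <- with(imp, glm(am ~ wt + drat, family=binomial))
summary(pool(logit_models))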