The mice package

The mice package provides an alternative approach to imputation: a Bayesian procedure that does not assume multivariate normality. The approach used by mice relies on a procedure known as "chained equations". The basic operation goes as follows:

  1. Some preliminary starting imputation values are filled in for all missing data.
  2. The imputation starts for some particular variable, which we will call X. The starting imputation values are removed from X, so that the missing data in X is, once again, treated as missing. The preliminary imputations are left in place for all other variables. The observed values of X are regressed on, or otherwise matched to, the other variables (whose values may be observed or imputed). Based on this regression or matching, the missing values of X are filled in.
  3. Step 2 is then repeated for the next variable in line, Y, then for variable Z, and so on until the imputation has been done for all variables.
  4. Steps 2 and 3 are then repeated over and over again. The imputed values at the end of step 2 are tracked across iterations, and once they are no longer changing much, the imputation is thought to have converged.

    Note

    Many different functions can be used in step 2 to perform the regression or matching, and the user must choose which one to use.
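The steps above can be sketched in base R. This is a toy illustration only, not the mice implementation: the data are simulated, the starting values are column means, and step 2 is a plain linear regression with residual noise added to the predictions. A fully Bayesian version would also draw the regression parameters at random. All object names here are hypothetical.

```r
# Toy illustration of chained equations using base R only.
# NOT the mice implementation: step 2 here is a plain linear regression
# plus random residual noise, chosen purely for illustration.
set.seed(1)

# Hypothetical data: three correlated numeric variables with holes punched in.
n <- 200
x <- rnorm(n)
dat <- data.frame(X = x,
                  Y = 0.5 * x + rnorm(n, sd = 0.5),
                  Z = -0.3 * x + rnorm(n, sd = 0.5))
miss <- as.data.frame(lapply(dat, function(v) runif(n) < 0.2))  # TRUE = missing
for (v in names(dat)) dat[[v]][miss[[v]]] <- NA

# Step 1: fill every missing cell with a starting value (here, the column mean).
filled <- dat
for (v in names(dat)) {
  filled[[v]][miss[[v]]] <- mean(dat[[v]], na.rm = TRUE)
}

# Steps 2-4: cycle through the variables repeatedly.
n.iter <- 10
chain.mean <- matrix(NA_real_, nrow = n.iter, ncol = ncol(dat),
                     dimnames = list(NULL, names(dat)))
for (it in seq_len(n.iter)) {
  for (v in names(dat)) {
    obs <- !miss[[v]]
    # Regress the observed part of v on the other (observed or imputed) variables.
    fit <- lm(reformulate(setdiff(names(dat), v), response = v),
              data = filled[obs, ])
    # Replace the current imputations of v with prediction + residual noise.
    pred <- predict(fit, newdata = filled[!obs, ])
    filled[[v]][!obs] <- pred + rnorm(sum(!obs), sd = sigma(fit))
    # Track the mean of the imputed values for this variable and iteration.
    chain.mean[it, v] <- mean(filled[[v]][!obs])
  }
}

# Once these means stop trending, the procedure is usually considered settled.
round(chain.mean, 2)
```

Swapping the lm call for a different model or a matching rule changes the imputation function without changing the surrounding loop, which is the flexibility the note above refers to.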

Imputation functions in mice

The mice command in the mice package supports a large range of imputation methods, many of which do not assume normality. For categorical data, mice by default uses logistic regression for two-level factors and multinomial logistic regression for unordered factors with more than two levels. The default method for numeric data is predictive mean matching, which fills in missing data with values donated from observed cases. Donors are the observed cases whose values predicted by a linear regression lie closest to the predicted value for the missing case. The advantage of this approach over a simple regression-based approach is that it is less sensitive to violations of normality, and since it borrows observed values to fill in missing data, nonsensical imputed data is less of a concern.
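To make the donor idea concrete, here is a minimal base R sketch of predictive mean matching for a single variable. It is a simplification rather than the routine mice uses: the data are simulated, and the single closest donor is taken, whereas mice also draws the regression coefficients at random and samples from several close donors.

```r
# Minimal sketch of predictive mean matching for one variable (base R).
# Simplification: one closest donor, no random draw of the coefficients.
set.seed(2)

# Hypothetical data: y is skewed and strictly positive, x is fully observed.
n <- 100
x <- runif(n, 0, 10)
y <- exp(0.2 * x + rnorm(n, sd = 0.3))
y[sample(n, 20)] <- NA

obs <- !is.na(y)
fit <- lm(y ~ x, subset = obs)   # linear regression on the observed cases

# Predicted values for the observed and for the missing cases.
pred.obs <- unname(predict(fit, newdata = data.frame(x = x[obs])))
pred.mis <- unname(predict(fit, newdata = data.frame(x = x[!obs])))

# For each missing case, find the observed case with the closest predicted
# value and borrow that case's actual observed y.
donor <- vapply(pred.mis,
                function(p) which.min(abs(pred.obs - p)),
                integer(1))
y.imputed <- y
y.imputed[!obs] <- y[obs][donor]

# Every imputed value is a real observed value.
all(y.imputed[!obs] %in% y[obs])
```

Because the imputed values are copied from observed cases, they stay within the observed range even though the straight-line fit is a poor model for this skewed variable.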

The mice command also supports a range of other imputation functions, which the user can specify.
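As a brief illustration, the method argument of mice takes one method name per column. The sketch below uses the small nhanes2 dataset that ships with mice as a stand-in, since it is quick to run; the particular methods chosen are only an example, not a recommendation.

```r
library(mice)

# nhanes2 ships with mice. Its columns, in order:
#   age (factor, fully observed), bmi (numeric),
#   hyp (two-level factor), chl (numeric)
meth <- c("",        # age: nothing to impute
          "pmm",     # bmi: predictive mean matching
          "logreg",  # hyp: logistic regression
          "norm")    # chl: Bayesian linear regression

imp <- mice(nhanes2, method = meth, m = 2, maxit = 2,
            seed = 1, printFlag = FALSE)

# The method recorded for each column.
imp$method
```

An empty string tells mice to skip a column, which is appropriate for variables with no missing values.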

Before we go on, we will rename the variables with their respective letters for simplicity, as follows:

names(phys.func.rm) <- LETTERS[1:20]

The mice command allows us to specify, as a matrix, which variables will be used to impute which other variables. Typically, we try to use all of the relevant variables, and only those, to perform imputations. This consideration is especially important if the dataset has a large number of variables. For instance, a variable like hair color would not be used to impute something like age. Here, we demonstrate the use of the predictor matrix in the physical functioning dataset, which has only 20 variables. In the prior section, we discussed the possibility that some of these variables relate to social engagement and cognition, some relate to leg function, and some relate to arm function. We will use this grouping for imputation prediction.

We create a square matrix. Each row represents the variable to be imputed (for example, the first row is A, the second row is B, and so on). Each column represents the variable being used as a predictor. A 1 in a cell means that the column variable is used to impute the row variable, and a 0 means that it is not. Let's have a look at the following matrix:

predictor.matrix <- matrix(
  c(
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,
  0,0,1,1,0,0,0,1,1,1,0,0,1,1,0,0,0,0,0,0,
  0,1,0,1,0,0,0,1,1,1,0,0,1,1,0,0,0,0,0,0,
  0,1,1,0,0,0,0,1,1,1,0,0,1,1,0,0,0,0,0,0,
  0,0,0,0,0,1,1,0,0,0,1,1,0,0,1,1,0,0,0,1,
  0,0,0,0,1,0,1,0,0,0,1,1,0,0,1,1,0,0,0,1,
  0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,1,0,0,0,1,
  0,1,1,1,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,
  0,1,1,1,0,0,0,1,0,1,0,0,1,1,0,0,0,0,0,0,
  0,1,1,1,0,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,
  0,0,0,0,1,1,1,0,0,0,0,1,0,0,1,1,0,0,0,1,
  0,0,0,0,1,1,1,0,0,0,1,0,0,0,1,1,0,0,0,1,
  0,1,1,1,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,0,
  0,1,1,1,0,0,0,1,1,1,0,0,1,0,0,0,0,0,0,0,
  0,0,0,0,1,1,1,0,0,0,1,1,0,0,0,1,0,0,0,1,
  0,0,0,0,1,1,1,0,0,0,1,1,0,0,1,0,0,0,0,1,
  1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,
  1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,
  1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,
  0,0,0,0,1,1,1,0,0,0,1,1,0,0,1,1,0,0,0,0
  ),
  nrow = 20,
  byrow = TRUE
)
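Typing 400 cells by hand is error-prone, so a matrix like this can also be generated from the groups. The following is a sketch under two assumptions: that every variable is predicted by all other members of its own group, and that the group membership is as inferred from the rows of the matrix and from the regression formula used later in this section. The names vars, groups, and pm are hypothetical.

```r
# Sketch: build a 20 x 20 predictor matrix from variable groups instead of
# typing it by hand. Group membership is inferred, not stated in the text.
vars <- LETTERS[1:20]
groups <- list(
  social.cognition = c("A", "Q", "R", "S"),
  legs             = c("B", "C", "D", "H", "I", "J", "M", "N"),
  arms             = c("E", "F", "G", "K", "L", "O", "P", "T")
)

# Rows are the variables to be imputed, columns are the predictors.
pm <- matrix(0, nrow = 20, ncol = 20, dimnames = list(vars, vars))
for (g in groups) pm[g, g] <- 1   # predict each variable from its own group
diag(pm) <- 0                     # a variable never predicts itself

pm["B", ]      # which variables are used to impute B
rowSums(pm)    # number of predictors per variable
```

Comparing the result with the hand-typed version, for example with all(pm == predictor.matrix), is a quick way to catch typing slips.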

We will then impute the missing values, obtaining five imputed datasets (m = 5) and running six iterations of the chained equations for each (maxit = 6), as follows:

library(mice)
imputed.phys.func <- mice(phys.func.rm, predictorMatrix = predictor.matrix, m = 5, seed = 10, maxit = 6)
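Before analyzing the imputed data, it is worth checking whether the chains have settled and looking at one completed dataset. The sketch below uses the small nhanes dataset bundled with mice as a stand-in, because the physical functioning data are not reproduced here; the same calls should apply to imputed.phys.func.

```r
library(mice)

# Stand-in data: the small nhanes dataset bundled with mice.
imp <- mice(nhanes, m = 5, maxit = 6, seed = 10, printFlag = FALSE)

# Trace plots: mean and standard deviation of the imputed values at each
# iteration, one line per imputation. Lines that intermingle without a
# clear trend suggest that the procedure has converged.
plot(imp)

# The same chain means as numbers: variables x iterations x imputations.
dim(imp$chainMean)

# Extract the first completed dataset for inspection.
first <- complete(imp, 1)
head(first)
```

If the trace lines are still drifting at the last iteration, increasing maxit is the usual remedy.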

Once again, we now have five imputed datasets, but no substantive results yet. Suppose we want to ask whether the total leg function score is predicted by the total arm function score. We can use the with command to fit a regression model to each imputed dataset, as follows:

legs.v.arms.models <- with(imputed.phys.func, lm( I(B+C+D+H+I+J+M+N) ~ I(E+F+G+K+L+O+P+T) ))

We can then pool the results to get estimates as follows:

leg.v.arm.pool <- pool(legs.v.arms.models)

Calling the summary function on the mipo object returned by pool displays the pooled estimates and related statistics, as follows:

summary(leg.v.arm.pool)

Let's have a look at the following screenshot:

[Screenshot: output of summary(leg.v.arm.pool)]

We get the estimates of the intercept and slope for the regression model, along with their standard errors and the usual regression model statistics. We also get the fraction of missing information (column fmi) and the proportion of the total variance attributable to missing data (column lambda). As we can see here, a little over a fifth of the variability is attributable to missing data. Note that in mice 3.0 and later, summary no longer reports fmi and lambda; they are stored in the pooled object itself and appear when leg.v.arm.pool is printed.
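To show where lambda comes from, the pooling can be reproduced by hand for a single coefficient using Rubin's rules. The estimates and standard errors below are made-up illustrative numbers, not output from the physical functioning data, and the object names are hypothetical.

```r
# Rubin's rules by hand for one coefficient (base R).
# est and se are made-up illustrative values, not results from this chapter.
est <- c(1.02, 0.95, 1.10, 0.99, 1.05)   # slope estimate in each of m fits
se  <- c(0.20, 0.21, 0.19, 0.20, 0.22)   # its standard error in each fit
m   <- length(est)

q.bar <- mean(est)                # pooled estimate
u.bar <- mean(se^2)               # within-imputation variance
b     <- var(est)                 # between-imputation variance
t.var <- u.bar + (1 + 1/m) * b    # total variance

# Proportion of the total variance attributable to missing data.
lambda <- (1 + 1/m) * b / t.var

c(estimate = q.bar, std.error = sqrt(t.var), lambda = lambda)
```

The fmi column is a closely related quantity that additionally adjusts for the finite number of imputations.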
