Data imputation

Sometimes omitting missing values is not reasonable or not possible at all, for example due to a low number of observations, or when the missing data does not seem to be random. Data imputation is a real alternative in such situations: it replaces NA with actual values based on various algorithms, such as filling the empty cells with one of the following (see the short sketch after this list):

  • A known scalar
  • The previous value appearing in the column (hot-deck)
  • A random element from the same column
  • The most frequent value in the column
  • Different values from the same column with given probability
  • Predicted values based on regression or machine learning models
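Most of these simple strategies are one-liners in base R. The following is a minimal sketch on a toy vector x (defined here purely for illustration), assuming the zoo package is available for its na.locf function; the scalar and mean variants are demonstrated on real data later in this section:

> x <- c(1, 2, NA, 4, NA)
> replace(x, is.na(x), 0)    # a known scalar
[1] 1 2 0 4 0
> zoo::na.locf(x)            # the previous value (last observation carried forward)
[1] 1 2 2 4 4
> x[is.na(x)] <- sample(x[!is.na(x)], sum(is.na(x)), replace = TRUE)  # random elements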

The hot-deck method is often used when joining multiple datasets together. In such situations, the roll argument of data.table can be very useful and efficient; otherwise, be sure to check out the hotdeck function in the VIM package, which also offers some really useful ways of visualizing missing data.
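As an illustration of the rolling-join flavor of hot-deck imputation, here is a minimal sketch with data.table; the obs and grid tables and the time key are made up for this example, and missing time points simply inherit the last known value:

> library(data.table)
> obs  <- data.table(time = c(1, 3, 6), value = c(10, 20, 30), key = 'time')
> grid <- data.table(time = 1:6, key = 'time')
> obs[grid, roll = TRUE]
   time value
1:    1    10
2:    2    10
3:    3    20
4:    4    20
5:    5    20
6:    6    30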

When dealing with a single, already given column of a dataset, we have some other simple options as well. For instance, imputing a known scalar covers the case where we know that all missing values are due to some research design pattern. Think of a database that stores the times you arrived at and left the office every weekday; by computing the difference between the two, we can analyze the number of work hours spent in the office from day to day. If this variable is NA for a given day, it actually means that we were out of the office all day, so the computed value should be zero instead of NA.

This is pretty easy to implement in R as well (the example continues the previous demo code, where we defined m with two missing values):

> m[which(is.na(m), arr.ind = TRUE)] <- 0
> m
     [,1] [,2] [,3]
[1,]    1    0    7
[2,]    2    5    0
[3,]    3    6    9

Similarly, replacing missing values with a random number, a sample of other values, or the mean of the variable can be done relatively easily:

> ActualElapsedTime <- hflights$ActualElapsedTime
> mean(ActualElapsedTime, na.rm = TRUE)
[1] 129.3237
> ActualElapsedTime[which(is.na(ActualElapsedTime))] <-
+   mean(ActualElapsedTime, na.rm = TRUE)
> mean(ActualElapsedTime)
[1] 129.3237

This can be done even more easily with the impute function from the Hmisc package:

> library(Hmisc)
> mean(impute(hflights$ActualElapsedTime, mean))
[1] 129.3237

It seems that we have preserved the value of the arithmetic mean, but you should be aware of some very serious side effects:

> sd(hflights$ActualElapsedTime, na.rm = TRUE)
[1] 59.28584
> sd(ActualElapsedTime)
[1] 58.81199

When missing values are replaced with the mean, the variance of the transformed variable will naturally be lower than that of the original distribution. This can be extremely problematic in some situations, which call for more sophisticated methods.

Modeling missing values

Besides the previously mentioned univariate methods, you may also fit models on the complete cases in the dataset, and then use those models to estimate the missing values in the remaining rows. In a nutshell, we replace the missing values with multivariate predictions.
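As a quick illustration of the idea (a sketch only, with made-up data and a plain linear model as the predictor; the missForest approach discussed below is a more powerful alternative), we can fit a regression on the complete cases and predict the holes:

> d <- data.frame(x = 1:10,
+   y = c(2.1, 3.9, 6.2, NA, 10.1, 11.8, NA, 16.2, 17.9, 20.3))
> fit <- lm(y ~ x, data = d[complete.cases(d), ])            # fit on the complete cases
> d$y[is.na(d$y)] <- predict(fit, newdata = d[is.na(d$y), ]) # impute the rest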

There is a plethora of related functions and packages; for example, you might be interested in checking out the transcan function in the Hmisc package, or the imputeR package, which includes a wide variety of models for imputing categorical as well as continuous variables.

Most imputation methods and models work with only one type of variable: either continuous or categorical. In the case of a mixed-type dataset, we typically use different algorithms to handle the different types of missing data. The problem with this approach is that some of the possible relations between the different types of data might be ignored, resulting in partial models.

To overcome this issue, and to save a few pages of describing the traditional regression and other related methods for data imputation (although you can find some related methods in Chapter 5, Building Models (authored by Renata Nemeth and Gergely Toth) and Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth)), we will concentrate on a non-parametric method that can handle categorical and continuous variables at the same time via a very user-friendly interface in the missForest package.

This iterative procedure fits a random forest model on the available data in order to predict the missing values. As our hflights data is rather large for such a process, and running the sample code would take ages, we will instead use the standard iris dataset in the next examples.

First let's see the original structure of the dataset, which does not include any missing values:

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  

Now let's load the package and add some missing values (completely at random) to the dataset, to produce a reproducible minimal example for the forthcoming models:

> library(missForest)
> set.seed(81)
> miris <- prodNA(iris, noNA = 0.2)
> summary(miris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.100   Min.   :0.100  
 1st Qu.:5.200   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.450   Median :1.300  
 Mean   :5.878   Mean   :3.062   Mean   :3.905   Mean   :1.222  
 3rd Qu.:6.475   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.900  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
 NA's   :28      NA's   :29      NA's   :32      NA's   :33     
       Species  
 setosa    :40  
 versicolor:38  
 virginica :44  
 NA's      :28  

So now we have around 20 percent missing values in each column, as also stated in the bottom rows of the preceding summary. The number of completely random missing values is between 28 and 33 cases per variable.
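The per-column NA counts can also be queried directly; the output below simply restates the NA's rows of the preceding summary:

> colSums(is.na(miris))
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
          28           29           32           33           28 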

The next step is to build the random forest models to replace the missing values with real numbers and factor levels. As we also have the original dataset at hand, we can use that complete matrix to test the performance of the method via the xtrue argument, which computes and reports the true imputation error; calling the function with verbose = TRUE prints it after each iteration. This is useful in such didactic examples to show how the model and its predictions improve from iteration to iteration:

> iiris <- missForest(miris, xtrue = iris, verbose = TRUE)
  missForest iteration 1 in progress...done!
    error(s): 0.1512033 0.03571429 
    estimated error(s): 0.1541084 0.04098361 
    difference(s): 0.01449533 0.1533333 
    time: 0.124 seconds

  missForest iteration 2 in progress...done!
    error(s): 0.1482248 0.03571429 
    estimated error(s): 0.1402145 0.03278689 
    difference(s): 9.387853e-05 0 
    time: 0.114 seconds

  missForest iteration 3 in progress...done!
    error(s): 0.1567693 0.03571429 
    estimated error(s): 0.1384038 0.04098361 
    difference(s): 6.271654e-05 0 
    time: 0.152 seconds

  missForest iteration 4 in progress...done!
    error(s): 0.1586195 0.03571429 
    estimated error(s): 0.1419132 0.04918033 
    difference(s): 3.02275e-05 0 
    time: 0.116 seconds

  missForest iteration 5 in progress...done!
    error(s): 0.1574789 0.03571429 
    estimated error(s): 0.1397179 0.04098361 
    difference(s): 4.508345e-05 0 
    time: 0.114 seconds

The algorithm ran for five iterations before stopping, once it seemed that the error rate was no longer improving. The returned missForest object includes a few other values besides the imputed dataset:

> str(iiris)
List of 3
 $ ximp    :'data.frame':  150 obs. of  5 variables:
  ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 ...
  ..$ Sepal.Width : num [1:150] 3.5 3.3 3.2 3.29 3.6 ...
  ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.42 1.4 ...
  ..$ Petal.Width : num [1:150] 0.2 0.218 0.2 0.2 0.2 ...
  ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: ...
 $ OOBerror: Named num [1:2] 0.1419 0.0492
  ..- attr(*, "names")= chr [1:2] "NRMSE" "PFC"
 $ error   : Named num [1:2] 0.1586 0.0357
  ..- attr(*, "names")= chr [1:2] "NRMSE" "PFC"
 - attr(*, "class")= chr "missForest"

The out-of-bag (OOB) error is an estimate of how good our model is, based on the normalized root mean squared error (NRMSE) for numeric values and the proportion of falsely classified (PFC) entries for factors. And as we also provided the complete dataset to the previously run model, we also get the true imputation error ratios, which are pretty close to these estimates.
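If needed, these two error measures can also be extracted directly from the returned list for programmatic comparison, based on the structure shown above:

> iiris$OOBerror   # estimated errors: NRMSE and PFC
> iiris$error      # true errors, available because xtrue was provided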

Note

Please find more details on random forests and related machine learning topics in Chapter 10, Classification and Clustering.

But how does this approach compare to a much simpler imputation method, like replacing missing values with the mean?

Comparing different imputation methods

Only the first four columns of the iris dataset will be used in the comparison, so we are not dealing with the factor variable for the moment. Let's prepare this demo dataset:

> miris <- miris[, 1:4]

In iris_mean, we replace all the missing values with the mean of the respective columns:

> iris_mean <- impute(miris, fun = mean)
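If you prefer not to rely on Hmisc for this step, the same column-wise mean imputation can be sketched in base R as well (this version assumes all columns are numeric, which holds for miris after dropping the factor column above):

> iris_mean <- as.data.frame(lapply(miris, function(x) {
+     x[is.na(x)] <- mean(x, na.rm = TRUE)
+     x
+ }))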

And in iris_forest, we predict the missing values by fitting a random forest model:

> iris_forest <- missForest(miris)
  missForest iteration 1 in progress...done!
  missForest iteration 2 in progress...done!
  missForest iteration 3 in progress...done!
  missForest iteration 4 in progress...done!
  missForest iteration 5 in progress...done!

Now let's simply check the accuracy of the two approaches by comparing the correlations of iris_mean and iris_forest with the complete iris dataset. For iris_forest, we extract the actual imputed dataset from the ximp element of the returned list, and we silently ignore the factor variable of the original iris table:

> diag(cor(iris[, -5], iris_mean))
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.6633507    0.8140169    0.8924061    0.4763395 
> diag(cor(iris[, -5], iris_forest$ximp))
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.9850253    0.9320711    0.9911754    0.9868851

These results suggest that the non-parametric random forest model did a much better job than the simple univariate solution of replacing missing values with the mean.

Not imputing missing values

Please note that these methods have their drawbacks as well. With most models, replacing missing values with predicted ones lacks any error term and residual variance.

This also means that we are lowering the variability and overestimating some associations in the dataset at the same time, which can seriously affect the results of our data analysis. For this reason, some simulation techniques were introduced to overcome the problem of distorting the dataset and our hypothesis tests with arbitrary models.

Multiple imputation

The basic idea behind multiple imputation is to fit models several times in a row on the missing values. This Monte Carlo method usually creates a few (say, 3 to 10) parallel versions of the simulated complete dataset; each of these is analyzed separately, and then the results are combined to produce the actual estimates and confidence intervals. See, for example, the aregImpute function from the Hmisc package for more details.
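As a minimal sketch of how this might look with Hmisc (the formula and model below are illustrative only): aregImpute generates the multiple imputations, and fit.mult.impute fits a model on each completed dataset and pools the results:

> library(Hmisc)
> imp <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
+   data = miris, n.impute = 5)
> fit <- fit.mult.impute(Sepal.Length ~ Petal.Length + Petal.Width,
+   lm, imp, data = miris)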

On the other hand, do we really have to remove or impute missing values in all cases? For more details on this question, please see the last section of this chapter. But before that, let's get to know some other requirements for polishing data.
