And sometimes omitting missing values is not reasonable or possible at all, for example due to a low number of observations, or when the missing data does not seem to be random. Data imputation is a real alternative in such situations: it replaces the NA values with real values based on various algorithms, such as filling the empty cells with a known scalar, a random number, a sample of the other observed values, the mean of the variable, or hot-deck values.
The hot-deck method is often used when joining multiple datasets together. In such a situation, the roll argument of data.table can be very useful and efficient; otherwise, be sure to check out the hotdeck function in the VIM package, which also offers some really useful ways of visualizing missing data. But when dealing with an already given column of a dataset, we have some other simple options as well.
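To illustrate the rolling-join idea, here is a minimal sketch (the dt table and its timestamps are made up for this example): with a keyed data.table, roll = TRUE carries the last observed value forward to any requested key that has no exact match, which is essentially last-observation-carried-forward hot-deck imputation.

```r
library(data.table)

## Hypothetical measurements with gaps at times 3, 5 and 6
dt <- data.table(time = c(1, 2, 4, 7), value = c(10, 20, 40, 70),
                 key = "time")

## Join on all timestamps; roll = TRUE fills the gaps with the
## last observed value (LOCF-style hot-deck imputation)
dt[.(1:7), roll = TRUE]
```

The same call with roll = -Inf would roll the next observation backwards instead, and a numeric roll value limits how far a value may be carried.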
For instance, imputing a known scalar covers a pretty simple situation, where we know that all missing values are due to, for example, some research design pattern. Let's think of a database that stores the time you arrived at and left the office every weekday; by computing the difference between those two, we can analyze the number of work hours spent in the office from day to day. If this variable returns NA for a time period, it actually means that we were outside of the office all day, so the computed value should be zero instead of NA.
And this is not just theory; it is pretty easy to implement in R as well (the example continues from the previous demo code, where we defined m with two missing values):
> m[which(is.na(m), arr.ind = TRUE)] <- 0
> m
     [,1] [,2] [,3]
[1,]    1    0    7
[2,]    2    5    0
[3,]    3    6    9
Similarly, replacing the missing values with a random number, a sample of the other values, or the mean of a variable can be done relatively easily:
> ActualElapsedTime <- hflights$ActualElapsedTime
> mean(ActualElapsedTime, na.rm = TRUE)
[1] 129.3237
> ActualElapsedTime[which(is.na(ActualElapsedTime))] <-
+     mean(ActualElapsedTime, na.rm = TRUE)
> mean(ActualElapsedTime)
[1] 129.3237
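The sample-based variant mentioned above can be sketched in a few lines as well (the vector x is made up for this demo): each NA is replaced with a value drawn from the observed ones, which preserves the empirical distribution better than a single constant.

```r
set.seed(42)  # for reproducibility

## Toy vector (made up for this demo) with two missing values
x <- c(1, 2, NA, 4, NA, 6)

## Replace each NA with a value sampled from the observed values
x[is.na(x)] <- sample(x[!is.na(x)], size = sum(is.na(x)), replace = TRUE)
x
```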
This can be done even more easily with the impute function from the Hmisc package:
> library(Hmisc)
> mean(impute(hflights$ActualElapsedTime, mean))
[1] 129.3237
We have of course preserved the value of the arithmetic mean, but you should be aware of some very serious side effects:
> sd(hflights$ActualElapsedTime, na.rm = TRUE)
[1] 59.28584
> sd(ActualElapsedTime)
[1] 58.81199
When we replace the missing values with the mean, the variance of the transformed variable will naturally be lower compared to the original distribution. This can be extremely problematic in some situations, in which case more sophisticated methods are needed.
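The shrinkage is easy to demonstrate on a made-up toy vector: mean imputation leaves the mean untouched, but the standard deviation drops, because every imputed cell sits exactly at the center of the distribution.

```r
## Toy vector (made up for this demo) with two missing values
x <- c(1, 2, 3, 4, NA, NA, 10)

x_imp <- x
x_imp[is.na(x_imp)] <- mean(x, na.rm = TRUE)

mean(x, na.rm = TRUE) == mean(x_imp)  # TRUE: the mean is preserved
sd(x_imp) < sd(x, na.rm = TRUE)       # TRUE: but the variance shrinks
```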
Besides the previously mentioned univariate methods, you can also fit models on the complete cases of the dataset, and then use those fitted models to estimate the missing values in the remaining rows. In a nutshell, we replace the missing values with multivariate predictions.
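As a minimal sketch of this idea with base R (the deleted cells are chosen arbitrarily for the demo), we can fit a linear model on the complete cases and predict the missing cells from the other variables:

```r
d <- iris  # built-in dataset

## Artificially delete three values (cells chosen arbitrarily)
d$Sepal.Length[c(1, 51, 101)] <- NA

## lm() drops the incomplete cases by default, so the model is
## fitted on the complete rows only
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = d)

## Predict the missing cells from the other variables
miss <- is.na(d$Sepal.Length)
d$Sepal.Length[miss] <- predict(fit, newdata = d[miss, ])
anyNA(d$Sepal.Length)  # FALSE
```

Note that this plain regression imputation already suffers from the problem discussed later: the predictions carry no residual noise.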
There is a plethora of related functions and packages; for example, you might be interested in checking out the transcan function in the Hmisc package, or the imputeR package, which includes a wide variety of models for imputing both categorical and continuous variables.
Most imputation methods and models work for only one type of variable: either continuous or categorical. In the case of a mixed-type dataset, we typically use different algorithms to handle the different types of missing data. The problem with this approach is that some of the possible relations between the different types of data might be ignored, resulting in partial models.
To overcome this issue, and to save a few pages in the book on the description of traditional regression and other related methods for data imputation (although you can find some related methods in Chapter 5, Building Models (authored by Renata Nemeth and Gergely Toth) and Chapter 6, Beyond the Linear Trend Line (also authored by Renata Nemeth and Gergely Toth)), we will concentrate on a non-parametric method that can handle categorical and continuous variables at the same time via the very user-friendly interface of the missForest package.
This iterative procedure fits a random forest model on the available data in order to predict the missing values. As our hflights data is relatively large for such a process and running the sample code would take ages, we will instead use the standard iris dataset in the next examples.
First, let's see the original structure of the dataset, which does not include any missing values:
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50
Now let's load the package and add some missing values (completely at random) to the dataset, producing a reproducible minimal example for the forthcoming models:
> library(missForest)
> set.seed(81)
> miris <- prodNA(iris, noNA = 0.2)
> summary(miris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.100   Min.   :0.100
 1st Qu.:5.200   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.450   Median :1.300
 Mean   :5.878   Mean   :3.062   Mean   :3.905   Mean   :1.222
 3rd Qu.:6.475   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.900
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
 NA's   :28      NA's   :29      NA's   :32      NA's   :33
       Species
 setosa    :40
 versicolor:38
 virginica :44
 NA's      :28
So now we have around 20 percent missing values in each column, which is also stated in the bottom rows of the preceding summary; the number of completely random missing values ranges between 28 and 33 cases per variable.
The next step is to build the random forest models that will replace the missing values with real numbers and factor levels. As we also have the original dataset, we can use that complete matrix to test the performance of the method via the xtrue argument, which makes the function compute and report the true error rate after each iteration when called with verbose = TRUE. This is useful in didactic examples such as this one, to show how the model and its predictions improve from iteration to iteration:
> iiris <- missForest(miris, xtrue = iris, verbose = TRUE)
  missForest iteration 1 in progress...done!
    error(s): 0.1512033 0.03571429
    estimated error(s): 0.1541084 0.04098361
    difference(s): 0.01449533 0.1533333
    time: 0.124 seconds
  missForest iteration 2 in progress...done!
    error(s): 0.1482248 0.03571429
    estimated error(s): 0.1402145 0.03278689
    difference(s): 9.387853e-05 0
    time: 0.114 seconds
  missForest iteration 3 in progress...done!
    error(s): 0.1567693 0.03571429
    estimated error(s): 0.1384038 0.04098361
    difference(s): 6.271654e-05 0
    time: 0.152 seconds
  missForest iteration 4 in progress...done!
    error(s): 0.1586195 0.03571429
    estimated error(s): 0.1419132 0.04918033
    difference(s): 3.02275e-05 0
    time: 0.116 seconds
  missForest iteration 5 in progress...done!
    error(s): 0.1574789 0.03571429
    estimated error(s): 0.1397179 0.04098361
    difference(s): 4.508345e-05 0
    time: 0.114 seconds
The algorithm ran for 5 iterations before stopping, when it seemed that the error rate was no longer improving. The returned missForest object includes a few other values besides the imputed dataset:
> str(iiris)
List of 3
 $ ximp    :'data.frame': 150 obs. of 5 variables:
  ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 ...
  ..$ Sepal.Width : num [1:150] 3.5 3.3 3.2 3.29 3.6 ...
  ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.42 1.4 ...
  ..$ Petal.Width : num [1:150] 0.2 0.218 0.2 0.2 0.2 ...
  ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: ...
 $ OOBerror: Named num [1:2] 0.1419 0.0492
  ..- attr(*, "names")= chr [1:2] "NRMSE" "PFC"
 $ error   : Named num [1:2] 0.1586 0.0357
  ..- attr(*, "names")= chr [1:2] "NRMSE" "PFC"
 - attr(*, "class")= chr "missForest"
The out-of-bag (OOB) error is an estimate of how good our model is, based on the normalized root mean squared error (NRMSE) for numeric values and the proportion of falsely classified (PFC) entries for factors. And as we also provided the complete dataset for the previously run model, we also get the true imputation error ratio, which is pretty close to the preceding estimates.
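For reference, the NRMSE can be sketched roughly as follows; this is a simplified version of the normalization used by missForest, computed here only over the missing cells of a single made-up vector:

```r
## Rough sketch: root mean squared error over the missing cells,
## normalized by the variance of the true values in those cells
nrmse <- function(imputed, true, miss) {
  sqrt(mean((true[miss] - imputed[miss])^2) / var(true[miss]))
}

## Made-up example values
true    <- c(1, 2, 3, 4, 5)
imputed <- c(1, 2, 3.5, 4, 4.5)
miss    <- c(FALSE, FALSE, TRUE, FALSE, TRUE)
nrmse(imputed, true, miss)  # 0.3535534
```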
Please find more details on random forests and related machine learning topics in Chapter 10, Classification and Clustering.
But how does this approach compare to a much simpler imputation method, like replacing missing values with the mean?
In this comparison, only the first four columns of the iris dataset will be used, so we are not dealing with the factor variable for the moment. Let's prepare this demo dataset:
> miris <- miris[, 1:4]
In iris_mean, we replace all the missing values with the mean of the respective columns:
> iris_mean <- impute(miris, fun = mean)
And in iris_forest, we predict the missing values by fitting a random forest model:
> iris_forest <- missForest(miris)
  missForest iteration 1 in progress...done!
  missForest iteration 2 in progress...done!
  missForest iteration 3 in progress...done!
  missForest iteration 4 in progress...done!
  missForest iteration 5 in progress...done!
Now let's simply check the accuracy of the two approaches by comparing the correlations of iris_mean and iris_forest with the complete iris dataset. For iris_forest, we will extract the actual imputed dataset from the ximp element of the returned object, and we will silently ignore the factor variable of the original iris table:
> diag(cor(iris[, -5], iris_mean))
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
   0.6633507    0.8140169    0.8924061    0.4763395
> diag(cor(iris[, -5], iris_forest$ximp))
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
   0.9850253    0.9320711    0.9911754    0.9868851
These results suggest that the non-parametric random forest model did a much better job than the simple univariate solution of replacing the missing values with the mean.
Please note that these methods have their drawbacks as well. Replacing the missing values with predicted ones usually lacks any error term and residual variance in most models. This also means that we are lowering the variability and overestimating some associations in the dataset at the same time, which can seriously affect the results of our data analysis. For this reason, some simulation techniques were introduced in the past to overcome the problem of distorting the dataset and our hypothesis tests with an arbitrary model.
The basic idea behind multiple imputation is to fit models several times in a row on the missing values. This Monte Carlo method usually creates a few (say, 3 to 10) parallel versions of the simulated complete dataset; each of these is analyzed separately, and then we combine the results to produce the actual estimates and confidence intervals. See, for example, the aregImpute function from the Hmisc package for more details.
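The combining step can be sketched with Rubin's rules, assuming we already have the per-imputation point estimates and their variances (the numbers below are made up for illustration):

```r
## Hypothetical results from m = 3 imputed datasets
estimates <- c(1.02, 0.98, 1.05)    # per-imputation point estimates
variances <- c(0.040, 0.050, 0.045) # per-imputation (within) variances
m <- length(estimates)

qbar <- mean(estimates)   # pooled point estimate
ubar <- mean(variances)   # average within-imputation variance
b    <- var(estimates)    # between-imputation variance

## Rubin's total variance: the extra (1 + 1/m) * b term accounts
## for the uncertainty introduced by the imputation itself
total_var <- ubar + (1 + 1 / m) * b
c(estimate = qbar, variance = total_var)
```

The between-imputation term is exactly what the single-model approaches above are missing, which is why they understate the uncertainty of the results.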
On the other hand, do we really have to remove or impute missing values in all cases? For more details on this question, please see the last section of this chapter. But before that, let's get to know some other requirements for polishing data.