Data imputation

Sometimes, your data may have missing values. This could be due to errors in the data collection process, genuinely missing data, or any other reason, with the net result being that the information is not available. A common real-world example is survey data in which respondents did not answer specific questions.

Say you have a dataset of 1,000 records and 20 columns, of which a certain column has 100 missing values. One option is to discard the records that contain the missing values, but each of those records still has complete data in the other 19 columns, so you would be throwing away information that is 95 percent complete. Another option is to exclude the column entirely, but then you cannot leverage the benefit afforded by the data that is available in that column.
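To make the two options concrete, here is a minimal sketch in R with simulated data; the data frame df and its dimensions simply mirror the hypothetical 1,000-record, 20-column dataset described above:

# A simulated 1,000 x 20 data frame with 100 NAs in one column
df <- data.frame(matrix(rnorm(1000 * 20), ncol = 20))
df$X1[sample(1000, 100)] <- NA

# Option 1: discard the records (rows) that contain missing values
df_rows_dropped <- df[complete.cases(df), ]
dim(df_rows_dropped)   # 900 rows, 20 columns

# Option 2: discard the column that contains missing values
df_col_dropped <- df[, colSums(is.na(df)) == 0]
dim(df_col_dropped)    # 1,000 rows, 19 columns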

Several methods exist for data imputation, that is, the process of filling in missing data. We do not know what the exact values are, but by looking at the other entries in the table, we may be able to make an educated and systematic assessment of what the values might be.

Some of the common methods for data imputation include the following:

  • Mean, median, or mode imputation: Substituting the missing values with the mean, median, or mode of the column. This, however, has the disadvantage of artificially reducing the variance of the imputed column and distorting its correlations with other variables, which might not be desirable for multivariate analysis.
  • K-nearest neighbors imputation: kNN imputation uses a machine learning approach (nearest neighbors) to impute missing values. It works by finding the k records most similar to the record with missing values (typically using Euclidean distance) and imputing a weighted average of the corresponding values from those k records.
  • Imputation using regression models: This uses standard regression methods in R to predict the value of the missing variable from the other columns. However, as noted in the section on regression imputation in the Wikipedia article on imputation (https://en.wikipedia.org/wiki/Imputation_(statistics)#Regression), the problem is that the imputed data do not have an error term included in their estimation, so the estimates fit perfectly along the regression line without any residual variance. This causes relationships to be over-identified and suggests greater precision in the imputed values than is warranted (a short sketch of this issue follows the list).
  • Hot-deck imputation: Another technique that fills missing values with observed values drawn from the dataset itself. Although widely used, it has a limitation: assigning, say, a single donor value to a large number of missing entries can introduce significant bias into the observations and produce misleading results (a simple sketch also follows the list).
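To see the regression imputation issue in practice, here is a minimal sketch with simulated data (the variable names are hypothetical): the imputed values fall exactly on the fitted line, so they show no residual variance at all, unlike the observed values.

set.seed(100)
x <- rnorm(200)
y <- 2 * x + rnorm(200)

# Introduce missing values in y and fit a regression on the complete cases
y_miss <- y
y_miss[1:50] <- NA
fit <- lm(y_miss ~ x)

# Impute the missing values from the fitted line
imputed <- predict(fit, newdata = data.frame(x = x[1:50]))

# Observed values scatter around the line (sd close to 1, the noise level)...
sd(y[51:200] - predict(fit, newdata = data.frame(x = x[51:200])))

# ...but the imputed values sit exactly on it: zero residual variance
sd(imputed - predict(fit, newdata = data.frame(x = x[1:50])))

# [1] 0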

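Similarly, the hot-deck idea can be sketched in a few lines. The function below implements only the simplest random hot-deck, drawing donors at random from the observed values of the same column; the packages listed at the end of this section select donors far more carefully (for example, by matching on auxiliary variables):

# A minimal random hot-deck: replace each NA with a randomly
# drawn donor value from the observed entries of the same vector
hot_deck <- function(x) {
  donors <- x[!is.na(x)]
  x[is.na(x)] <- sample(donors, sum(is.na(x)), replace = TRUE)
  x
}

# Example: impute a numeric vector with two missing entries
hot_deck(c(1, 2, NA, 4, NA))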
A short example is provided here to demonstrate how imputation can be done using kNN imputation. We simulate missing data by changing a large number of values to NA in the PimaIndiansDiabetes dataset (available in the mlbench package).

We take the following approach:

  • First, we fill in the NA values using the mean of the column.
  • Then, we use kNN imputation to fill in the same missing values, and compare how the two methods performed:
library(DMwR)
library(caret)
library(mlbench)

# Load the PimaIndiansDiabetes dataset from the mlbench package
data(PimaIndiansDiabetes)
diab <- PimaIndiansDiabetes

# In the dataset, the column mass represents the body mass index
# of the individuals represented in the corresponding row

# mass: body mass index (weight in kg/(height in m)^2)

# Create a backup of the diabetes dataframe
diabmiss_orig <- diab

# Create a separate dataframe that we will modify
diabmiss <- diabmiss_orig

# Save the original values for body mass
actual <- diabmiss_orig$mass

# Change 91 values of mass to NA in the dataset
diabmiss$mass[10:100] <- NA

# Number of missing values in mass
sum(is.na(diabmiss$mass))

# [1] 91

# View rows 5 to 15, which include some of the missing values
diabmiss[5:15, ]

This displays rows 5 to 15 of the dataset, in which the NA values in the mass column can be seen.

# Test using the mean: we will set all the missing values
# to the mean value for the column

diabmiss$mass[is.na(diabmiss$mass)] <- mean(diabmiss$mass, na.rm = TRUE)

# Check the values that have been imputed
data.frame(actual = actual[10:100], impute_with_mean = diabmiss$mass[10:100])

This prints a comparison of the actual values of mass against the mean-imputed values.

# Check the root mean squared error (RMSE) for the entire column
# RMSE provides an estimate of the difference between
# the actual and the predicted values, on 'average'

diabmissdf <- data.frame(actual = actual, impute_with_mean = diabmiss$mass)
rmse1 <- RMSE(diabmissdf$impute_with_mean, actual)
rmse1

# [1] 3.417476
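
# Note: caret's RMSE(pred, obs) is, in effect,
# sqrt(mean((pred - obs)^2)) -- a quick sanity check of rmse1
# against a hand-rolled version:
sqrt(mean((diabmissdf$impute_with_mean - actual)^2))

# [1] 3.417476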
 
# We will re-run the exercise using knnImputation (from package DMwR)

# Change the values of the records back to NA
diabmiss <- diabmiss_orig
diabmiss$mass[10:100] <- NA

# Perform kNN imputation using k = 25 nearest neighbors
diabknn <- knnImputation(diabmiss, k = 25)

# Check the RMSE value for the knnImputation method
rmse2 <- RMSE(diabknn$mass, actual)
rmse2

# [1] 3.093827

# Improvement using the knnImputation method, in percentage terms
100 * (rmse1 - rmse2) / rmse1

# [1] 9.47041

While a reduction of roughly 9 percent in the RMSE may not represent a dramatic change, it is still better than a naïve approach such as simply using the mean or a constant value.

There are several packages in R for data imputation. A few prominent ones are as follows:

  • Amelia II: Imputation of missing values in time-series data (https://gking.harvard.edu/amelia)
  • HotDeckImputation and hot.deck: Hot-deck imputation (https://cran.r-project.org/web/packages/HotDeckImputation/ and https://cran.r-project.org/web/packages/hot.deck/)
  • mice: Multivariate Imputation by Chained Equations; a minimal usage sketch follows this list (https://cran.r-project.org/web/packages/mice/index.html)
  • mi: Imputation of values in a Bayesian framework (https://cran.r-project.org/web/packages/mi/index.html)
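
As a quick illustration, the following is a minimal sketch using mice on the diabmiss data frame created earlier. It relies only on mice's documented defaults (m completed datasets; predictive mean matching, pmm, for numeric columns); the seed value is an arbitrary choice for reproducibility:

library(mice)

# Generate 5 completed datasets using predictive mean matching (pmm)
imp <- mice(diabmiss, m = 5, method = "pmm", seed = 100, printFlag = FALSE)

# Extract the first completed dataset and compare it with the actuals
diabmice <- complete(imp, 1)
RMSE(diabmice$mass, actual)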
