Dealing with missing values

IoT data is notoriously messy (just in case the message has not been driven home yet) and missing values are a common occurrence. There are some options to deal with this problem in order to enhance the quality of your ML models. This is where the art comes into play and judgment is important.

The following are some methods for handling missing values:

  • Remove data rows with missing values: This is crude, but if only a small percentage is lost, and this percentage appears to be random, then it will have minimal effect on the results. Use a tool such as Tableau to analyze the data with missing values and compare it to the data without missing values to judge the impact of removing the rows. R and Python work well for this task also.
  • Do not use features with a high number of missing values: Just take them out. A model built with a feature that has a high percentage of imputed values will work about as well as a bicycle held together by bubble gum and toothpicks. The results will be questionable.
  • Impute the values using the mean, median, or mode of the valid values for the feature: This is somewhat unrefined but can work well in some situations. Always analyze the data using techniques such as those introduced in Chapter 6, Getting to Know Your Data - Exploring IoT Data, to determine what makes sense.
  • Create an ML model to impute the values based on the other features, then use the results in the ML model that predicts the target variable (what you want to predict, which is the purpose of building the model): This is nested modeling. Now, we are cooking with gas!
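
As a rough illustration of the first and third options, the following sketch uses only base R and the built-in airquality dataset. The helper name impute_median is hypothetical and is introduced here only for the example; treat this as a starting point rather than a recommended recipe:

```r
# Illustrative sketch using the built-in airquality dataset (base R only)
df <- airquality

# Option 1: drop every row that contains at least one missing value
complete_rows <- na.omit(df)
rows_lost_pct <- 100 * (nrow(df) - nrow(complete_rows)) / nrow(df)

# Compare a fully observed feature with and without the incomplete rows
# to get a first impression of the impact of dropping them
mean_temp_all <- mean(df$Temp)
mean_temp_complete <- mean(complete_rows$Temp)

# Option 3: replace missing values with the median of the valid values
# (impute_median is a hypothetical helper defined only for this sketch)
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
df_imputed <- df
df_imputed$Ozone <- impute_median(df$Ozone)
df_imputed$Solar.R <- impute_median(df$Solar.R)

print(round(rows_lost_pct, 1))
print(c(all_rows = mean_temp_all, complete_rows = mean_temp_complete))
print(colSums(is.na(df_imputed)))
```

If the percentage of lost rows is large, or the two means differ noticeably, simply removing rows is probably too crude for that dataset.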

The mice package in R is useful for identifying and handling missing values. The name mice is short for Multivariate Imputation by Chained Equations. It has multiple functions to do some advanced imputation to fill in missing values. It can use ML techniques, such as random forests and logistic regression, to impute values.

The R mice package. These guys impute missing values and live off residual keyboard cheese

The following code demonstrates a very simple example of using mice to impute values. We will start by loading in a sample dataset, airquality, which comes with the R installation. It represents data similar to what may be obtained with IoT devices:

#make sure all needed packages are installed
if(!require(mice)){
  install.packages("mice")
}
if(!require(VIM)){
  install.packages("VIM")
}
if(!require(lattice)){
  install.packages("lattice")
}

library(mice)
library(VIM)
library(lattice)

#load the airquality dataset (comes with R)
mice_example_data <- airquality

Next, we will summarize the data to view statistics and have an idea of where the missing values are:

#summarize original data. Note NAs in Ozone and Solar.R
summary(airquality)

The summary shows the following results. Note the pattern of missing values (NAs):

We will remove some more values to demonstrate how mice can impute missing data:

#remove some data from the Temp field
mice_example_data[1:5,4] <- NA
#remove some data from the Wind field
mice_example_data[6:10,3] <- NA

#show summary, note the NA counts for Wind and Temp
summary(mice_example_data)

The summary now looks like the following:

We can use mice to take a more detailed look at the patterns in the missing data. The md.pattern() function shows how often each combination of missing features occurs:

#use mice to look at the pattern of missing data. The first unnamed column shows the count of rows
#with the complete or missing pattern, as indicated by a 1 (present) or 0 (missing) in the named columns
#The result shows 107 rows with complete data in all columns, 35 rows with only Ozone missing, 4 rows with only Solar.R missing, etc.
md.pattern(mice_example_data)

The output window will show the missing value pattern:

A more visual way to look at patterns is using an aggregation plot with the VIM package. The code and the following graph show where values are missing (red). The percentage numbers to the right also show how much of that feature is missing values. Watch out for any feature with over 5% missing. This gets tricky to impute appropriately and may be better left out of the dataset when building ML models. Use your best judgment. Ozone would be the only feature in this dataset over 5%:

#Let's view it visually using the VIM package
aggr_plot <- aggr(mice_example_data, col=c('gray','red'), numbers=TRUE, sortVars=TRUE, labels=names(mice_example_data), cex.axis=.7, gap=3, ylab=c("Histogram - missing data","Pattern"))
Aggregate plot of air quality data with added missing values
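
The 5% rule of thumb can also be checked numerically without any plotting. The following is a minimal sketch in base R; it recreates the example data so that it runs on its own, and the variable names pct_missing and over_threshold are introduced only for this illustration:

```r
# Recreate the example data with the extra missing values (base R only)
mice_example_data <- airquality
mice_example_data[1:5, 4] <- NA   # Temp
mice_example_data[6:10, 3] <- NA  # Wind

# Percentage of missing values per feature
pct_missing <- 100 * colMeans(is.na(mice_example_data))
print(round(pct_missing, 1))

# Features above the 5% rule of thumb used in the text
over_threshold <- names(pct_missing)[pct_missing > 5]
print(over_threshold)
```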

Now, we can use mice to impute the values. We will use the default of five sets of imputations:

#Remove the categorical variables for Month and Day before imputing.
#The mice() function is a Markov Chain Monte Carlo (MCMC) method that uses correlation of the data and
#imputes missing values for each feature m times (default is 5) by using regression of incomplete variables
#on the other variables iteratively with the maximum iterations set by maxit.
imputed_example_data <- mice(mice_example_data[-c(5,6)], m=5, printFlag=FALSE, maxit=50, seed=250)
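
Before plotting, it can help to look inside the object that mice() returns. The sketch below assumes the mice package is installed; it recreates the example data so that it runs on its own and uses a smaller maxit than the text purely to keep it quick. The short name imp is used only in this illustration:

```r
library(mice)

# Recreate the example data so this block runs on its own
mice_example_data <- airquality
mice_example_data[1:5, 4] <- NA   # Temp
mice_example_data[6:10, 3] <- NA  # Wind

# Fewer iterations than in the text, only to keep the sketch quick
imp <- mice(mice_example_data[-c(5, 6)], m = 5, maxit = 5,
            printFlag = FALSE, seed = 250)

# Imputation method chosen for each feature
# (predictive mean matching, "pmm", is the default for numeric data)
print(imp$method)

# Imputed Temp values: one row per missing observation,
# one column per imputed dataset
print(imp$imp$Temp)
```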

We can view how the imputed values compare against the actual known values using a density plot. The known values are shown in blue and the imputed values are in light red. Remember that we asked for five sets of imputed values (m=5), so there are five light red lines:

#view density plot of results. Blue line is the observed actual data, the red lines are from the imputed data
densityplot(imputed_example_data)
Density plot showing actual values against the imputed datasets

Finally, we can review the summary of the completed dataset including imputed values:

#use the complete function to get the dataset with imputed values filled in
completed_example_data <- complete(imputed_example_data,1)
#now let's look at the summary
summary(completed_example_data)
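
The call above returns only the first of the five imputed datasets. To get a feel for how much the imputations differ from each other, one option is to compare a simple statistic across all of them. This is a minimal sketch, assuming the mice package is installed; it recreates the data and uses a reduced maxit so that it runs quickly on its own, and the names imp, temp_means, and long_data are introduced only for this illustration:

```r
library(mice)

# Recreate the example data and the imputation so this block runs on its own
mice_example_data <- airquality
mice_example_data[1:5, 4] <- NA   # Temp
mice_example_data[6:10, 3] <- NA  # Wind
imp <- mice(mice_example_data[-c(5, 6)], m = 5, maxit = 5,
            printFlag = FALSE, seed = 250)

# Mean Temp in each of the five completed datasets
temp_means <- sapply(seq_len(imp$m),
                     function(i) mean(mice::complete(imp, i)$Temp))
print(temp_means)

# Stack all completed datasets into one long data frame;
# the .imp column identifies which imputed dataset a row belongs to
long_data <- mice::complete(imp, action = "long")
print(table(long_data$.imp))
```

Large differences between the five means would suggest that the imputed values are uncertain and that relying on a single completed dataset could be misleading.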

There are many imputation methods in mice. Using the following code, you can list the available imputation functions (their names start with mice.impute):

methods(mice)