Best practice 4 – dealing with missing data

For various reasons, real-world datasets are rarely clean and often contain missing or corrupted values. These are usually presented as blanks, Null, -1, 999999, unknown, or some other placeholder. Samples with missing data not only provide incomplete predictive information, but can also confuse the machine learning model, as it cannot tell whether -1 or unknown holds any meaning. It is important to pinpoint and deal with missing data in order to avoid jeopardizing the performance of models in later stages.

Here are three basic strategies that we can use to tackle the missing data issue:

  • Discarding samples containing any missing value
  • Discarding fields containing missing values in any sample
  • Inferring the missing values based on the known part of the data. This process is called missing data imputation. Typical imputation methods include replacing a missing value with the mean or median value of the field across all samples, or with the most frequent value for categorical data.

The first two strategies are simple to implement; however, they come at the expense of lost data, especially when the original dataset is not large. The third strategy doesn't abandon any data, but instead tries to fill in the blanks.
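
For the two discarding strategies, a minimal sketch using pandas is shown below; the DataFrame construction, column names, and variable names are illustrative assumptions rather than code from this section:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'age': [30, 20, 35, 25, 30, 40],
...                    'income': [100, 50, np.nan, 80, 70, 60]})
>>> df_rm_samples = df.dropna(axis=0)  # strategy 1: discard samples (rows) with any missing value
>>> df_rm_fields = df.dropna(axis=1)   # strategy 2: discard fields (columns) containing missing values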

Let's look at how each strategy is applied in an example where we have a dataset (age, income) consisting of six samples (30, 100), (20, 50), (35, unknown), (25, 80), (30, 70), and (40, 60):

  • If we process this dataset using the first strategy, it becomes (30, 100), (20, 50), (25, 80), (30, 70), and (40, 60)
  • If we employ the second strategy, the dataset becomes (30), (20), (35), (25), (30), and (40), where only the first field remains
  • If we decide to complete the unknown value instead of skipping it, the sample (35, unknown) can be transformed into (35, 72), using the mean of the remaining values in the second field, or into (35, 70), using their median
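
As a quick check of the arithmetic in the last bullet, the mean and median of the five known second-field values can be verified with numpy:

>>> import numpy as np
>>> known_income = [100, 50, 80, 70, 60]  # the known values in the second field
>>> np.mean(known_income)   # the value used for mean imputation
72.0
>>> np.median(known_income) # the value used for median imputation
70.0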

In scikit-learn, the SimpleImputer class provides a nicely written imputation transformer. We herein use it for the following small example:

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer

Represent the unknown value by np.nan in numpy, as detailed in the following:

>>> data_origin = [[30, 100],
... [20, 50],
... [35, np.nan],
... [25, 80],
... [30, 70],
... [40, 60]]

Initialize the imputation transformer with the mean strategy and learn the mean values from the original data:

>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit(data_origin)

Complete the missing value as follows:

>>> data_mean_imp = imp_mean.transform(data_origin)
>>> print(data_mean_imp)
[[ 30. 100.]
 [ 20.  50.]
 [ 35.  72.]
 [ 25.  80.]
 [ 30.  70.]
 [ 40.  60.]]

Similarly, initialize the imputation transformer with the median strategy, as detailed in the following:

>>> imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
>>> imp_median.fit(data_origin)
>>> data_median_imp = imp_median.transform(data_origin)
>>> print(data_median_imp)
[[ 30. 100.]
 [ 20.  50.]
 [ 35.  70.]
 [ 25.  80.]
 [ 30.  70.]
 [ 40.  60.]]

When new samples come in, the missing values (in any attribute) can be imputed using the trained transformer, for example, with the mean value, as shown here:

>>> new = [[20, np.nan],
... [30, np.nan],
... [np.nan, 70],
... [np.nan, np.nan]]
>>> new_mean_imp = imp_mean.transform(new)
>>> print(new_mean_imp)
[[ 20. 72.]
 [ 30. 72.]
 [ 30. 70.]
 [ 30. 72.]]

Note that 30 in the age field is the mean of those six age values in the original dataset.
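
If you want to confirm what the transformer learned during fitting, SimpleImputer stores the per-field statistics in its statistics_ attribute; a quick check looks like this:

>>> print(imp_mean.statistics_)
[30. 72.]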

Now that we have seen how imputation works, as well as its implementation, let's explore how imputing missing values and discarding missing data compare in terms of prediction results, through the following example:

  1. First, we load the diabetes dataset, which we will then corrupt with missing values:
>>> from sklearn import datasets
>>> dataset = datasets.load_diabetes()
>>> X_full, y = dataset.data, dataset.target
  2. Simulate a corrupted dataset by adding 25% missing values:
>>> m, n = X_full.shape
>>> m_missing = int(m * 0.25)
>>> print(m, m_missing)
442 110
  3. Randomly select the m_missing samples, as follows:
>>> np.random.seed(42)
>>> missing_samples = np.array([True] * m_missing +
...                            [False] * (m - m_missing))
>>> np.random.shuffle(missing_samples)
  4. For each missing sample, randomly select 1 out of n features:
>>> missing_features = np.random.randint(low=0, high=n,
...                                      size=m_missing)
  5. Represent missing values by nan, as shown here:
>>> X_missing = X_full.copy()
>>> X_missing[np.where(missing_samples)[0], missing_features] = np.nan
  6. Then, we deal with this corrupted dataset by discarding the samples containing a missing value:
>>> X_rm_missing = X_missing[~missing_samples, :]
>>> y_rm_missing = y[~missing_samples]

  7. Measure the effects of using this strategy by estimating the averaged regression score, R², with a regression forest model in a cross-validation manner. Estimate R² on the dataset with the missing samples removed, as follows:
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import cross_val_score
>>> regressor = RandomForestRegressor(random_state=42,
...                                   max_depth=10, n_estimators=100)
>>> score_rm_missing = cross_val_score(regressor, X_rm_missing,
...                                    y_rm_missing).mean()
>>> print('Score with the data set with missing samples removed: {0:.2f}'.format(score_rm_missing))
Score with the data set with missing samples removed: 0.39
  8. Now, we approach the corrupted dataset differently, by imputing missing values with the mean, as shown here:
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> X_mean_imp = imp_mean.fit_transform(X_missing)
  9. Similarly, measure the effects of using this strategy by estimating the averaged R², as follows:
>>> regressor = RandomForestRegressor(random_state=42,
...                                   max_depth=10, n_estimators=100)
>>> score_mean_imp = cross_val_score(regressor, X_mean_imp,
...                                  y).mean()
>>> print('Score with the data set with missing values replaced by mean: {0:.2f}'.format(score_mean_imp))
Score with the data set with missing values replaced by mean: 0.42
  10. The imputation strategy works better than discarding in this case. So, how far is the imputed dataset from the original full one? We can check by estimating the averaged regression score on the original dataset, as follows:
>>> regressor = RandomForestRegressor(random_state=42,
...                                   max_depth=10, n_estimators=500)
>>> score_full = cross_val_score(regressor, X_full, y).mean()
>>> print('Score with the full data set: {0:.2f}'.format(score_full))
Score with the full data set: 0.44

It turns out that little information is compromised in the completed dataset.

However, there is no guarantee that an imputation strategy always works better; sometimes, dropping samples with missing values can be more effective. Hence, it is good practice to compare the performance of different strategies via cross-validation, as we have done previously.
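
Note that in the comparison above, the mean values were estimated from the whole corrupted dataset before cross-validation. A stricter comparison, sketched below under the assumption that scikit-learn's Pipeline is used, re-learns the imputation statistics on each training fold only; the median strategy here is just an illustrative alternative:

>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([
...     ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
...     ('regressor', RandomForestRegressor(random_state=42,
...                                         max_depth=10, n_estimators=100))
... ])
>>> score_pipeline = cross_val_score(pipeline, X_missing, y).mean()
>>> print('Score with median imputation inside each training fold: {0:.2f}'.format(score_pipeline))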
