Handling the missing data

Another common requirement in feature engineering is the handling of missing data. For example, we might have a dataset that looks like this:

In [11]: import numpy as np
...      from numpy import nan
...      X = np.array([[ nan, 0,   3 ],
...                    [ 2,   9,  -8 ],
...                    [ 1,   nan, 1 ],
...                    [ 5,   2,   4 ],
...                    [ 7,   6,  -3 ]])

Most machine learning algorithms cannot handle Not a Number (NaN) values (nan in NumPy). Instead, we first have to replace every nan value with an appropriate fill value. This is known as imputation of missing values.
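To see why missing entries cause trouble, it helps to locate them and observe how they propagate through ordinary arithmetic. The following is a minimal sketch on the same matrix; the variable names missing, missing_per_column, and column_means are chosen here for illustration only:

```python
import numpy as np

# Same example matrix as above, with two missing entries
X = np.array([[np.nan, 0,      3],
              [2,      9,     -8],
              [1,      np.nan, 1],
              [5,      2,      4],
              [7,      6,     -3]])

# Boolean mask that is True wherever a value is missing
missing = np.isnan(X)

# Count the missing values in each column
missing_per_column = missing.sum(axis=0)
print(missing_per_column)

# An ordinary mean propagates nan: any column that contains
# a missing value yields nan as its mean
column_means = X.mean(axis=0)
print(column_means)
```

Only the third column, which has no missing entries, produces a usable mean here.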

scikit-learn's SimpleImputer offers three statistical strategies to impute missing values. Each one is computed column by column, that is, separately for every feature (a fourth strategy, constant, fills in a fixed value supplied by the user):

mean: Replaces all of the nan values in a column with the mean of that column's observed values
median: Replaces all of the nan values in a column with the median of that column's observed values
most_frequent: Replaces all of the nan values in a column with the most frequent value in that column
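The most_frequent strategy can be sketched in the same way as the other two. The small matrix X_demo below is hypothetical and was chosen so that each column has an unambiguous most frequent value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: 1.0 is the most frequent value in column 0,
# and 2.0 is the most frequent value in column 1
X_demo = np.array([[1.0,    2.0],
                   [1.0,    np.nan],
                   [np.nan, 2.0],
                   [3.0,    5.0]])

# Fill each missing entry with the most frequent value of its column
imp = SimpleImputer(strategy='most_frequent')
X_filled = imp.fit_transform(X_demo)
print(X_filled)
```

This strategy is mostly useful for categorical or discrete features, where a mean or median would not be meaningful.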

For example, the mean imputer can be called as follows:

In [12]: from sklearn.impute import SimpleImputer
...      imp = SimpleImputer(strategy='mean')
...      X2 = imp.fit_transform(X)
...      X2
Out[12]: array([[ 3.75,  0.  ,  3.  ],
                [ 2.  ,  9.  , -8.  ],
                [ 1.  ,  4.25,  1.  ],
                [ 5.  ,  2.  ,  4.  ],
                [ 7.  ,  6.  , -3.  ]])

This replaced each of the two nan values with the mean of the remaining values in its column. We can double-check the math by calculating the mean of the first column (excluding the missing first element, X[0, 0]) and comparing this number to the first element of the imputed matrix, X2[0, 0]:

In [13]: np.mean(X[1:, 0]), X2[0, 0]
Out[13]: (3.75, 3.75)
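The same check can be extended to every column at once. The sketch below assumes NumPy's nan-aware np.nanmean and the imputer's fitted statistics_ attribute; the names col_means, mask, and filled_ok are illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[np.nan, 0,      3],
              [2,      9,     -8],
              [1,      np.nan, 1],
              [5,      2,      4],
              [7,      6,     -3]])

imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)

# Column means that ignore the missing entries
col_means = np.nanmean(X, axis=0)

# The fitted imputer stores one fill value per column
print(imp.statistics_)

# Every entry that was missing should now hold its column's mean
mask = np.isnan(X)
filled_ok = np.allclose(X2[mask], np.broadcast_to(col_means, X.shape)[mask])
print(filled_ok)
```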

The median strategy uses the same code, with only the strategy argument changed:

In [14]: imp = SimpleImputer(strategy='median')
...      X3 = imp.fit_transform(X)
...      X3
Out[14]: array([[ 3.5,  0. ,  3. ],
                [ 2. ,  9. , -8. ],
                [ 1. ,  4. ,  1. ],
                [ 5. ,  2. ,  4. ],
                [ 7. ,  6. , -3. ]])

Let's double-check the math one more time. This time, we won't calculate the mean of the first column but the median (without including X[0, 0]), and we will compare the result to X3[0, 0]. We find that the two values are the same, convincing us that the imputer works as expected:

In [15]: np.median(X[1:, 0]), X3[0, 0]
Out[15]: (3.5, 3.5)
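Because the imputer learns its fill values during fitting, those values can be reused on data it has not seen before. The following is a minimal sketch of that idea; the array X_new is hypothetical and stands in for any later batch of samples with the same three features:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[np.nan, 0,      3],
              [2,      9,     -8],
              [1,      np.nan, 1],
              [5,      2,      4],
              [7,      6,     -3]])

# Learn the per-column medians from X only
imp = SimpleImputer(strategy='median')
imp.fit(X)
print(imp.statistics_)

# Hypothetical new samples with missing entries of their own
X_new = np.array([[np.nan, 1,      np.nan],
                  [4,      np.nan, 0]])

# The gaps are filled with the medians learned from X,
# not with statistics computed from X_new
X_new_filled = imp.transform(X_new)
print(X_new_filled)
```

Separating fit from transform in this way keeps the fill values consistent between the data used for fitting and any data processed afterward.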
