Missing values appear often in real-life data, sometimes at random, but more often because of some bias in how the data was recorded and handled. All linear models require complete numeric matrices and cannot deal directly with missing values; consequently, it is up to you to feed the algorithm suitable data to process.
Even if your initial dataset does not present any missing data, you can still encounter missing values in the production phase. In that case, the best strategy is to deal with them passively, as presented at the beginning of the chapter, by standardizing all the numeric variables.
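As a minimal sketch of that passive approach (assuming, as earlier in the chapter, that standardization centers each variable on its training mean), a standardized variable has mean zero, so replacing a missing value with zero amounts to replacing it with the mean; the data below is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature with no missing values at training time
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scaler = StandardScaler().fit(train)

# In production a missing value appears; after standardization,
# zero corresponds to the training mean, so nan -> 0 imputes the mean
production = np.array([[2.0], [np.nan], [4.0]])
scaled = scaler.transform(production)
imputed = np.nan_to_num(scaled)
print(imputed)
```

Note that this only works safely if the scaler was fitted on the training data and reused unchanged in production.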
When missing values are present from the beginning of the project, it is certainly better to deal with them explicitly, trying to figure out whether there is any systematic pattern behind them. In NumPy arrays, upon which both the pandas and Scikit-learn packages are built, missing values are marked by a special value, Not a Number (NaN), which is representable using the NumPy constant nan.
Creating a toy array with a missing value is easy:
In:
import numpy as np
example = np.array([1, 2, np.nan, 4, 5])
print (example)

Out:
[ 1.  2. nan  4.  5.]
It is just as easy to discover where the missing values are in a vector (the result is a vector of Booleans):
In:
print (np.isnan(example))

Out:
[False False  True False False]
Replacing all the missing elements is easy, either by slicing or by the nan_to_num function, which turns every NaN into a zero:
In:
print (np.nan_to_num(example))

Out:
[ 1.  2.  0.  4.  5.]
Using slicing, you could decide to use something more sophisticated than a constant, such as the mean of valid elements in the vector:
In:
missing = np.isnan(example)
replacing_value = np.mean(example[~missing])
example[missing] = replacing_value
print (example)

Out:
[ 1.  2.  3.  4.  5.]
Consistency of treatment between samples of data is essential when working with predictive models. If you replace the missing values with a certain constant or a particular mean, that replacement should be applied consistently during both the training and the production phase. The Scikit-learn package offers the Imputer class in the preprocessing module, which can learn a solution by the fit method and then consistently apply it by the transform one.
Let's start demonstrating it after putting some missing values in the Boston dataset:
In:
from random import sample, seed
import numpy as np
seed(19)
Xm = X.copy()
missing = sample(range(len(y)), len(y)//4)
Xm[missing,5] = np.nan
print ("Header of Xm[:,5] : %s" % Xm[:10,5])

Out:
Header of Xm[:,5] : [ 6.575 nan 7.185 nan 7.147 6.43 6.012 6.172 nan 6.004]
Now about a quarter of the observations in the variable should be missing. Let's use Imputer to replace them with the mean:
In:
from sklearn.preprocessing import Imputer
impute = Imputer(missing_values = 'NaN', strategy='mean', axis=1)
print ("Header of imputed Xm[:,5] : %s" % impute.fit_transform(Xm[:,5])[0][:10])

Out:
Header of imputed Xm[:,5] : [ 6.575 6.25446 7.185 6.25446 7.147 6.43 6.012 6.172 6.25446 6.004 ]
Imputer allows you to define any value as missing (sometimes in a re-elaborated dataset missing values are encoded with negative values or other extreme values) and to choose strategies other than the mean: the median and the mode (the most frequent value). The median is useful if you suspect that outlying values are biasing the average (in house prices, some very expensive and exclusive houses or areas could be the reason). The mode is instead the optimal choice if you are working with discrete values (for instance, a sequence of integer values with a limited range).
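In recent versions of Scikit-learn, the Imputer class shown above has been replaced by SimpleImputer in the sklearn.impute module, where the mode strategy is named 'most_frequent'; here is a minimal sketch of the two alternative strategies on made-up data (note how the outlying value 100 would have distorted a mean):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column: mostly small values plus one outlier and one NaN
data = np.array([[1.0], [1.0], [3.0], [100.0], [np.nan]])

# The median (2.0 here) resists the outlying value 100
median_imp = SimpleImputer(strategy='median')
print(median_imp.fit_transform(data)[-1])   # NaN replaced by 2.0

# The most frequent value (1.0 here) suits discrete variables
mode_imp = SimpleImputer(strategy='most_frequent')
print(mode_imp.fit_transform(data)[-1])     # NaN replaced by 1.0
```

Both imputers learn their replacement value with fit and reuse it on any later sample with transform, preserving the train/production consistency discussed above.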
If you suspect that there is some bias in the missing value pattern, by imputing the values you will lose any trace of it. Before imputing, a good practice is to create a binary variable recording where all the missing values were and to add it as a feature to the dataset. As seen before, it is quite easy using NumPy to create such a new feature by transforming the Boolean vector created by isnan into a vector of integers:
In:
missing_indicator = np.isnan(Xm[:,5]).astype(int)
print ("Header of missing indicator : %s" % missing_indicator[:10])

Out:
Header of missing indicator : [0 1 1 0 0 0 0 0 1 1]
The linear regression model will estimate a coefficient for this missing-value indicator and, if any systematic pattern lies behind the missingness, its informative value will be captured by that coefficient.
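A minimal sketch of that idea on synthetic data (the variables and effect sizes below are invented for illustration): impute the column, stack the missing indicator alongside it as an extra feature, and fit a linear regression; the model then estimates a separate coefficient for missingness:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=100)
miss = rng.rand(100) < 0.3
# Suppose missingness is itself informative: it shifts the target up by 5
y = 2 * x + 5 * miss + rng.randn(100)

# Observed version of the feature, with NaNs where values went missing
x_obs = x.copy()
x_obs[miss] = np.nan

# Record the pattern before imputing, then impute with the mean
indicator = np.isnan(x_obs).astype(int)
x_filled = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)

# One coefficient for the feature, one for the missingness indicator
X_design = np.column_stack([x_filled, indicator])
model = LinearRegression().fit(X_design, y)
print(model.coef_)
```

In this synthetic setup the indicator's coefficient recovers (approximately) the upward shift associated with missingness, information that plain imputation alone would have erased.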