Missing data

Missing data often appears in real-life datasets: sometimes values are missing at random, but more often they are missing because of some bias in how they were recorded or treated. All linear models work on complete numeric matrices and cannot deal directly with missing values; consequently, it is up to you to feed the algorithm suitable data to process.

Even if your initial dataset presents no missing data, you can still encounter missing values in the production phase. In such a case, the best strategy is to deal with them passively, as presented at the beginning of the chapter, by standardizing all the numeric variables.
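
As a quick illustration of the passive approach, standardization centers a variable on zero, so replacing any missing value that shows up in production with a zero amounts to imputing the training mean. The following is a minimal sketch with made-up numbers (in a real project, mu and sigma would come from your training data):

In: import numpy as np
  training = np.array([1., 2., 3., 4., 5.])
  mu, sigma = training.mean(), training.std()
  production = np.array([2.5, np.nan])
  standardized = (production - mu) / sigma
  standardized[np.isnan(standardized)] = 0.0  # zero coincides with the training mean
  print (standardized)

Out: [-0.35355339  0.        ]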

Tip

As for indicator variables, a possible strategy for passively intercepting missing values is to encode the presence of the label as 1 and its absence as -1, reserving the zero value for missing values.
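
A minimal sketch of this encoding, using illustrative label values, could be the following:

In: import numpy as np
  labels = np.array(['yes', 'no', None, 'yes'], dtype=object)
  indicator = np.zeros(len(labels))  # zero is reserved for missing values
  indicator[labels == 'yes'] = 1.0   # presence of the label
  indicator[labels == 'no'] = -1.0   # absence of the label
  print (indicator)

Out: [ 1. -1.  0.  1.]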

When missing values are present from the beginning of the project, it is certainly better to deal with them explicitly, trying to figure out whether there is any systematic pattern behind them. In NumPy arrays, upon which both the Pandas and Scikit-learn packages are built, missing values are marked by a special value, Not a Number (NaN), which can be reproduced using the NumPy constant np.nan.

Creating a toy array with a missing value is easy:

In: import numpy as np
  example = np.array([1,2,np.nan,4,5])
  print (example)

Out: [  1.   2.  nan   4.   5.]

It is also easy to discover where the missing values are in a vector (the result is a vector of Booleans):

In: print (np.isnan(example))

Out: [False False  True False False]

Replacing all the missing elements can be done easily by slicing or by the nan_to_num function, which turns every nan into a zero:

In: print (np.nan_to_num(example))

Out: [ 1.  2.  0.  4.  5.]

Using slicing, you could instead replace missing values with something more sophisticated than a constant, such as the mean of the valid elements in the vector:

In: missing = np.isnan(example)
  replacing_value = np.mean(example[~missing])
  example[missing] = replacing_value
  print (example)

Out: [ 1.  2.  3.  4.  5.]

Missing data imputation

Consistency of treatment between samples of data is essential when working with predictive models. If you replace missing values with a certain constant or a particular mean, the replacement should be consistent during both the training and the production phase. The Scikit-learn package offers the Imputer class in the preprocessing module, which can learn the replacement value by means of the fit method and then consistently apply it by means of the transform method.

Let's start demonstrating it by putting some missing values into the Boston dataset:

In: from random import sample, seed
  import numpy as np
  seed(19)
  Xm = X.copy()
  missing = sample(range(len(y)), len(y)//4)
  Xm[missing,5] = np.nan
  print ("Header of Xm[:,5] : %s" % Xm[:10,5])

Out: Header of Xm[:,5] : [ 6.575    nan  7.185    nan  7.147  6.43   6.012  6.172    nan  6.004]

Tip

Due to the random nature of the sampling process, your selection of missing cases could in principle differ; however, the exercise sets a seed, so you can count on obtaining the same results on your PC.

Now about a quarter of the observations in the variable are missing. Let's use Imputer to replace them with a mean:

In: from sklearn.preprocessing import Imputer
  impute = Imputer(missing_values='NaN', strategy='mean', axis=1)
  print ("Header of imputed Xm[:,5] : %s" % impute.fit_transform(Xm[:,5])[0][:10])

Out: Header of imputed Xm[:,5] : [ 6.575    6.25446  7.185    6.25446  7.147    6.43     6.012    6.172  6.25446  6.004  ]

Imputer allows you to define any value as missing (sometimes, in a re-elaborated dataset, missing values could be encoded as negative or other extreme values) and to choose strategies alternative to the mean: median and most_frequent (the mode). The median is useful if you suspect that outlying values are influencing and biasing the average (in house prices, a few very expensive and exclusive houses or areas could do exactly that). The mode, the most frequent value, is instead the optimal choice if you are working with discrete values (for instance, a sequence of integers spanning a limited range).
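
Switching strategy is just a matter of changing a parameter. The following sketch, reusing the Xm matrix created above, imputes the same variable by its median (strategy='most_frequent' would analogously pick the mode); the printed header will show the median of the valid elements in place of every nan:

In: from sklearn.preprocessing import Imputer
  impute_median = Imputer(missing_values='NaN', strategy='median', axis=1)
  print ("Header of imputed Xm[:,5] : %s" % impute_median.fit_transform(Xm[:,5])[0][:10])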

Keeping track of missing values

If you suspect that there is some bias in the missing value pattern, by imputing them you will lose any trace of it. Before imputing, a good practice is to create a binary variable recording where all missing values were and to add it as a feature to the dataset. As seen before, it is quite easy using NumPy to create such a new feature, transforming the Boolean vector created by isnan into a vector of integers:

In: missing_indicator = np.isnan(Xm[:,5]).astype(int)
  print ("Header of missing indicator : %s" \% missing_indicator[:10])

Out: Header of missing indicator : [0 1 1 0 0 0 0 0 1 1]

The linear regression model will create a coefficient for this indicator of missing values and, if any pattern exists behind them, its informative value will be captured by that coefficient.
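
Adding the indicator to the dataset is then just a matter of stacking it as an extra column before fitting the model (a minimal sketch; the Xm_extended name is illustrative, and you would typically stack the indicator onto the imputed version of the matrix):

In: Xm_extended = np.column_stack((Xm, missing_indicator))
  print ("Shape of the extended dataset : %s" % str(Xm_extended.shape))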
