Preprocessing

In the previous chapter, we did a form of data preprocessing by filtering out stopwords. Some machine learning algorithms have trouble with data that is not distributed as a Gaussian with a mean of 0 and variance of 1. The sklearn.preprocessing module helps with this issue, and we demonstrate it in this section.

We will preprocess meteorological data from the Dutch KNMI institute (original data for the De Bilt weather station from http://www.knmi.nl/climatology/daily_data/datafiles3/260/etmgeg_260.zip). The data is a single column of the original data file and contains daily rainfall values. It is stored in the .npy format discussed in Chapter 5, Retrieving, Processing, and Storing Data, so we can load it directly into a NumPy array. The values are integers that we have to multiply by 0.1 to get the daily precipitation amounts in mm.
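As a minimal sketch of the loading step, the following round-trips a small integer array through the .npy format and converts it to mm. The sample values and the file name sample_rain.npy are made up for illustration and are not taken from the KNMI file; the code is written for Python 3.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample: daily rainfall stored as integer tenths of a mm.
raw = np.array([0, 12, 35, 0, 7])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample_rain.npy")  # illustrative file name
    np.save(path, raw)       # write the array in .npy format
    loaded = np.load(path)   # read it back as a NumPy array

rain_mm = 0.1 * loaded       # convert tenths of a mm to mm
print(rain_mm)
```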

The data has the somewhat quirky feature that values below 0.05 mm are quoted as -1. We will set those values equal to 0.025 (0.05 divided by 2). Values are missing for some days in the original data: about a year at the beginning of the century and a couple of days later in the century. We will ignore the missing data completely, which we can afford to do because we have plenty of data points as it is. The preprocessing module has an Imputer class with default strategies to deal with missing values, but those strategies seem inappropriate in this case. Data analysis is about looking through data as if it were a window to knowledge. Data cleaning and imputing can make that window nicer to look through, but we should be careful not to distort the original data too much.
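The cleaning rules just described can be sketched as follows. The sample values are hypothetical, and using np.nan to mark a missing day is an assumption made for illustration; it is not necessarily how the KNMI file encodes missing values. The code is written for Python 3.

```python
import numpy as np

# Hypothetical raw values in tenths of a mm. np.nan marks a missing day here
# (an assumption for this sketch, not the format of the KNMI file).
raw = np.array([0.0, -1.0, 12.0, np.nan, 35.0, -1.0])

observed = raw[~np.isnan(raw)]  # ignore missing days entirely
rain = 0.1 * observed           # convert to mm
rain[rain < 0] = 0.05 / 2       # the -1 marker becomes 0.025 mm
print(rain)
```

Dropping the missing days rather than imputing them keeps the remaining values exactly as measured, at the cost of a slightly shorter series.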

The main feature for the machine learning examples will be an array of day-of-the-year values (1 to 366). This should help explain any seasonal effects.
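One way to build such a feature is sketched below using only the standard library. The helper day_of_year_feature and the start date are hypothetical, and the sketch assumes one observation per consecutive calendar day; with missing days removed, the real feature would have to be derived from the actual observation dates. The code is written for Python 3.

```python
from datetime import date, timedelta


def day_of_year_feature(start, n_days):
    """Return the day-of-the-year (1 to 366) for n_days consecutive dates."""
    return [(start + timedelta(days=i)).timetuple().tm_yday
            for i in range(n_days)]


# Illustrative start date chosen to cross the end of a leap year.
days = day_of_year_feature(date(2000, 12, 30), 4)
print(days)  # [365, 366, 1, 2]
```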

The mean, variance, and output from the Anderson-Darling test (see Chapter 3, Statistics and Linear Algebra) are printed as follows:

Rain mean 2.17919594267
Rain variance 18.803443919
Anderson rain (inf, array([ 0.576,  0.656,  0.787,  0.918,  1.092]), array([ 15. ,  10. ,   5. ,   2.5,   1. ]))

We can safely conclude that the data doesn't have a mean of 0 and variance of 1, and that it does not conform to a normal distribution. A large percentage of the values are 0, corresponding to days on which it didn't rain, and large amounts of rain are increasingly rare (which is a good thing). The distribution is therefore highly asymmetric and not Gaussian. We can, however, easily arrange for a mean of 0 and variance of 1 by scaling the data with the scale() function:

scaled = preprocessing.scale(rain)

We now get the required values for the mean and variance, but the data distribution remains asymmetric:

Scaled mean 3.41301602808e-17
Scaled variance 1.0
Anderson scaled (inf, array([ 0.576,  0.656,  0.787,  0.918,  1.092]), array([ 15. ,  10. ,   5. ,   2.5,   1. ]))
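To see why scaling fixes the mean and variance but not the shape, it can help to standardize by hand. With its default settings, scale() subtracts the mean and divides by the standard deviation; the sketch below does the same in plain NumPy on a small hypothetical sample and compares the skewness before and after. The skewness() helper is defined here for illustration only, and the code is written for Python 3.

```python
import numpy as np


def skewness(x):
    """Sample skewness: the third standardized moment."""
    centered = x - x.mean()
    return (centered ** 3).mean() / x.std() ** 3


# Hypothetical rainfall-like sample: many dry days, a few wet ones.
rain = np.array([0.0, 0.0, 0.0, 0.0, 0.025, 0.3, 1.2, 3.5, 8.0, 21.4])

# Standardize: subtract the mean, divide by the population standard deviation.
scaled = (rain - rain.mean()) / rain.std()

print("Scaled mean", scaled.mean())
print("Scaled variance", scaled.var())
print("Skewness before", skewness(rain))
print("Skewness after", skewness(scaled))
```

Because standardization is a shift followed by a positive rescaling, the skewness is the same before and after, which is consistent with the distribution remaining asymmetric.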

Sometimes, we want to convert numerical feature values into Boolean values. This is often used in text analysis in order to simplify computation. Perform the conversion with the binarize() function:

binarized = preprocessing.binarize(rain)
print np.unique(binarized), binarized.sum()

By default, a new array is created; we could also have chosen to perform the operation in-place. The default threshold is zero, meaning that values greater than 0 are replaced by 1 and all other values by 0:

[ 0.  1.] 24594.0
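The thresholding rule can be sketched in plain NumPy as follows. The function binarize_sketch is a hypothetical stand-in written for illustration, not the library's implementation, and the sample values are made up. The code is written for Python 3.

```python
import numpy as np


def binarize_sketch(values, threshold=0.0):
    """Return 1.0 where values > threshold and 0.0 elsewhere, as a new array."""
    return (values > threshold).astype(float)


rain = np.array([0.0, 0.025, 0.0, 1.2, 3.5])  # hypothetical values in mm

wet = binarize_sketch(rain)                   # default threshold of zero
heavy = binarize_sketch(rain, threshold=1.0)  # only days with more than 1 mm
print(np.unique(wet), wet.sum())
print(heavy)
```

With the default threshold, the sum of the binarized array is simply the number of days with a nonzero amount of rain; raising the threshold counts only the wetter days.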

The LabelBinarizer class can label integers as classes (in the context of classification):

lb = preprocessing.LabelBinarizer()
lb.fit(rain.astype(int))
print lb.classes_

The output is a list of integers from 0 to 62. Refer to the preproc.py file in this book's code bundle:

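import numpy as np
from sklearn import preprocessing
from scipy.stats import anderson


# Load the rainfall column; values are stored in tenths of a mm
rain = np.load('rain.npy')
rain = .1 * rain
# Amounts below 0.05 mm are quoted as -1; replace them with 0.025 mm
rain[rain < 0] = .05/2
print "Rain mean", rain.mean()
print "Rain variance", rain.var()
print "Anderson rain", anderson(rain)

# Standardize to a mean of 0 and variance of 1
scaled = preprocessing.scale(rain)
print "Scaled mean", scaled.mean()
print "Scaled variance", scaled.var()
print "Anderson scaled", anderson(scaled)

# Convert to 0/1 values using the default threshold of zero
binarized = preprocessing.binarize(rain)
print np.unique(binarized), binarized.sum()

# Treat the truncated integer amounts as class labels
lb = preprocessing.LabelBinarizer()
lb.fit(rain.astype(int))
print lb.classes_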
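To make the LabelBinarizer step more concrete, the sketch below mimics in plain NumPy what fitting and transforming amount to when there are more than two classes: the sorted distinct labels become the classes, and each label is turned into a row of 0/1 indicators. This is an illustration on hypothetical values, not the library's implementation. Note that astype(int) truncates toward zero, so the rainfall amounts end up grouped into whole-millimeter classes. The code is written for Python 3.

```python
import numpy as np

# Hypothetical rainfall amounts in mm; astype(int) truncates to 0, 0, 1, 3, 3, 1.
labels = np.array([0.0, 0.025, 1.2, 3.5, 3.9, 1.7]).astype(int)

classes = np.unique(labels)  # sorted distinct labels
# One row per sample, one indicator column per class.
one_hot = (labels[:, np.newaxis] == classes).astype(int)
print(classes)
print(one_hot)
```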