In the previous chapter, we did a form of data preprocessing by filtering out stopwords. Some machine learning algorithms have trouble with data that is not distributed as a Gaussian with a mean of 0 and a variance of 1. The sklearn.preprocessing
module takes care of this issue, and we will demonstrate it in this section. We will preprocess the meteorological data from the Dutch KNMI institute (original data for the De Bilt weather station from http://www.knmi.nl/climatology/daily_data/datafiles3/260/etmgeg_260.zip). The data is just one column of the original data file and contains daily rainfall values. It is stored in the .npy
format discussed in Chapter 5, Retrieving, Processing, and Storing Data. We can load the data into a NumPy array. The values are integers that we have to multiply by 0.1 to get the daily precipitation amounts in mm.
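The loading step can be sketched as follows; the rain.npy file comes from the book's code bundle, so a small made-up array stands in for it here to keep the snippet self-contained:

```python
import numpy as np

# A made-up stand-in for np.load('rain.npy'): raw values are integers
# in tenths of a millimeter, with -1 marking days below 0.05 mm.
raw = np.array([0, 12, -1, 3, 250], dtype=float)
rain = 0.1 * raw  # convert tenths of a millimeter to millimeters
print(rain.min(), rain.max())
```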
The data has the somewhat quirky feature that values below 0.05 mm are quoted as -1. We will set those values equal to 0.025 (0.05 divided by 2). Values are missing for some days in the original data: about a year at the beginning of the century and a couple of days later in the century. We will completely ignore the missing data, which we can do because we have a lot of data points as it is. The preprocessing
module has an Imputer
class with default strategies to deal with missing values. Those strategies, however, seem inappropriate in this case. Data analysis is about looking through data as if it is a window, a window to knowledge. Data cleaning and imputing are activities that can make our window nicer to look through. However, we should be careful not to distort the original data too much.
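For completeness, here is a minimal sketch of what imputation would look like. Note that recent scikit-learn versions provide this functionality as the SimpleImputer class in sklearn.impute (the successor of Imputer), and the values below are made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up daily rainfall values (mm) with gaps encoded as NaN.
rain = np.array([[0.0], [1.2], [np.nan], [0.3], [np.nan], [2.5]])

# The default 'mean' strategy fills gaps with the column mean -
# exactly the kind of distortion of the original data the text warns about.
imputer = SimpleImputer(strategy='mean')
filled = imputer.fit_transform(rain)
print(filled.ravel())
```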
The main feature for the machine learning examples will be an array of day-of-the-year values (1 to 366). This should help explain any seasonal effects.
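Such a day-of-the-year feature can be derived from calendar dates with the standard library; the dates below are arbitrary examples:

```python
import datetime as dt
import numpy as np

# Arbitrary example dates; tm_yday gives the day of the year (1 to 366).
dates = [dt.date(2000, 1, 1), dt.date(2000, 3, 1), dt.date(2000, 12, 31)]
day_of_year = np.array([d.timetuple().tm_yday for d in dates])
print(day_of_year)  # 2000 is a leap year, so December 31 is day 366
```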
The mean, variance, and output from the Anderson-Darling test (see Chapter 3, Statistics and Linear Algebra) are printed as follows:
Rain mean 2.17919594267
Rain variance 18.803443919
Anderson rain (inf, array([ 0.576,  0.656,  0.787,  0.918,  1.092]), array([ 15. ,  10. ,   5. ,   2.5,   1. ]))
We can safely conclude that the data doesn't have a mean of 0 and a variance of 1, and it does not conform to a normal distribution. The data has a large percentage of 0 values corresponding to days on which it didn't rain, and large amounts of rain are increasingly rare (which is a good thing). The distribution is therefore completely asymmetric and not Gaussian. We can, however, easily arrange for a mean of 0 and a variance of 1 by scaling the data with the scale()
function:
scaled = preprocessing.scale(rain)
We now get the required values for the mean and variance, but the data distribution remains asymmetric:
Scaled mean 3.41301602808e-17
Scaled variance 1.0
Anderson scaled (inf, array([ 0.576,  0.656,  0.787,  0.918,  1.092]), array([ 15. ,  10. ,   5. ,   2.5,   1. ]))
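For a one-dimensional array, scale() amounts to subtracting the mean and dividing by the standard deviation; a quick sanity check with made-up numbers:

```python
import numpy as np
from sklearn import preprocessing

x = np.array([0.0, 1.2, 0.3, 2.5, 0.0])  # made-up rainfall values (mm)

scaled = preprocessing.scale(x)      # standardize via scikit-learn
manual = (x - x.mean()) / x.std()    # the same transformation by hand
print(np.allclose(scaled, manual), scaled.var())
```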
Sometimes, we want to convert numerical feature values into Boolean values. This is often used in text analysis in order to simplify computation. Perform the conversion with the binarize()
function:
binarized = preprocessing.binarize(rain)
print np.unique(binarized), binarized.sum()
By default, a new array is created; we could have also chosen to perform the operation in-place. With the default threshold of zero, values greater than zero are replaced by 1 and the rest by 0:
[ 0. 1.] 24594.0
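The binarize() function also accepts a threshold parameter. The 0.5 mm cutoff below is an arbitrary example, and recent scikit-learn versions expect a two-dimensional array, hence the reshape:

```python
import numpy as np
from sklearn import preprocessing

rain = np.array([0.0, 0.025, 1.2, 0.3, 25.0])  # made-up values in mm

# Flag only days with more than 0.5 mm of rain.
wet = preprocessing.binarize(rain.reshape(-1, 1), threshold=0.5)
print(wet.ravel())  # [0. 0. 1. 0. 1.]
```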
The LabelBinarizer
class can label integers as classes (in the context of classification):
lb = preprocessing.LabelBinarizer()
lb.fit(rain.astype(int))
print lb.classes_
The output is a list of integers from 0 to 62. Refer to the preproc.py
file in this book's code bundle:
import numpy as np
from sklearn import preprocessing
from scipy.stats import anderson

rain = np.load('rain.npy')
rain = .1 * rain
rain[rain < 0] = .05/2
print "Rain mean", rain.mean()
print "Rain variance", rain.var()
print "Anderson rain", anderson(rain)

scaled = preprocessing.scale(rain)
print "Scaled mean", scaled.mean()
print "Scaled variance", scaled.var()
print "Anderson scaled", anderson(scaled)

binarized = preprocessing.binarize(rain)
print np.unique(binarized), binarized.sum()

lb = preprocessing.LabelBinarizer()
lb.fit(rain.astype(int))
print lb.classes_
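Once fitted, the label binarizer can also one-hot encode values via transform(); a small sketch with made-up integer classes:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(np.array([0, 1, 2, 5]))  # made-up integer rainfall classes
print(lb.classes_)              # [0 1 2 5]
print(lb.transform([1, 5]))     # one row of indicator columns per value
```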