Greedy selection of features

By following our experiments throughout the book, you may have noticed that adding new variables to a linear regression model always looks like a success. That's especially true for the training error, and it happens not just when we insert the right variables but also when we place the wrong ones. Puzzlingly, even redundant or useless variables have a more or less positive impact on the fit of the model.

The reason is easily explained: since regression models are high-bias models, augmenting their complexity by increasing the number of coefficients they use helps them fit the training data more closely. Some of the new coefficients end up fitting the noise and other idiosyncrasies present in the data. It is precisely the memorization/overfitting effect we discussed before. When you have as many coefficients as observations, your model becomes saturated (that's the technical term used in statistics) and you can obtain a perfect in-sample prediction because, basically, there is a coefficient available to learn every single response in the training set.
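
As a quick sketch of saturation in action (a toy illustration, separate from the Boston example that follows; the X_noise and y_noise arrays are invented for the purpose), fitting a linear regression on pure random noise with as many coefficients as observations should report an in-sample R-squared of essentially 1.0:

import numpy as np
from sklearn.linear_model import LinearRegression

rnd = np.random.RandomState(0)
# 20 observations and 20 purely random features: a saturated model
X_noise = rnd.normal(size=(20, 20))
y_noise = rnd.normal(size=20)
saturated = LinearRegression().fit(X_noise, y_noise)
# With at least as many coefficients as cases, the fit is a perfect
# memorization of the training responses
print ('In-sample R-squared on random data: %0.3f' %
       saturated.score(X_noise, y_noise))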

Let's make this concept more concrete with a quick example using a training set (in-sample observations) and a test set (out-sample observations). Let's start by finding out how many cases and features we have and what the baseline performance is (for both in-sample and out-sample):

In: from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=3)
lm = LinearRegression()
lm.fit(X_train,y_train)
print ('Train (cases, features) = %s' % str(X_train.shape))
print ('Test  (cases, features) = %s' % str(X_test.shape))
print ('In-sample  mean squared error %0.3f' % mean_squared_error(
        y_train,lm.predict(X_train)))
print ('Out-sample mean squared error %0.3f' % mean_squared_error(
        y_test,lm.predict(X_test)))

Out:   Train (cases, features) = (354, 13)
Test  (cases, features) = (152, 13)
In-sample  mean squared error 22.420
Out-sample mean squared error 22.440

Tip

The best approach for such an experiment would be cross-validation or a bootstrap, not just a plain train/test split, but we want to keep things fast, which is why we settled for this solution. We assure you that using more sophisticated estimation techniques doesn't change the outcome of the experiment.
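
If you want to check this claim yourself, here is a minimal sketch of how the same baseline could be estimated by 10-fold cross-validation (it assumes the Boston X, y, and the lm estimator defined above; with recent scikit-learn versions, the imports move to sklearn.model_selection and the scorer is called 'neg_mean_squared_error'):

import numpy as np
from sklearn.cross_validation import cross_val_score, KFold

# 10-fold cross-validation of the plain linear model on the whole dataset;
# the 'mean_squared_error' scorer returns negated values, hence np.abs
cv_iterator = KFold(n=len(X), n_folds=10, shuffle=True, random_state=101)
cv_mse = np.abs(cross_val_score(lm, X, y, cv=cv_iterator,
                                scoring='mean_squared_error'))
print ('Cross-validated mean squared error: %0.3f' % np.mean(cv_mse))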

Therefore, we have similar in-sample and out-sample errors. We can start working on improving our model using polynomial expansions:

In: from sklearn.preprocessing import PolynomialFeatures
second_order=PolynomialFeatures(degree=2, interaction_only=False)
third_order=PolynomialFeatures(degree=3, interaction_only=True)

First, we apply the second-order polynomial expansion:

In: lm.fit(second_order.fit_transform(X_train), y_train)
print ('(cases, features) = %s' % str(
        second_order.fit_transform(X_train).shape))
print ('In-sample  mean squared error %0.3f' % mean_squared_error(
        y_train, lm.predict(second_order.fit_transform(X_train))))
print ('Out-sample mean squared error %0.3f' % mean_squared_error(
        y_test, lm.predict(second_order.fit_transform(X_test))))
Out:   (cases, features) = (354, 105)
In-sample  mean squared error 5.522
Out-sample mean squared error 12.034

It seems that the good in-sample result has little correspondence with the out-sample test. Though the out-sample performance has improved, the gap between the two is a clear sign of overfitting: the model now has some more useful coefficients, but most of them are just there to catch the noise in the data.

Now let's go to extremes and test the third-degree polynomial expansion (using only interaction terms, though):

In: lm.fit(third_order.fit_transform(X_train), y_train)
print ('(cases, features) = %s' % str(
        third_order.fit_transform(X_train).shape))
print ('In-sample  mean squared error %0.3f' % mean_squared_error(
        y_train, lm.predict(third_order.fit_transform(X_train))))
print ('Out-sample mean squared error %0.3f' % mean_squared_error(
        y_test, lm.predict(third_order.fit_transform(X_test))))

Out:   (cases, features) = (354, 378)
In-sample  mean squared error 0.438
Out-sample mean squared error 85777.890

Now, clearly, something very bad has happened to our model. Having more coefficients than observations (p > n), we achieved a perfect fit on the training set. However, on the out-sample validation, our model performs about as well as a random number generator. In the next few paragraphs, we will show you how to take advantage of an increased number of features without incurring any of the problems demonstrated by the previous code snippets.

The Madelon dataset

For the task of selecting the best subset of variables among many noisy and collinear ones, we decided to accompany our usual Boston housing dataset with a trickier one, the Madelon dataset (https://archive.ics.uci.edu/ml/datasets/Madelon). It is an artificial dataset (that is, generated by an algorithm) presented at NIPS 2003 (the seventeenth Annual Conference on Neural Information Processing Systems) during a contest on feature selection.

The dataset is particularly challenging because it was generated by placing 32 distinct clusters of points (16 from the positive class, 16 from the negative one) on the vertices of a five-dimensional hypercube. The resulting 500 features and 2,000 cases were extracted from various transformations of the five metric dimensions. To make things harder, random numbers were added to the features to act as noise, and 1% of the responses were flipped. All these intricate transformations make modeling quite difficult, especially for linear models, since the relationship of most of the features with the response is definitely non-linear. This is really helpful for our exemplification because it clearly demonstrates how directly including all the features is detrimental to the accuracy of out-of-sample predictions.

To download the dataset and make it available on your computer, carry out the following instructions and allow some time for your computer to fetch the data from the external website where it is stored:

In: try:
        import urllib.request as urllib2
    except ImportError:
        import urllib2
    import numpy as np
    train_data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.data'
    validation_data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_valid.data'
    train_response = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.labels'
    validation_response = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/madelon_valid.labels'
    try:
        # Download the four files directly from the UCI repository
        Xt = np.loadtxt(urllib2.urlopen(train_data))
        yt = np.loadtxt(urllib2.urlopen(train_response))
        Xv = np.loadtxt(urllib2.urlopen(validation_data))
        yv = np.loadtxt(urllib2.urlopen(validation_response))
    except:
        # In case downloading the data doesn't work,
        # just manually download the files into the working directory
        Xt = np.loadtxt('madelon_train.data')
        yt = np.loadtxt('madelon_train.labels')
        Xv = np.loadtxt('madelon_valid.data')
        yv = np.loadtxt('madelon_valid.labels')

After loading both the training and validation sets, we can start exploring some of the information available:

In: print ('Training set: %i observations %i features' % (Xt.shape))
    print ('Validation set: %i observations %i features' % (Xv.shape))
Out:  Training set: 2000 observations 500 features
    Validation set: 600 observations 500 features

Naturally, we won't touch the validation set (we won't even glance at it, or that would be snooping), but we can try to figure out the situation with the training set:

In: from scipy.stats import describe
    print (describe(Xt))

The output is quite lengthy and arranged in matrix form (therefore it is not reported here), but it tells us everything about the mean, min, max, variance, skewness, and kurtosis of each feature in the dataset. A quick glance through it doesn't reveal anything special; it does show, however, that all the variables have an approximately normal distribution and a limited range of values.
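
If you prefer a compact summary over the full printout, a possible sketch is the following (it assumes a SciPy version whose describe returns a named result with mean, variance, and minmax attributes):

# Summarize the per-feature statistics instead of printing the full matrix
stats = describe(Xt)
print ('Feature means     : min %0.2f max %0.2f' % (stats.mean.min(), stats.mean.max()))
print ('Feature variances : min %0.2f max %0.2f' % (stats.variance.min(), stats.variance.max()))
print ('Value range       : min %0.2f max %0.2f' % (stats.minmax[0].min(), stats.minmax[1].max()))

We can proceed with our exploration using a graphical representation of the correlations among the variables: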

import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

def visualize_correlation_matrix(data, hurdle = 0.0):
    R = np.corrcoef(data, rowvar=0)
    R[np.where(np.abs(R)<hurdle)] = 0.0
    heatmap = plt.pcolor(R, cmap=mpl.cm.coolwarm, alpha=0.8)
    heatmap.axes.set_frame_on(False)
    plt.xticks(rotation=90)
    plt.tick_params(axis='both', which='both', bottom='off',
                  top='off', left = 'off',right = 'off')
    plt.colorbar()
    plt.show()

visualize_correlation_matrix(Xt[:,100:150], hurdle=0.0)

Check the following screenshot:

The Madelon dataset

After glancing at a portion of the features and their respective correlations, we notice that just a couple of them are significantly correlated, whereas the others are only mildly related. This gives the impression of noisy relationships among them, which makes an effective selection quite complicated.
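
To put a number on that impression beyond the heatmap, a possible sketch is to count how many feature pairs exceed a given absolute correlation (the 0.9 and 0.5 cut-offs are arbitrary choices for illustration):

import numpy as np

# Count feature pairs whose absolute correlation exceeds a cut-off;
# only the upper triangle is kept to avoid counting each pair twice
R = np.corrcoef(Xt, rowvar=0)
upper = np.triu(np.abs(R), k=1)
print ('Pairs with |correlation| > 0.9: %i' % np.sum(upper > 0.9))
print ('Pairs with |correlation| > 0.5: %i' % np.sum(upper > 0.5))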

As a last step, we check how a simple logistic regression model would score in terms of the error measured using the area under the curve metric.

Tip

Area under the curve (AUC) is a measure derived from comparing the rate of correct positive results against the rate of incorrect ones at different classification thresholds. It is a bit tricky to calculate, so we suggest always relying on the roc_auc_score function from the sklearn.metrics module.

Logistic regression classifies an observation as positive if its predicted probability is over 0.5, the default and usually reasonable cut-off, but we can freely change that threshold. To increase the precision of a top selection of results, we just raise the threshold from 0.5 toward 1.0 (raising the threshold increases the accuracy within the selected range). If, instead, we intend to increase the total number of positive cases caught, we lower the threshold from 0.5 down toward 0.0 (lowering the threshold increases the coverage of positive cases in the selected range).

The AUC measure helps us determine whether our predictions are ordered properly, regardless of how precise the predicted probabilities are as values. Thus, AUC is the ideal measure for evaluating an algorithm used for selection. If the results are ordered properly by probability, no matter whether the estimated probabilities themselves are accurate, you can simply pick the right selection for your project by moving the 0.5 threshold, that is, by taking a certain number of the top results.
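
As a toy illustration of this trade-off (on synthetic data generated on the spot, not on Madelon; Xd, yd, and clf are names invented for the sketch), raising the cut-off trades coverage for precision:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# A synthetic binary problem, just to show the effect of moving the threshold
Xd, yd = make_classification(n_samples=1000, n_features=20, random_state=101)
clf = LogisticRegression().fit(Xd, yd)
probabilities = clf.predict_proba(Xd)[:, 1]
for threshold in [0.3, 0.5, 0.7]:
    predicted = (probabilities >= threshold).astype(int)
    print ('threshold %0.1f precision %0.3f recall %0.3f' % (
        threshold, precision_score(yd, predicted), recall_score(yd, predicted)))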

In our case, the baseline AUC measure is 0.602, a quite disappointing value since a random selection should bring us a 0.5 value (1.0 is the maximum possible):

In: from sklearn.cross_validation import cross_val_score
  from sklearn.linear_model import LogisticRegression
  logit = LogisticRegression()
  logit.fit(Xt,yt)

  from sklearn.metrics import roc_auc_score
  print ('Training area under the curve: %0.3f' % 
      roc_auc_score(yt,logit.predict_proba(Xt)[:,1]))
  print ('Validation area under the curve: %0.3f' % 
      roc_auc_score(yv,logit.predict_proba(Xv)[:,1]))

Out:   Training area under the curve: 0.824
    Validation area under the curve: 0.602

Univariate selection of features

Given this new, tricky dataset, we really have to deal with a large number of irrelevant or redundant features. The large difference between the in-sample and out-sample AUC scores from our last code snippet points exactly at this problem; hopefully, after removing the useless features, our model will improve its performance.

In addition, if interpretability of your model matters to you, you really should remove the non-useful variables, striving for the simplest possible form of your linear model as dictated by Occam's razor, the commonplace scientific practice of favoring simpler solutions over more complex ones when the difference in their performance is not marked.

Feature selection can help increase both the model's out-sample performance and its human readability by retaining only the most predictive set of variables, in some cases just the individually best ones and in others the set that works best in unison.

There are quite a few feature selection methods. The simplest approach is the univariate method, which evaluates how good a variable is by estimating its predictive value when taken alone with respect to the response.

This usually involves using statistical tests, and Scikit-learn offers three possible tests:

  • The f_regression function, which works out an F-test (a statistical test for comparing different regression solutions) and a p-value (interpretable as the probability of observing such a difference by chance) and reveals the best features for a regression problem
  • The f_classif function, an ANOVA F-test (a statistical test for comparing differences among class means), a related statistical method that proves useful for classification problems
  • The chi2 function, a chi-squared test (a statistical test on count data), a good choice when your problem is classification and your features are counts or binary values (in every case, non-negative numbers such as units sold or money earned)

All these tests output a score and a p-value. High scores, confirmed by small p-values (under 0.05, indicating a low probability that the score was obtained by chance), confirm that a certain variable is useful for predicting your target.
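
As a quick hedged sketch, before wrapping the test into a selector, you can obtain the scores and p-values directly from f_classif on the Madelon training data:

import numpy as np
from sklearn.feature_selection import f_classif

# F-scores and p-values of every feature taken alone against the response
F_scores, p_values = f_classif(Xt, yt)
print ('Features with p-value < 0.05: %i' % np.sum(p_values < 0.05))
print ('Highest F-score: %0.3f' % F_scores.max())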

In our example, we will use f_classif (since we are now working on a classification problem) and we will have the SelectPercentile selector help us by keeping a certain percentage of the highest-scoring features:

In: from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=50)
selector.fit(Xt,yt)
variable_filter = selector.get_support()

After selecting the upper half, hoping to have cut off the most irrelevant features while keeping the important ones, we plot the scores on a histogram to reveal their distribution:

In: plt.hist(selector.scores_, bins=50, histtype='bar')
plt.grid()
plt.show()

Look at the following screenshot:

Univariate selection of features

Noticeably, most scores are near zero, with just a few high-ranking ones. Now we pick the features we assume to be important by directly applying a threshold on the scores, chosen empirically for its convenience:

  In: variable_filter = selector.scores_ > 10
  print ("Number of filtered variables: %i" %  np.sum(variable_filter))
  from sklearn.preprocessing import PolynomialFeatures
  interactions = PolynomialFeatures(degree=2, interaction_only=True)
  Xs = interactions.fit_transform(Xt[:,variable_filter])
  print ("Number of variables and interactions: %i" % Xs.shape[1])

Out:   Number of filtered variables: 13
Number of variables and interactions: 92

Now, we have reduced our dataset to just the core features. At this point, it does make sense to test a polynomial expansion and try to automatically catch any relevant non-linear relationship in our model:

In: logit.fit(Xs,yt)
Xvs = interactions.fit_transform(Xv[:,variable_filter])
print ('Validation area under the curve ' + 
       'before recursive selection: %0.3f' % 
       roc_auc_score(yv,logit.predict_proba(Xvs)[:,1]))

Out:   Validation area under the curve before 
recursive selection: 0.808

The resulting validation (out-sample) score is about 0.81, a very promising value compared to the initial 0.602 obtained on the validation set with all the features (and close to the overfitted 0.824 obtained in-sample). Of course, we can decide to stop here or keep filtering even the polynomial expansion; feature selection is really a never-ending job, though after a certain point you have to realize that only slightly incremental gains are possible from further tuning.
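
If you decide to keep going, a possible sketch of filtering the expanded interactions with the same univariate approach follows (interaction_selector and the 50 percent cut are just illustrative choices, not a tuned solution):

from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import roc_auc_score

# Univariate filtering applied to the already expanded matrix of interactions
interaction_selector = SelectPercentile(f_classif, percentile=50)
Xs_reduced  = interaction_selector.fit_transform(Xs, yt)
Xvs_reduced = interaction_selector.transform(Xvs)
logit.fit(Xs_reduced, yt)
print ('Validation AUC after filtering interactions: %0.3f' %
       roc_auc_score(yv, logit.predict_proba(Xvs_reduced)[:,1]))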

Recursive feature selection

The only problem with univariate selection is that it decides the best features by considering each one separately from the others, without verifying how they work together in unison. Consequently, redundant variables (due to collinearity) are frequently picked.

A multivariate approach, such as recursive elimination, can avoid this problem; however, it is more computationally expensive.

Recursive elimination works by starting from the full model, ranking the variables by their importance for the estimator (for a linear model, the magnitude of their coefficients), and pruning the least important ones step by step; the effect of each removal is evaluated by cross-validation. If certain variables have a negligible effect on the model's performance, the algorithm simply prunes them. The process stops when any further removal is proven to hurt the model's ability to predict correctly.
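
The core ranking-and-pruning idea can be sketched with Scikit-learn's plain RFE class (a hedged illustration that assumes the Boston X_train and y_train from our earlier split; the choice of keeping 5 features is arbitrary):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Drop one feature at a time, ranked by the size of its coefficient,
# until only the 5 strongest predictors are left
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5, step=1)
rfe.fit(X_train, y_train)
print ('Selected features mask : %s' % str(rfe.support_))
print ('Elimination ranking    : %s' % str(rfe.ranking_))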

Here is a demonstration of how RFECV, Scikit-learn's implementation of recursive elimination, works. We will use the Boston dataset enhanced by second-degree polynomial expansion, thus working on a regression problem this time:

In: from sklearn.feature_selection import RFECV
from sklearn.cross_validation import KFold
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lm = LinearRegression()
cv_iterator = KFold(
    n=len(X_train), n_folds=10, shuffle=True, random_state=101)
recursive_selector = RFECV(estimator=lm, step=1, cv=cv_iterator,
                           scoring='mean_squared_error')
recursive_selector.fit(second_order.fit_transform(X_train), y_train)
print ('Initial number of features : %i' % 
       second_order.fit_transform(X_train).shape[1])
print ('Optimal number of features : %i' % 
       recursive_selector.n_features_)

Out:   Initial number of features : 105
Optimal number of features : 52

Given an estimator (our model), a cross-validation iterator, and an error measure, RFECV finds out, after a while, that about half of the features can be dropped from the model without fear of worsening its performance:

In: essential_X_train = recursive_selector.transform(
    second_order.fit_transform(X_train))
essential_X_test  = recursive_selector.transform(
    second_order.fit_transform(X_test))
lm.fit(essential_X_train, y_train)
print ('cases = %i features = %i' % essential_X_test.shape)
print ('In-sample  mean squared error %0.3f' %  
mean_squared_error(y_train,lm.predict(essential_X_train)))
print ('Out-sample mean squared error %0.3f' % 
mean_squared_error(y_test,lm.predict(essential_X_test)))

Out:   cases = 152 features = 52
In-sample  mean squared error 7.834
Out-sample mean squared error 11.523

A check on the test set reveals that the out-sample error is now 11.5, an improvement over the 12.0 obtained using the full second-order expansion. For further confirmation, we can also run a cross-validation and obtain a similar result:

In: from sklearn.cross_validation import StratifiedKFold
edges = np.histogram(y, bins=5)[1]
binning = np.digitize(y, edges)
stratified_cv_iterator = StratifiedKFold(
    binning, n_folds=10, shuffle=True, random_state=101)
essential_X = recursive_selector.transform(
    second_order.fit_transform(X))
cv_score = cross_val_score(
    lm, essential_X, y, cv=stratified_cv_iterator, 
    scoring='mean_squared_error', n_jobs=1)
print ('Cv score: mean %0.3f std %0.3f' % (
    np.mean(np.abs(cv_score)), np.std(cv_score)))

Out: Cv score: mean 11.400 std 3.779