Best practice 7 - decide whether to select features, and if so, how

In Chapter 6, Click-Through Prediction with Logistic Regression, we saw how feature selection was performed using L1-regularized logistic regression and random forest. The benefits of feature selection include:

  • Reducing the training time of prediction models, as redundant or irrelevant features are eliminated
  • Reducing overfitting, for the same reason
  • Likely improving performance, as prediction models learn from data containing more significant features

Note that we used the word likely because there is no absolute certainty that feature selection will increase prediction accuracy. It is therefore good practice to compare the performance obtained with feature selection against that obtained without it via cross-validation. As an example, in the following snippets we measure the effect of feature selection by estimating the average classification accuracy of an SVC model in a cross-validation manner:

First, we load the handwritten digits dataset from scikit-learn:

>>> from sklearn.datasets import load_digits
>>> dataset = load_digits()
>>> X, y = dataset.data, dataset.target
>>> print(X.shape)
(1797, 64)

Next, estimate the accuracy on the original dataset, which is 64-dimensional:

>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import cross_val_score
>>> classifier = SVC(gamma=0.005)
>>> score = cross_val_score(classifier, X, y).mean()
>>> print('Score with the original data set: {0:.2f}'.format(score))
Score with the original data set: 0.88

Then, conduct feature selection based on random forest and sort the features by their importance scores:

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> random_forest = RandomForestClassifier(n_estimators=100, criterion='gini', n_jobs=-1)
>>> random_forest.fit(X, y)
>>> feature_sorted = np.argsort(random_forest.feature_importances_)
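Since np.argsort sorts in ascending order, the most important features sit at the end of feature_sorted. As an optional sanity check (not part of the original example; the exact indices and scores will vary between runs because of the forest's randomness), you can inspect the ten most important pixels and their scores:

>>> # Optional check: the ten most important pixel indices and their scores
>>> print(feature_sorted[-10:])
>>> print(random_forest.feature_importances_[feature_sorted[-10:]])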

Now, select different numbers of top features to construct new datasets, and estimate the accuracy on each one:

>>> K = [10, 15, 25, 35, 45]
>>> for k in K:
...     top_K_features = feature_sorted[-k:]
...     X_k_selected = X[:, top_K_features]
...     # Estimate accuracy on the data set with k selected features
...     classifier = SVC(gamma=0.005)
...     score_k_features = cross_val_score(classifier, X_k_selected, y).mean()
...     print('Score with the data set of top {0} features: {1:.2f}'.format(k, score_k_features))
...
Score with the data set of top 10 features: 0.88
Score with the data set of top 15 features: 0.93
Score with the data set of top 25 features: 0.94
Score with the data set of top 35 features: 0.92
Score with the data set of top 45 features: 0.88
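Random forest importance is just one way to rank features; as mentioned at the start of this section, Chapter 6 also performed feature selection with L1-regularized logistic regression. As a minimal sketch (not part of the original example), the same comparison could be repeated with L1-based selection via scikit-learn's SelectFromModel. The regularization strength C=0.1 and the liblinear solver below are illustrative assumptions, and the resulting score is not shown since it depends on these choices:

>>> # Hedged sketch: L1-based feature selection as an alternative ranking;
>>> # C=0.1 and solver='liblinear' are illustrative assumptions
>>> from sklearn.feature_selection import SelectFromModel
>>> from sklearn.linear_model import LogisticRegression
>>> l1_selector = SelectFromModel(
...     LogisticRegression(penalty='l1', C=0.1, solver='liblinear'))
>>> X_l1_selected = l1_selector.fit_transform(X, y)
>>> classifier = SVC(gamma=0.005)
>>> score_l1 = cross_val_score(classifier, X_l1_selected, y).mean()
>>> print('Score with L1-based feature selection: {0:.2f}'.format(score_l1))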