Best practice 8 – deciding whether to select features, and if so, how to do so

In Chapter 7, Predicting Online Ads Click-through with Logistic Regression, we saw how feature selection can be performed using L1-regularized logistic regression and random forest. The benefits of feature selection include the following:

  • Reducing the training time of prediction models, as redundant or irrelevant features are eliminated
  • Reducing overfitting, for the same reason
  • Likely improving performance, as prediction models learn from data with more significant features

Note that we used the word likely because there is no guarantee that feature selection will increase prediction accuracy. It is therefore good practice to compare, via cross-validation, the performance obtained with and without feature selection. For example, by executing the following steps, we can measure the effect of feature selection by estimating the average classification accuracy of an SVC model with cross-validation:

  1. First, we load the handwritten digits dataset from scikit-learn, as follows:
>>> from sklearn.datasets import load_digits
>>> dataset = load_digits()
>>> X, y = dataset.data, dataset.target
>>> print(X.shape)
(1797, 64)
  2. Next, estimate the accuracy on the original dataset, which is 64-dimensional, as detailed here:
>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import cross_val_score
>>> classifier = SVC(gamma=0.005)
>>> score = cross_val_score(classifier, X, y).mean()
>>> print('Score with the original data set: {0:.2f}'.format(score))
Score with the original data set: 0.88
  3. Then, conduct feature selection based on random forest and sort the features by their importance scores:
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> random_forest = RandomForestClassifier(n_estimators=100, criterion='gini', n_jobs=-1)
>>> random_forest.fit(X, y)
>>> # np.argsort returns indices from least to most important,
>>> # so the most important features sit at the end of the array
>>> feature_sorted = np.argsort(random_forest.feature_importances_)
  4. Now, select different numbers of top features to construct new datasets, and estimate the accuracy on each of them, as follows:
>>> K = [10, 15, 25, 35, 45]
>>> for k in K:
...     top_K_features = feature_sorted[-k:]
...     X_k_selected = X[:, top_K_features]
...     # Estimate accuracy on the dataset with the k selected features
...     classifier = SVC(gamma=0.005)
...     score_k_features = cross_val_score(classifier, X_k_selected, y).mean()
...     print('Score with the data set of top {0} features: {1:.2f}'.format(k, score_k_features))
...
Score with the data set of top 10 features: 0.88
Score with the data set of top 15 features: 0.93
Score with the data set of top 25 features: 0.94
Score with the data set of top 35 features: 0.92
Score with the data set of top 45 features: 0.88
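
Judging from the cross-validated scores above, a moderate subset of the most informative features (around 25 in this run) outperforms the full 64-dimensional dataset. If you prefer to wrap this selection step into a reusable transformer, scikit-learn's SelectFromModel can reuse the fitted random forest. The following is a minimal sketch, assuming a scikit-learn version that supports the max_features argument of SelectFromModel, and taking 25 as an assumed cut-off based on the scores above:
>>> from sklearn.feature_selection import SelectFromModel
>>> # Reuse the fitted random forest (prefit=True) and keep the 25
>>> # features with the highest importance scores; 25 is an assumed
>>> # cut-off here, to be tuned with cross-validation as done above
>>> selector = SelectFromModel(random_forest, prefit=True,
...                            threshold=-np.inf, max_features=25)
>>> X_selected = selector.transform(X)
>>> print(X_selected.shape)
(1797, 25)
The resulting X_selected can then be fed to the same cross_val_score comparison, which keeps the feature selection logic and the downstream classifier neatly separated.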