Best practice 7 - decide whether to select features, and if so, how

In Chapter 6, Click-Through Prediction with Logistic Regression, we saw how feature selection was performed using L1-regularized logistic regression and random forest. The benefits of feature selection include:

  • Reducing the training time of prediction models, as redundant or irrelevant features are eliminated
  • Reducing overfitting, for the same reason
  • Likely improving performance, as prediction models learn from data containing more significant features

Note that we used the word likely because there is no absolute certainty that feature selection will increase prediction accuracy. It is therefore good practice to compare the performance obtained with feature selection against that obtained without it via cross-validation. As an example, in the following snippets we measure the effect of feature selection by estimating the average classification accuracy of an SVC model in a cross-validation manner:

First, we load the handwritten digits dataset from scikit-learn:

>>> from sklearn.datasets import load_digits
>>> dataset = load_digits()
>>> X, y = dataset.data, dataset.target
>>> print(X.shape)
(1797, 64)

Next, estimate the accuracy on the original dataset, which is 64-dimensional:

>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import cross_val_score
>>> classifier = SVC(gamma=0.005)
>>> score = cross_val_score(classifier, X, y).mean()
>>> print('Score with the original data set: {0:.2f}'.format(score))
Score with the original data set: 0.88

Then, conduct feature selection based on random forest and sort the features by their importance scores:

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> random_forest = RandomForestClassifier(n_estimators=100, criterion='gini', n_jobs=-1)
>>> random_forest.fit(X, y)
>>> feature_sorted = np.argsort(random_forest.feature_importances_)
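Since np.argsort sorts in ascending order, the most important features sit at the end of feature_sorted. As an optional sanity check (not part of the original example; the exact indices and scores will vary between runs because of the forest's randomness), you can inspect the ten most important pixels and their scores:

>>> # Optional check: the ten most important pixel indices and their scores
>>> print(feature_sorted[-10:])
>>> print(random_forest.feature_importances_[feature_sorted[-10:]])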

Now, select different numbers of top features to construct new datasets, and estimate the accuracy on each one:

>>> K = [10, 15, 25, 35, 45]
>>> for k in K:
...     top_K_features = feature_sorted[-k:]
...     X_k_selected = X[:, top_K_features]
...     # Estimate accuracy on the data set with k selected features
...     classifier = SVC(gamma=0.005)
...     score_k_features = cross_val_score(classifier, X_k_selected, y).mean()
...     print('Score with the data set of top {0} features: {1:.2f}'.format(k, score_k_features))
...
Score with the data set of top 10 features: 0.88
Score with the data set of top 15 features: 0.93
Score with the data set of top 25 features: 0.94
Score with the data set of top 35 features: 0.92
Score with the data set of top 45 features: 0.88
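Random forest importance is just one way to rank features; as mentioned at the start of this section, Chapter 6 also performed feature selection with L1-regularized logistic regression. As a minimal sketch (not part of the original example), the same comparison could be repeated with L1-based selection via scikit-learn's SelectFromModel. The regularization strength C=0.1 and the liblinear solver below are illustrative assumptions, and the resulting score is not shown since it depends on these choices:

>>> # Hedged sketch: L1-based feature selection as an alternative ranking;
>>> # C=0.1 and solver='liblinear' are illustrative assumptions
>>> from sklearn.feature_selection import SelectFromModel
>>> from sklearn.linear_model import LogisticRegression
>>> l1_selector = SelectFromModel(
...     LogisticRegression(penalty='l1', C=0.1, solver='liblinear'))
>>> X_l1_selected = l1_selector.fit_transform(X, y)
>>> classifier = SVC(gamma=0.005)
>>> score_l1 = cross_val_score(classifier, X_l1_selected, y).mean()
>>> print('Score with L1-based feature selection: {0:.2f}'.format(score_l1))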