Best practice 9 – deciding whether to reduce dimensionality, and if so, how to do so

Feature selection and dimensionality reduction differ in that the former selects features from the original data space, while the latter derives features from a space projected from the original one (see the sketch after the following list). Dimensionality reduction offers advantages similar to those of feature selection:

  • Reducing the training time of prediction models, as redundant or correlated features are merged into new ones
  • Reducing overfitting, for the same reason
  • Likely improving performance, as prediction models learn from data with fewer redundant or correlated features
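To make the difference concrete, here is a minimal sketch contrasting the two on the handwritten digits data; the choice of SelectKBest with k=25 is purely illustrative:

>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, f_classif
>>> from sklearn.decomposition import PCA
>>> X, y = load_digits(return_X_y=True)
>>> # Feature selection keeps 25 of the 64 original pixel features
>>> SelectKBest(f_classif, k=25).fit_transform(X, y).shape
(1797, 25)
>>> # PCA instead derives 25 new features, each a linear
>>> # combination of all 64 original pixels
>>> PCA(n_components=25).fit_transform(X).shape
(1797, 25)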

Again, it is not guaranteed that dimensionality reduction will yield better prediction results. To examine its effects, it is recommended to integrate dimensionality reduction into the model training stage. Reusing the preceding handwritten digits example, we can measure the effects of principal component analysis (PCA)-based dimensionality reduction by keeping different numbers of top components to construct new datasets, and estimating the accuracy on each:

>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import cross_val_score
>>> # Keep different numbers of top components
>>> N = [10, 15, 25, 35, 45]
>>> for n in N:
...     pca = PCA(n_components=n)
...     X_n_kept = pca.fit_transform(X)
...     # Estimate accuracy on the dataset with top n components
...     classifier = SVC(gamma=0.005)
...     score_n_components = cross_val_score(classifier, X_n_kept, y).mean()
...     print('Score with the data set of top {0} components: '
...           '{1:.2f}'.format(n, score_n_components))
Score with the data set of top 10 components: 0.95
Score with the data set of top 15 components: 0.95
Score with the data set of top 25 components: 0.91
Score with the data set of top 35 components: 0.89
Score with the data set of top 45 components: 0.88
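To integrate the choice of dimensionality into the model training stage itself, rather than transforming the data up front, one option is to chain PCA and the classifier in a pipeline and tune the number of components by cross-validated grid search. The following is only a sketch of that idea, reusing the candidate component counts and gamma value from above:

>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.model_selection import GridSearchCV
>>> pipeline = Pipeline([('pca', PCA()), ('svc', SVC(gamma=0.005))])
>>> # Treat the number of components as a hyperparameter to tune
>>> grid = GridSearchCV(pipeline,
...                     {'pca__n_components': [10, 15, 25, 35, 45]})
>>> grid = grid.fit(X, y)
>>> print('Best number of components:',
...       grid.best_params_['pca__n_components'])

This way, each candidate number of components is evaluated with the same cross-validation used to score the final model.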