How it works...

In the first step, we imported the required libraries. Then, we extracted the classifier and the ColumnTransformer preprocessor from the pipeline. In this recipe, we worked with a vanilla Random Forest classifier, without any hyperparameter tuning. The reason for this is that it delivers decent performance, and its computation time is much lower than that of a tuned model (which would use a higher number of estimators, and so on).
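The extraction itself boils down to looking up the fitted steps by name. Below is a minimal, self-contained sketch of the idea; the data, column names, and step names ('preprocessor', 'classifier') are illustrative assumptions, not the recipe's actual values:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data standing in for the recipe's training set
X_train = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "balance": [1000.0, 2500.0, 300.0, 4200.0, 150.0, 900.0],
    "gender": ["F", "M", "M", "F", "M", "F"],
})
y_train = pd.Series([0, 1, 0, 1, 0, 1])

num_features = ["age", "balance"]
cat_features = ["gender"]

# A pipeline mirroring the recipe's structure: a ColumnTransformer
# followed by a vanilla Random Forest classifier
preprocessor = ColumnTransformer(transformers=[
    ("numerical", StandardScaler(), num_features),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), cat_features),
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)

# Recover the fitted steps from the pipeline by name
rf_classifier = pipeline.named_steps["classifier"]
fitted_preprocessor = pipeline.named_steps["preprocessor"]
```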

Please refer to the accompanying Notebook in the book's GitHub repository to see how to extract the best pipeline from a fitted grid search, or how to assign the best hyperparameter values manually.

In Step 3, we extracted the column names from the ColumnTransformer preprocessor. It is important to concatenate them (for that, we used np.r_) in the correct order, that is, in the same order in which they were specified in the ColumnTransformer. In our case, we first transformed the numerical features and then the categorical ones. We also applied the preprocessing steps to the X_train DataFrame.
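Continuing the sketch above, the name extraction and concatenation could look as follows; note that recent scikit-learn versions expose get_feature_names_out(), whereas older versions used get_feature_names():

```python
# Get the one-hot encoded feature names from the fitted encoder
ohe = fitted_preprocessor.named_transformers_["categorical"]
cat_feature_names = ohe.get_feature_names_out(cat_features)

# Concatenate the names in the same order as in the ColumnTransformer:
# numerical features first, then the categorical (one-hot encoded) ones
feature_names = np.r_[num_features, cat_feature_names]

# Apply the fitted preprocessing steps to the training data
X_train_preprocessed = pd.DataFrame(
    fitted_preprocessor.transform(X_train), columns=feature_names
)
```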

In Step 4, we extracted the default feature importances (calculated using the mean decrease in impurity) from the feature_importances_ attribute of the fitted classifier. The values were normalized so that they added up to 1. Additionally, we calculated the cumulative feature importance.
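Continuing the same sketch, extracting the impurity-based importances and their cumulative sum could look like this:

```python
# MDI-based importances, aligned with the transformed feature names;
# scikit-learn already normalizes them so that they sum to 1
importances = pd.Series(
    rf_classifier.feature_importances_, index=feature_names
).sort_values(ascending=False)

# Cumulative importance of the features, from most to least important
cumulative_importances = importances.cumsum()
```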

In Step 5, we defined a function for plotting the most/least important features and plotted the top 10 most important features, calculated using the mean decrease in impurity (MDI).
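A simplified plotting helper in the spirit of the recipe's function (the name, signature, and styling here are illustrative rather than the book's exact code) might look as follows:

```python
import matplotlib.pyplot as plt

def plot_feature_importance(importances, n=10, top=True, title=None):
    """Plot the n most (or least) important features as a horizontal bar chart."""
    ordered = importances.sort_values(ascending=False)
    selected = ordered.head(n) if top else ordered.tail(n)
    selected.sort_values().plot.barh(title=title or "Feature importance")
    plt.xlabel("Importance")
    plt.tight_layout()
    plt.show()

plot_feature_importance(importances, n=10, top=True,
                        title="Top 10 features (MDI)")
```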

In Step 6, we plotted the cumulative importance of all the features. Using this plot, we could decide whether to reduce the number of features in the model to those accounting for a certain percentage of the total importance and, by doing so, potentially decrease the model's training time.
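As a rough illustration of that decision, the following continues the sketch and keeps only the features needed to reach an arbitrarily chosen 95% of the total importance:

```python
# Keep the smallest set of top features whose cumulative importance
# reaches the (arbitrary) 95% threshold
threshold = 0.95
n_features_to_keep = int((cumulative_importances < threshold).sum()) + 1
selected_features = importances.index[:n_features_to_keep]
print(f"Keeping {n_features_to_keep} features: {list(selected_features)}")
```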

In the next step, we calculated the permutation feature importance using PermutationImportance from the eli5 library. 

At the time of writing this chapter, there is a plan to add permutation importance functionality to the inspection module of scikit-learn. However, as it is only available in the development version, we do not cover it here. Additionally, an implementation of permutation importance can be found in the rfpimp library.

We called PermutationImportance with the default value of the cv argument ('prefit'), which is the most useful option for calculating the importance of an existing estimator, as it does not require fitting the model again. Using cv='prefit', we can pass either the training data or the validation set (in case we want to know the importances for generalization) to the fit method of PermutationImportance. We set the n_iter argument to 25, so the algorithm re-shuffles each feature 25 times, and the resulting score is the average over those re-shuffles. We can also set a scoring metric different from the estimator's default score method (please see the documentation for more information).
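Continuing the sketch with the illustrative objects defined earlier (not the recipe's exact variables), the call could look like this:

```python
from eli5.sklearn import PermutationImportance

# cv='prefit' is the default, so the already-fitted classifier is used as-is;
# n_iter=25 means each feature is re-shuffled 25 times and the scores averaged
perm = PermutationImportance(rf_classifier, n_iter=25, random_state=42)
perm.fit(X_train_preprocessed, y_train)

perm_importances = pd.Series(perm.feature_importances_, index=feature_names)
print(perm_importances.sort_values(ascending=False))
```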

In Step 8, we defined a function for calculating the drop column feature importance. We used the clone function of scikit-learn to create a copy of the model with the exact same specification as the baseline one. We then iteratively trained the model on a dataset without one feature, and compared the performance using the default score method. A possible extension would be to use a custom scoring metric instead (such as recall, or any other metric suitable for imbalanced data).
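A sketch of such a function, following the logic described in this step (the recipe's actual implementation may differ in its details), could be:

```python
from sklearn.base import clone

def drop_col_feat_imp(model, X, y, random_state=42):
    """Drop-column importance: baseline score minus the score of a clone
    refitted on the data with one column removed at a time."""
    model_clone = clone(model)
    model_clone.random_state = random_state
    model_clone.fit(X, y)
    benchmark_score = model_clone.score(X, y)

    importances = {}
    for col in X.columns:
        model_clone = clone(model)
        model_clone.random_state = random_state
        X_reduced = X.drop(columns=[col])
        model_clone.fit(X_reduced, y)
        importances[col] = benchmark_score - model_clone.score(X_reduced, y)

    return pd.Series(importances)

drop_col_importances = drop_col_feat_imp(
    rf_classifier, X_train_preprocessed, y_train
)
```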

In the last step, we applied the drop column feature importance function and plotted the results, showing both the most and the least important features.
