Investigating the feature importance

We have already spent quite some time on building the entire pipeline and tuning the models to achieve better performance. However, what is equally (or even more) important is the model's interpretability: not only giving an accurate prediction, but also being able to explain the why. In the case of customer churn, an accurate model is important, but knowing which features actually predict that customers will leave can help improve the overall service and, potentially, make them stay longer. In a financial setting, banks often use machine learning to predict a customer's ability to repay credit, and in many cases they are obliged to justify their reasoning when they decline a credit application, that is, to explain why exactly this customer's application was not approved. With very complicated models, this might be hard, or even impossible.

By knowing the features' importance, we can benefit in multiple ways, for example:

  • By understanding the model's logic, we can verify its correctness (for example, by checking whether a sensible feature is indeed a strong predictor), and we can also try to improve the model by focusing only on the important variables.
  • We can use the feature importances to keep only the x most important features (those contributing to a specified percentage of the total importance), which can lead not only to better performance but also to shorter training time.
  • In some real-life cases, it makes sense to sacrifice some accuracy (or any other performance metric) for the sake of interpretability.

It is also important to be aware that the more accurate (in terms of a specified performance metric) the model is, the more reliable the feature importances are. That is why we investigate the importance of the features after tuning the models.

In this recipe, we show how to calculate feature importance using the example of a Random Forest classifier; however, most of the presented methods are model-agnostic. Where they are not, there are often equivalent approaches (as in the case of XGBoost and LightGBM). We briefly present three methods of calculating feature importance.

Scikit-learn's feature importance: The default feature importance of a Random Forest is the mean decrease in impurity. As we know, decision trees use an impurity metric to select the best splits while growing. When training a decision tree, we can compute how much each feature contributes to decreasing the weighted impurity. To calculate the importance for the entire forest, the algorithm averages the decrease in impurity over all the trees.

Here are the advantages of this approach:

  • Fast calculation
  • Easy to retrieve

Here are the disadvantages of this approach:

  • Biased—It tends to inflate the importance of continuous (numerical) features or high-cardinality categorical variables. This can sometimes lead to absurd cases, whereby an additional random variable (unrelated to the problem at hand) scores high in the feature importance ranking.
  • Impurity-based importances are calculated on the basis of the training set and do not reflect the model's ability to generalize to unseen data.
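
A minimal sketch of retrieving these importances follows. It uses a synthetic dataset as a stand-in for the churn data, and names such as `rf` and `feature_names` are illustrative; the impurity-based importances are available in the fitted estimator's `feature_importances_` attribute:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for the churn dataset used in the recipe
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# mean decrease in impurity, averaged over all trees (values sum to 1)
mdi_importances = pd.Series(rf.feature_importances_, index=feature_names)
print(mdi_importances.sort_values(ascending=False))
```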

Permutation feature importance: This approach measures feature importance directly, by observing how randomly re-shuffling each predictor influences the model's performance. Additionally, the re-shuffling preserves the distribution of the variables.

The steps of the algorithm are:

  1. Train the baseline model and record the score of interest (this can be done on either the training set or on the validation set, to gain some insight about the model's ability to generalize).
  2. Randomly permute (re-shuffle) the values of one of the features, use the entire dataset (with one re-shuffled feature) to obtain predictions, and record the score. The feature importance is the difference between the baseline score and the one from the permuted dataset.
  3. Repeat the second step for all features.

Here are the advantages of this approach:

  • Model-agnostic
  • Reasonably efficient—no need to retrain the model at every step

Here are the disadvantages of this approach:

  • Computationally more expensive than the default feature importances
  • Overestimates the importance of correlated predictors
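
The steps listed above are implemented in scikit-learn's `permutation_importance` function. A minimal sketch, reusing the `rf`, `X_test`, and `y_test` objects from the previous snippet, evaluating on the test set to gain insight into generalization, and using recall as an example metric:

```python
from sklearn.inspection import permutation_importance

# re-shuffle each feature n_repeats times and record the average drop in score
perm_result = permutation_importance(rf, X_test, y_test,
                                     scoring="recall", n_repeats=10,
                                     random_state=42, n_jobs=-1)
perm_importances = pd.Series(perm_result.importances_mean,
                             index=feature_names)
print(perm_importances.sort_values(ascending=False))
```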

Drop column feature importance: The idea behind this approach is very simple. We compare a model with all the features to a model with one of the features dropped for training and inference. We repeat this process for all the features.

Here are the advantages of this approach:

  • Most accurate/reliable feature importance

Here are the disadvantages of this approach:

  • Potentially high computation cost, caused by retraining the model for each variant of the dataset
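
Drop column importance is not built into scikit-learn, but it can be sketched with a small helper. The `drop_column_importance` function below is illustrative only and assumes the objects defined in the previous snippets; it retrains a clone of the model via cross-validation for each dropped column:

```python
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import cross_val_score


def drop_column_importance(model, X, y, cv=5, scoring="recall"):
    """Importance of a feature = drop in CV score after removing its column."""
    baseline = cross_val_score(model, X, y, cv=cv, scoring=scoring).mean()
    importances = {}
    for col in X.columns:
        score = cross_val_score(clone(model), X.drop(columns=col), y,
                                cv=cv, scoring=scoring).mean()
        importances[col] = baseline - score
    return pd.Series(importances).sort_values(ascending=False)


# wrap the training data in a DataFrame so columns can be dropped by name
X_train_df = pd.DataFrame(X_train, columns=feature_names)
print(drop_column_importance(rf, X_train_df, y_train))
```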