How to do it...

Execute the following steps to evaluate the feature importance of a Random Forest model.

  1. Import the libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import clone
from eli5.sklearn import PermutationImportance
  2. Extract the classifier and preprocessor from the pipeline:
rf_classifier = rf_pipeline.named_steps['classifier']
preprocessor = rf_pipeline.named_steps['preprocessor']
  3. Recover the feature names from the preprocessing transformer, and transform the training data:
feat_names = (
    preprocessor.named_transformers_['categorical']
                .named_steps['onehot']
                .get_feature_names(input_features=cat_features)
)
feat_names = np.r_[num_features, feat_names]
X_train_preprocessed = pd.DataFrame(
    preprocessor.transform(X_train), columns=feat_names
)
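
As an optional sanity check, we can confirm that the number of recovered feature names matches the width of the transformed training data:

# the number of columns after preprocessing should equal
# the number of recovered feature names
print(X_train_preprocessed.shape)
print(len(feat_names))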
  4. Extract the default feature importance and calculate the cumulative importance:
rf_feat_imp = pd.DataFrame(rf_classifier.feature_importances_,
                           index=feat_names,
                           columns=['mdi'])
rf_feat_imp = rf_feat_imp.sort_values('mdi', ascending=False)
rf_feat_imp['cumul_importance_mdi'] = np.cumsum(rf_feat_imp.mdi)
  5. Define a function for plotting the top x features in terms of their importance:
def plot_most_important_features(feat_imp, method='MDI',
                                 n_features=10, bottom=False):
    if bottom:
        indicator = 'Bottom'
        feat_imp = feat_imp.sort_values(ascending=True)
    else:
        indicator = 'Top'
        feat_imp = feat_imp.sort_values(ascending=False)

    ax = feat_imp.head(n_features).plot.barh()
    ax.invert_yaxis()
    ax.set(title=('Feature importance - '
                  f'{method} ({indicator} {n_features})'),
           xlabel='Importance',
           ylabel='Feature')
    return ax

We use the function as follows:

plot_most_important_features(rf_feat_imp.mdi, 
method='MDI')

Running the code results in the following plot:

The list of most important features is dominated by numerical ones: limit_balance, age, and various bill statements from the previous months.
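
To see the same ranking in numbers rather than in a plot, we can also inspect the sorted DataFrame directly; for example:

# MDI importances of a fitted random forest are normalized,
# so they should sum to (approximately) 1
print(rf_feat_imp.mdi.sum())

# the 10 most important features, together with their cumulative importance
rf_feat_imp.head(10)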

  6. Plot the cumulative importance of the features:
x_values = range(len(feat_names))

fig, ax = plt.subplots()

ax.plot(x_values, rf_feat_imp.cumul_importance_mdi, 'b-')
ax.hlines(y=0.95, xmin=0, xmax=len(x_values),
          color='g', linestyles='dashed')
ax.set(title='Cumulative Importances',
       xlabel='Variable',
       ylabel='Importance')

Running the code results in the following plot:

The top 10 features account for 56.61% of the total importance. The top 26 features account for 95% of the total importance.
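
These numbers can be read directly off the cumul_importance_mdi column; for example, a short sketch using the DataFrame built in step 4:

# cumulative importance captured by the 10 most important features
print(rf_feat_imp.cumul_importance_mdi.iloc[9])

# the number of features needed to reach 95% of the total importance
n_features_95 = np.argmax(rf_feat_imp.cumul_importance_mdi.values >= 0.95) + 1
print(n_features_95)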

  7. Calculate and plot the permutation importance:
perm = PermutationImportance(rf_classifier, n_iter=25,
                             random_state=42)
perm.fit(X_train_preprocessed, y_train)
rf_feat_imp['permutation'] = perm.feature_importances_

Plot the results using the custom function:

plot_most_important_features(rf_feat_imp.permutation, 
method='Permutation')

Running the code results in the following plot:

We can see that the ranking of the most important features has been reshuffled compared to the default (MDI) one. limit_balance is still very important; however, the most important feature is now payment_status_aug_Unknown, which is an undefined label (one not assigned a clear meaning in the original paper) of the payment_status_aug categorical feature. Many of the bill_statement features were also replaced by previous_payment variables.
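
One way to make this comparison explicit is to put the two rankings side by side; a minimal sketch, assuming rf_feat_imp already contains both the mdi and permutation columns:

# rank the features under each method (1 = most important)
rank_comparison = pd.DataFrame({
    'mdi_rank': rf_feat_imp.mdi.rank(ascending=False),
    'permutation_rank': rf_feat_imp.permutation.rank(ascending=False)
}).sort_values('permutation_rank')

# large differences indicate features whose relative importance
# changed the most between the two methods
rank_comparison['rank_change'] = (rank_comparison.mdi_rank
                                  - rank_comparison.permutation_rank)
rank_comparison.head(10)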

  8. Define a function for calculating the drop column feature importance:
def drop_col_feat_imp(model, X, y, random_state=42):

    # benchmark: the model fitted on all features
    model_clone = clone(model)
    model_clone.random_state = random_state
    model_clone.fit(X, y)
    benchmark_score = model_clone.score(X, y)

    importances = []

    # refit the model once per feature, each time with that feature dropped
    for col in X.columns:
        model_clone = clone(model)
        model_clone.random_state = random_state
        model_clone.fit(X.drop(col, axis=1), y)
        drop_col_score = model_clone.score(X.drop(col, axis=1), y)
        importances.append(benchmark_score - drop_col_score)

    return importances
  9. Calculate and plot the drop column feature importance:
rf_feat_imp['drop_column'] = drop_col_feat_imp(
    rf_classifier,
    X_train_preprocessed,
    y_train,
    random_state=42
)

First, plot the top 10 most important features:

plot_most_important_features(rf_feat_imp.drop_column, 
method='Drop column')

Running the code results in the following plot:

The limit_balance and age numerical features are, again, the most important. Another feature that appears in the top selection is a Boolean variable representing gender (male).

Then, plot the 10 least important features:

plot_most_important_features(rf_feat_imp.drop_column, 
method='Drop column',
bottom=True)

Running the code results in the following plot:

In the case of drop column feature importance, negative importance indicates that removing a given feature from the model actually improves the performance (in terms of the default score metric). We can use these results to remove features that have negative importance and thus potentially improve the model's performance and/or reduce the training time.
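
A minimal sketch of acting on this observation, assuming the drop_column column calculated in step 9 (whether dropping these features actually helps should still be verified on a validation set):

# features whose removal improved the benchmark score
negative_imp_features = (rf_feat_imp[rf_feat_imp.drop_column < 0]
                         .index
                         .tolist())
print(negative_imp_features)

# a reduced training set without those features
X_train_reduced = X_train_preprocessed.drop(columns=negative_imp_features)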
