© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
P. MishraExplainable AI Recipes https://doi.org/10.1007/978-1-4842-9029-3_4

4. Explainability for Ensemble Supervised Models

Pradeepta Mishra1  
(1)
Bangalore, Karnataka, India
 

Ensemble models are effective when individual models fail to balance bias and variance on a training dataset. In an ensemble, the predictions of many models are aggregated to generate the final prediction. For supervised regression problems, many models are trained, and the average of their predictions is taken as the final prediction. Similarly, for supervised classification problems, multiple models are trained, each model generates a classification, and the final prediction is decided by the majority voting rule. Because of this layered structure, ensemble models are harder to explain to end users. That is why we need frameworks that can explain ensemble models.

Ensemble means a grouping of model predictions. There are three types of ensemble models: bagging, boosting, and stacking. Bagging stands for bootstrap aggregation: draw bootstrap samples of the training data (random forest additionally selects random subsets of the available features), train a model on each sample, repeat the process several times, and average the predictions to generate the final prediction. Random forest is one of the most important and popular bagging models.

Boosting is a sequential method of improving the predictive power of the model. It starts with a base classifier trained on the data to predict and classify the output. In the next step, the cases the model got wrong are automatically given more emphasis, and the model is retrained with this focus. This process continues as long as there is scope to boost the accuracy further; when no further improvement is possible, the iterations stop, and the final accuracy is reported.

Stacking is a process of generating predictions from a set of different models and combining them, either by averaging the predictions or by training a meta-model on them. A minimal sketch of prediction averaging and majority voting is shown below.
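The toy data and shallow decision trees in the sketch are illustrative assumptions, not the chapter's dataset or models; the point is only how the aggregation step works.

import numpy as np
from sklearn.datasets import make_regression, make_classification
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

rng = np.random.RandomState(0)

# regression: train each tree on a bootstrap sample and average the predictions
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
reg_preds = []
for _ in range(10):
    idx = rng.choice(len(Xr), size=len(Xr), replace=True)  # bootstrap sample of rows
    reg_preds.append(DecisionTreeRegressor(max_depth=3).fit(Xr[idx], yr[idx]).predict(Xr))
final_regression_prediction = np.mean(reg_preds, axis=0)

# classification: train each tree on a bootstrap sample and take a majority vote
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf_preds = []
for _ in range(10):
    idx = rng.choice(len(Xc), size=len(Xc), replace=True)
    clf_preds.append(DecisionTreeClassifier(max_depth=3).fit(Xc[idx], yc[idx]).predict(Xc))
final_class_prediction = (np.mean(clf_preds, axis=0) >= 0.5).astype(int)  # majority vote for 0/1 labels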

The goal of this chapter is to introduce various explainability libraries and techniques for ensemble models, covering feature importance, partial dependency plots, and local and global interpretations of the models.

Recipe 4-1. Explainable Boosting Machine Interpretation

Problem

You want to explain an explainable boosting machine (EBM) as an ensemble model and generate its global and local interpretations.

Solution

EBMs are tree-based, cyclic gradient boosting models structured as a generalized additive model (GAM) with automatic interaction detection; the GAM form is shown below. Unlike most boosted ensembles, EBMs are interpretable by design, a glass-box model rather than a black box. We need an additional library known as interpret-core.
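In a standard GAM-with-interactions formulation (exact notation varies by implementation), the model is

$g\big(E[y]\big) = \beta_0 + \sum_j f_j(x_j) + \sum_{(i,j)} f_{ij}(x_i, x_j)$

where each per-feature shape function $f_j$, and each automatically detected pairwise interaction $f_{ij}$, is learned with small trees fitted in a cyclic, round-robin pass over the features; the final prediction is the sum of these learned functions.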

How It Works

Let’s take a look at the following example. The Shapley value is commonly referred to as the SHAP value. SHAP values come from cooperative game theory and provide an impartial way of distributing a prediction among the features: the model’s input features are treated as the players in the game, and the model function as the rules of the game. The Shapley value of a feature is computed using the following steps:
  1. SHAP conceptually requires evaluating the model over all feature subsets; hence, it usually takes time if explanations have to be generated for larger datasets.
  2. Identify a feature subset from the list of features (for example, with 15 features, we can select a subset of 5).
  3. For any particular feature, two models are created using the subset: one with the feature and one without it.
  4. The difference between the two predictions is computed.
  5. The differences in prediction are computed for all possible subsets of features.
  6. The weighted average of all these differences gives the feature importance.

If the weight of a feature is 0.000, we can conclude that the feature is not important and did not contribute to the model. If it is not equal to 0.000, the feature has a role to play in the prediction process. A worked toy example of these steps follows.
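The toy example below computes Shapley values for a two-feature model; the value function v(S), giving the model's average prediction when only the features in S are known, is filled with hypothetical numbers.

from itertools import combinations
from math import factorial

# hypothetical value function v(S): average prediction when only the features in S are known
v = {(): 10.0, ('x1',): 14.0, ('x2',): 12.0, ('x1', 'x2'): 18.0}
features = ['x1', 'x2']

def shapley(feature):
    others = [f for f in features if f != feature]
    n = len(features)
    total = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            S = tuple(sorted(subset))
            S_with = tuple(sorted(subset + (feature,)))
            # weight of this subset in the Shapley formula
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (v[S_with] - v[S])  # marginal contribution of the feature
    return total

print(shapley('x1'))  # 5.0
print(shapley('x2'))  # 3.0, and 5.0 + 3.0 equals v(all features) - v(empty set) = 8.0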

We are going to use a dataset from the UCI machine learning repository. The URL to access the dataset is as follows:

https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction

The objective is to predict the appliances’ energy use in Wh, using the features from sensors. There are 27 features in the dataset, and here we are trying to understand what features are important in predicting the energy usage. See Table 4-1.
Table 4-1

Feature Description from the Energy Prediction Dataset

Feature Name | Description | Unit
Appliances | Energy use | Wh
lights | Energy use of light fixtures in the house | Wh
T1 | Temperature in kitchen area | Celsius
RH_1 | Humidity in kitchen area | %
T2 | Temperature in living room area | Celsius
RH_2 | Humidity in living room area | %
T3 | Temperature in laundry room area | Celsius
RH_3 | Humidity in laundry room area | %
T4 | Temperature in office room | Celsius
RH_4 | Humidity in office room | %
T5 | Temperature in bathroom | Celsius
RH_5 | Humidity in bathroom | %
T6 | Temperature outside the building (north side) | Celsius
RH_6 | Humidity outside the building (north side) | %
T7 | Temperature in ironing room | Celsius
RH_7 | Humidity in ironing room | %
T8 | Temperature in teenager room 2 | Celsius
RH_8 | Humidity in teenager room 2 | %
T9 | Temperature in parents' room | Celsius
RH_9 | Humidity in parents' room | %
T_out | Temperature outside (from the Chievres weather station) | Celsius
Press_mm_hg | Pressure (from the Chievres weather station) | mm Hg
RH_out | Humidity outside (from the Chievres weather station) | %
Windspeed | Wind speed (from the Chievres weather station) | m/s
Visibility | Visibility (from the Chievres weather station) | km
Tdewpoint | Dew point temperature (from the Chievres weather station) | Celsius
rv1 | Random variable 1 | Nondimensional
rv2 | Random variable 2 | Nondimensional

pip install shap
!pip install interpret-core  # installs the core package without the optional dependencies
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: interpret-core in /usr/local/lib/python3.7/dist-packages (0.2.7)
import pandas as pd
df_lin_reg = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv')
del df_lin_reg['date']
df_lin_reg.info()
df_lin_reg.columns
Index(['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'rv1', 'rv2'], dtype='object')
# y is the dependent variable that we need to predict
y = df_lin_reg.pop('Appliances')
# X is the set of input features
X = df_lin_reg
# fit a GAM model to the data
import interpret.glassbox
import shap
model_ebm = interpret.glassbox.ExplainableBoostingRegressor()
model_ebm.fit(X, y)
X100 = X[:100]
# explain the GAM model with SHAP
explainer_ebm = shap.Explainer(model_ebm.predict, X100)
shap_values_ebm = explainer_ebm(X100)
import numpy as np
pd.DataFrame(np.round(shap_values_ebm.values,2)).head(2)

A table with 27 columns and 2 rows. The columns are numbered from 0 to 26 and the rows are labeled 0 and 1.

pd.DataFrame(np.round(shap_values_ebm.base_values,2)).head(2)
        0
0  103.74
1  103.74

Recipe 4-2. Partial Dependency Plot for Tree Regression Models

Problem

You want to get a partial dependency plot from a boosting model.

Solution

The solution to this problem is to use a partial dependency plot from the model using SHAP.

How It Works

Let’s take a look at the following example (see Figure 4-1):
# make a standard partial dependence plot with a single SHAP value overlaid
sample_ind = 20
fig,ax = shap.partial_dependence_plot(
    "lights", model_ebm.predict, X100, model_expected_value=True,
    feature_expected_value=True, show=False, ice=False,
    shap_values=shap_values_ebm[sample_ind:sample_ind+1,:]
)

A SHAP partial dependence plot has SHAP values versus lights. A solid line starts from (0, 60) and ends at (70, 180) in an increasing step trend. 2 dotted lines E lights, and E f of x intersect at (15,105). A point is indicated on the solid line at (20, 120). The highest bar is at (0, 105). All values are estimated.

Figure 4-1

Correlation between the lights feature and the predicted output of the model

The relationship between the lights feature and the model's predicted energy usage is shown, and the steps reveal a nonlinear pattern. The partial dependency plot is a way to explain individual predictions and generate local interpretations for the sample selected from the dataset. See Figure 4-2.
shap.plots.scatter(shap_values_ebm[:,"lights"])

A scatterplot of SHAP value for lights versus lights. The dot clusters have an increasing trend between approximately (0, 0) and (60, 80).

Figure 4-2

Correlation between the feature lights and the SHAP values

# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.waterfall(shap_values_ebm[sample_ind], max_display=14)

A SHAP waterfall plot has feature values of 14 variables. T 3 and T 9 have the highest and the lowest SHAP values of 155.02 and negative 193.5, respectively. The dotted lines for f of x and E f of x are at 194.184 and 103.74 respectively.

Figure 4-3

Feature importance for a specific sample record, local interpretation

# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.beeswarm(shap_values_ebm, max_display=14)

A SHAP bee swarm plot. A gradient scale indicates the shades of high to low feature values. The low feature values of T 9 and the sum of 14 other features have the maximum negative and positive SHAP values of around negative 250 and 280, respectively.

Figure 4-4

SHAP values’ impact on the model output, global explanation

To generate the global explainer, we need to install another visualization library.
!pip install dash_cytoscape
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
from interpret import show
ebm_global = model_ebm.explain_global()
show(ebm_global)

A screenshot of the graph component selection screen with a dropdown box on the top. R H underscore 1 graph below plots score versus numbers and has a fluctuating almost straight line between (25, negative 0.1) and (65, 0.1) approximately.

Figure 4-5

Selecting features from the drop-down to see its contribution

A screenshot of the explainable boosting regressor section. A line graph R H underscore 1 plots the score and has a straight line with slight fluctuations. A density histogram with 21 ranges has the highest bar at almost (37.4 through 39.1, 3000).

Figure 4-6

Score of feature RH_1 and its distribution, global interpretation

ebm_local = model_ebm.explain_local(X[:5], y[:5])
show(ebm_local)
ebm_local
import numpy as np
pd.DataFrame(np.round(shap_values_ebm.values,2)).head(2)
pd.DataFrame(np.round(shap_values_ebm.base_values,2)).head(2)

Recipe 4-3. Explain an Extreme Gradient Boosting Model with All Numerical Input Variables

Problem

You want to explain the extreme gradient boosting–based regressor.

Solution

The XGB regressor can be explained using the SHAP library; we can populate the global and local interpretations.

How It Works

Let’s take a look at the following example:
# train XGBoost model
import xgboost
model_xgb = xgboost.XGBRegressor(n_estimators=100, max_depth=2).fit(X, y)
# explain the GAM model with SHAP
explainer_xgb = shap.Explainer(model_xgb, X)
shap_values_xgb = explainer_xgb(X)
# make a standard partial dependence plot with a single SHAP value overlaid
sample_ind = 18
fig,ax = shap.partial_dependence_plot(
    "lights", model_xgb.predict, X, model_expected_value=True,
    feature_expected_value=True, show=False, ice=False,
    shap_values=shap_values_xgb[sample_ind:sample_ind+1,:]
)

A SHAP partial dependence plot has SHAP values versus lights. The solid line starts from (0, 90) and rises to (70, 130) in an increasing step trend. 2 dotted lines E lights, and E f of x intersect at (4, 98). A point is indicated at (20, 120). The highest bar value is at (0, 135). All values are estimated.

Figure 4-7

Partial dependence plot of the lights feature with a single SHAP value overlaid

The XGB regressor–based model also provides a summary plot that shows the impact of the SHAP values on the model output. If we need to explain the global importance of the features, that is, which features matter across all data points, we can use the summary plot.
shap.plots.scatter(shap_values_xgb[:,"lights"])

A scatterplot of SHAP value for lights versus lights. The majority of the dots are clustered between (8, 15) and (50, 60) approximately, with an increasing trend.

Figure 4-8

SHAP values of the lights feature plotted against the lights feature

shap.plots.scatter(shap_values_xgb[:,"lights"], color=shap_values_xgb)

A scatter plot of SHAP value for lights and T 8 versus lights. A gradient scale on the right indicates the shades of T 8 values between 19 and 25. A majority of dots in gradient shades are clustered between (5, 18) and (50, 60) approximately.

Figure 4-9

Scatter plot of two features, T8 and lights, against the SHAP values of light

shap.summary_plot(shap_values_xgb, X)

A SHAP bee swarm plot. A gradient scale indicates the shades of high to low feature values. Press underscore m m underscore h g has the maximum positive and negative SHAP values of almost 100 and negative 40 for the low and high feature values, respectively.

Figure 4-10

Global feature importance based on the SHAP value

Recipe 4-4. Explain a Random Forest Regressor with Global and Local Interpretations

Problem

Random forest (RF) is a bagging approach to creating ensemble models; because the final prediction is averaged over many trees, it is difficult to see how any single tree contributes. You want to generate global and local interpretations for a random forest regressor.

Solution

We are going to use the tree explainer from the SHAP library.

How It Works

Let’s take a look at the following example:
import shap
from sklearn.ensemble import RandomForestRegressor
rforest = RandomForestRegressor(n_estimators=100, max_depth=3, min_samples_split=20, random_state=0)
rforest.fit(X, y)
# explain all the predictions in the test set
explainer = shap.TreeExplainer(rforest)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)

A SHAP bee swarm plot. A gradient scale indicates the shades of high to low feature values. The lights feature has the maximum positive and negative SHAP value of almost 200 and negative 50 for its high and low feature values respectively.

Figure 4-11

SHAP value impact on model prediction

shap.dependence_plot("lights", shap_values, X)

A SHAP scatterplot of SHAP value for lights and T 8 versus lights. A gradient scale indicates the shades of T 8 values between 19 and 25. The dots with a low T 8 and maximum SHAP values are at approximately (20, 200) and (30, 200). High T 8 and minimum SHAP values are at almost (0, 0).

Figure 4-12

SHAP value of lights plotted against T8 and lights

# explain all the predictions in the dataset
shap.force_plot(explainer.expected_value, shap_values, X)
shap.partial_dependence_plot(
    "lights", rforest.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)

A SHAP partial dependence plot has SHAP values versus lights. The solid line starts at (0, 50), moves straight to (5, 50), rises sharply till (6, 133), and then ends at (70, 134). The highest bar is at (0, 125). 2 dotted lines E lights, and E f of x intersect at (5, 98). All values are estimated.

Figure 4-13

Partial dependency plot of lights

Recipe 4-5. Explain the Catboost Regressor with Global and Local Interpretations

Problem

Catboost is another boosting model; it speeds up training by handling explicitly declared categorical features natively. If there are no categorical features, the model simply trains on the numeric features. You want to explain the global and local interpretations from the catboost regression model.

Solution

We are going to use the tree explainer from the SHAP library and the catboost library.

How It Works

Let’s take a look at the following example:
!pip install catboost
import catboost
from catboost import *
import shap
shap.initjs()
model = CatBoostRegressor(iterations=100, learning_rate=0.1, random_seed=123)
model.fit(X, y, verbose=False, plot=False)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# summarize the effects of all the features
shap.summary_plot(shap_values, X)

A SHAP bee swarm plot. The high feature values of R H underscore 1 and press underscore m m underscore h g have the maximum positive and negative SHAP values of almost 100 and negative 30, respectively.

Figure 4-14

SHAP value impact on model predictions

# create a SHAP dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot("lights", shap_values, X)

A scatter plot of SHAP value for lights and T 8 versus lights. The dots with the maximum SHAP values and low T 8 values are at approximately (30, 85) and (40, 85). High T 8 and minimum SHAP values are at almost (0, negative 1).

Figure 4-15

SHAP value of lights dependence plot

Recipe 4-6. Explain the EBM Classifier with Global and Local Interpretations

Problem

EBM stands for explainable boosting machine. You want to explain the global and local interpretations from an EBM model.

Solution

We are going to use the SHAP explainer together with the EBM implementation from the interpret library.

How It Works

Let’s take a look at the following example. We are going to use a public automobile dataset with some modifications. The objective is to predict the price of a vehicle given the features such as make, location, age, etc. It is a regression problem that we are going to solve using a mix of numeric and categorical features.
df = pd.read_csv('https://raw.githubusercontent.com/pradmishra1/PublicDatasets/main/automobile.csv')
df.head(3)
df.columns
Index(['Price', 'Make', 'Location', 'Age', 'Odometer', 'FuelType', 'Transmission', 'OwnerType', 'Mileage', 'EngineCC', 'PowerBhp'], dtype='object')
We cannot use string-based or categorical features in the model directly because matrix multiplication is not possible on strings; hence, they need to be transformed into dummy variables, that is, binary features with 0/1 flags. The transformation step itself is skipped here, and we import an already transformed dataset directly; a sketch of the transformation is shown below for reference.
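A minimal sketch of such a transformation with pandas get_dummies follows; the exact columns to encode, the drop_first choice, and dropping the Make column are assumptions made to roughly mirror the hosted transformed file, which the recipe downloads directly.

import pandas as pd

df_raw = pd.read_csv('https://raw.githubusercontent.com/pradmishra1/PublicDatasets/main/automobile.csv')
df_raw = df_raw.drop(columns=['Make'])  # assumption: the transformed file carries no Make dummies

# one-hot encode the string columns into 0/1 dummy variables,
# dropping the first level of each to avoid perfect collinearity
cat_cols = ['Location', 'FuelType', 'Transmission', 'OwnerType']
df_encoded = pd.get_dummies(df_raw, columns=cat_cols, drop_first=True)
print(df_encoded.columns)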
df_t = pd.read_csv('https://raw.githubusercontent.com/pradmishra1/PublicDatasets/main/Automobile_transformed.csv')
del df_t['Unnamed: 0']
df_t.head(3)
df_t.columns
Index(['Price', 'Age', 'Odometer', 'mileage', 'engineCC', 'powerBhp', 'Location_Bangalore', 'Location_Chennai', 'Location_Coimbatore', 'Location_Delhi', 'Location_Hyderabad', 'Location_Jaipur', 'Location_Kochi', 'Location_Kolkata', 'Location_Mumbai', 'Location_Pune', 'FuelType_Diesel', 'FuelType_Electric', 'FuelType_LPG', 'FuelType_Petrol', 'Transmission_Manual', 'OwnerType_Fourth +ACY- Above', 'OwnerType_Second', 'OwnerType_Third'], dtype='object')
# y is the dependent variable that we need to predict
y = df_t.pop('Price')
# X is the set of input features
X = df_t
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
import pandas as pd
from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
import shap
import sklearn
To compute the SHAP values, we can use the explainer function with the training dataset X and the model predict function. The SHAP value calculation takes place using a permutation approach; it took 5 minutes.
# fit a GAM model to the data
import interpret.glassbox
import shap
model_ebm = interpret.glassbox.ExplainableBoostingRegressor()
model_ebm.fit(X, y)
X100 = X[:100]
# explain the GAM model with SHAP
explainer_ebm = shap.Explainer(model_ebm.predict, X100)
shap_values_ebm = explainer_ebm(X100)
import numpy as np
pd.DataFrame(np.round(shap_values_ebm.values,2)).head(2)
pd.DataFrame(np.round(shap_values_ebm.base_values,2)).head(2)

Recipe 4-7. SHAP Partial Dependency Plot for Regression Models with Mixed Input

Problem

You want to plot the partial dependency plot and interpret the graph for numeric and categorical dummy variables.

Solution

The partial dependency plot shows the relationship between a feature and the predicted output of the model. There are two ways to showcase the results: one plots the feature against the expected value of the prediction function, and the other superimposes a specific data point on the partial dependency plot.

How It Works

Let’s take a look at the following example:
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
from interpret import show
ebm_global = model_ebm.explain_global()
show(ebm_global)
ebm_local = model_ebm.explain_local(X[:5], y[:5])
show(ebm_local)
# make a standard partial dependence plot with a single SHAP value overlaid
sample_ind = 20
fig,ax = shap.partial_dependence_plot(
    "powerBhp", model_ebm.predict, X100, model_expected_value=True,
    feature_expected_value=True, show=False, ice=False,
    shap_values=shap_values_ebm[sample_ind:sample_ind+1,:]
)

A SHAP partial dependence plot has SHAP values versus power b h p. The line rises between (0, 5) and (300, 44) with multiple sharp peaks and ends at (500, 40). A point is indicated at (200, 17.5). The highest bar is at (75, 10). 2 dotted lines E power b h p, and E f of x intersect at (110, 8). All values are estimated.

Figure 4-16

Nonlinear relationship between the powerBhp and the predicted output from the model

The nonlinear blue line shows the positive correlation between the price and the powerBhp. The powerBhp is a strong feature; the higher the bhp, the higher the price of the car.
shap.partial_dependence_plot(
    "powerBhp", model_ebm.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)

A SHAP partial dependence plot has SHAP values versus power b h p. The line rises between (0, 5) and (300, 45) with multiple sharp peaks and ends at (600, 40). The highest bar is at (75, 17). 2 dotted lines E power b h p, and E f of x intersect at (110, 9). All values are estimated.

Figure 4-17

Partial dependence plot of powerBhp

This was a continuous (numeric) feature. Let's look at the binary, or dummy, features: there are two dummy features indicating whether the car is registered in Bangalore or in Kolkata.
shap.partial_dependence_plot(
    "Location_Bangalore", model_ebm.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)

A SHAP partial dependence plot has SHAP values versus location Bangalore. A straight horizontal line is between (negative 0.1, 9.49) and (1.0, 9.49). Two bars are at (0.0, 9.9) and (1.0, 9.04). 2 dotted lines E location Bangalore and E f of x intersect at (0.8, 9.49). All values are estimated.

Figure 4-18

Dummy variable Bangalore versus SHAP value

Whether or not the car is located in Bangalore, the expected prediction stays at about 9.5; this dummy variable has essentially no impact on the price.
shap.partial_dependence_plot(
    "Location_Kolkata", model_ebm.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)

A SHAP partial dependence plot has the SHAP values versus location Kolkata. A straight horizontal line is between (negative 0.1, 9.49) and (1.0, 9.49). Two bars are at (0.0, 9.9) and (1.0, 9.05). 2 dotted lines E location Kolkata and E f of x intersect at (0.8, 9.49). All values are estimated.

Figure 4-19

Dummy variable Location_Kolkata versus SHAP value

If the location is Kolkata, then the price is expected to be the same. There is no impact of the dummy variable on the price.

Recipe 4-8. SHAP Feature Importance for Tree Regression Models with Mixed Input Variables

Problem

You want to get the global feature importance from SHAP values using mixed input feature data.

Solution

The solution to this problem is to take the absolute SHAP values, sort them in descending order, and present them in a waterfall chart, a beeswarm chart, a scatter plot, etc.

How It Works

Let’s take a look at the following example:
shap.plots.scatter(shap_values_ebm[:,"powerBhp"])

A scatterplot and a histogram of SHAP values versus power B H P. The dot cluster rises in a curve almost between (50, negative 5) and (200, 14). The histogram has the highest bar at around (100, 15).

Figure 4-20

Scatter plot of powerBhp and its SHAP values

# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.waterfall(shap_values_ebm[sample_ind], max_display=14)

A SHAP waterfall plot of features versus SHAP values. Power b h p has the highest SHAP value of 9.84 and the Kochi location has the lowest SHAP value of negative 0.54. The lines f of x and E f of x are at 21.319 and 8.383 respectively. All values are estimated.

Figure 4-21

Feature importance for a specific example

# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.beeswarm(shap_values_ebm, max_display=14)

A SHAP bee swarm plot. The high feature values of power B H P and age have the maximum and the minimum SHAP values of 30 and negative 10, respectively. All values are estimated.

Figure 4-22

Importance of SHAP values on model prediction

# explain all the predictions in the dataset
shap.summary_plot(shap_values_ebm, X100)

A SHAP bee swarm plot with all features. The high feature values of power B H P and age have the maximum and the minimum SHAP values of 30 and negative 10, respectively. The feature values of the rest are mostly at 0. All values are estimated.

Figure 4-23

Explaining all predictions with feature importance

At a high level, for the tree-based nonlinear model used to predict automobile prices, the features above are the important ones: powerBhp is the strongest, followed by the age of the car, the petrol fuel type, the manual transmission type, etc. The plots above show the global feature importance; a manual computation of this ranking is sketched below.
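A minimal sketch of that computation, assuming the shap_values_ebm and X100 objects created earlier in this recipe are still in scope:

import numpy as np
import pandas as pd

# mean absolute SHAP value per feature, sorted in descending order
global_importance = (
    pd.DataFrame(np.abs(shap_values_ebm.values), columns=X100.columns)
    .mean()
    .sort_values(ascending=False)
)
print(global_importance.head(10))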

Recipe 4-9. Explaining the XGBoost Model

Problem

You want to generate explainability for an XGBoost model for regression.

Solution

The XGBoost regressor is trained with 100 trees and a max depth of 2 on a dataset that contains both numerical and categorical (dummy-encoded) features. The total number of features is 23; an ideal dataset for XGBoost would have more than 50 features, but that requires more computation time.

How It Works

Let’s take a look at the following example:
# train XGBoost model
import xgboost
model_xgb = xgboost.XGBRegressor(n_estimators=100, max_depth=2).fit(X, y)
# explain the GAM model with SHAP
explainer_xgb = shap.Explainer(model_xgb, X)
shap_values_xgb = explainer_xgb(X)
# make a standard partial dependence plot with a single SHAP value overlaid
sample_ind = 18
fig,ax = shap.partial_dependence_plot(
    "powerBhp", model_xgb.predict, X, model_expected_value=True,
    feature_expected_value=True, show=False, ice=False,
    shap_values=shap_values_xgb[sample_ind:sample_ind+1,:]
)

A SHAP partial dependence plot of SHAP values versus power b h p. The line rises between (0, 5) and (300, 42), moves straight and ends at (600, 50). A point is indicated at (80, 6). The highest bar is at (80, 15). The lines for E power b h p and E f of x intersect at (110, 10). All values are estimated.

Figure 4-24

Partial dependency plot with a sample

shap.plots.scatter(shap_values_xgb[:,"mileage"])

A scatterplot has SHAP values versus mileage. The dots cluster is horizontal between (12, 0) and (30, 0). The dots at 10 on the X-axis spread between 0 and 2 on the Y-axis. The highest bar is at (17,0). All values are estimated.

Figure 4-25

Mileage feature and its SHAP values

shap.plots.scatter(shap_values_xgb[:,"powerBhp"], color=shap_values_xgb)

A scatterplot has SHAP value and age versus power B H P. A gradient shade line has age values between 2 and 12. The dots rise in a curve between (50, negative 5) and (300, 40), and are then horizontal till beyond 500. The maximum value is for an age 8 dot at (600, 45). All values are estimated

Figure 4-26

Scatter plot of powerBhp, age, and SHAP value of powerBhp

shap.summary_plot(shap_values_xgb, X)

A SHAP bee swarm plot for model predictions. The high feature values of power b h p and age have the maximum and the minimum SHAP values of 45 and negative 12, respectively. All values are estimated.

Figure 4-27

SHAP value impact on model predictions

Recipe 4-10. Random Forest Regressor for Mixed Data Types

Problem

You want to generate explainability for a random forest model using numeric as well as categorical features.

Solution

Random forest is most useful when we have many features, say more than 50; in this recipe it is applied to 23 features. We could pick a larger dataset, but that would require more computation and take longer to train, so be mindful of the model configuration when training on a smaller machine.

How It Works

Let’s take a look at the following example:
import shap
from sklearn.ensemble import RandomForestRegressor
rforest = RandomForestRegressor(n_estimators=100, max_depth=3, min_samples_split=20, random_state=0)
rforest.fit(X, y)
# explain all the predictions in the test set
explainer = shap.TreeExplainer(rforest)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)

A SHAP bee swarm plot for model output. The high feature values of power B H P and age have the maximum and the minimum SHAP values of almost 35 and negative 10, respectively. The feature values of the rest are mostly at 0.

Figure 4-28

SHAP value impact on model output

shap.dependence_plot("powerBhp", shap_values, X)

A scatter plot has SHAP values and age versus power B H P. The dots of different age values rise between almost (50, negative 5) and (300, 40) and then are horizontal till beyond 500.

Figure 4-29

SHAP dependence plot

shap.partial_dependence_plot(
    "mileage", rforest.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)

A SHAP partial dependence plot of SHAP values versus mileage. The line moves straight between (5, 9.59) and (7.5, 9.59), declines sharply till (7.5, 9.49), and then moves straight up to (35, 0.49). The highest bar is at (17, 9.49). The lines for E milage and E f of x intersect at (19, 9.49). All values are estimated.

Figure 4-30

Partial dependency plot of mileage

Recipe 4-11. Explaining the Catboost Model

Problem

You want to generate explainability for a dataset where many of the features are categorical, using a boosting model that handles categorical variables.

Solution

The catboost model is known to work well when there are more categorical variables than numeric ones; hence, we use the catboost regressor.

How It Works

Let’s take a look at the following example:
!pip install catboost
import catboost
from catboost import *
import shap
model = CatBoostRegressor(iterations=100, learning_rate=0.1, random_seed=123)
model.fit(X, y, verbose=False, plot=False)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# summarize the effects of all the features
shap.summary_plot(shap_values, X)

A SHAP bee swarm plot. The high feature values of power b h p and age have the maximum and the minimum SHAP values of almost 39 and negative 14, respectively. The low feature values of most variables are at 0.

Figure 4-31

SHAP value impact on model predictions

# create a SHAP dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot("powerBhp", shap_values, X)

A scatter plot has SHAP value and age versus power b h p. The dots for different ages rise between almost (50, negative 5) and (300, 35) and then a few dots scatter horizontally towards the right till beyond 500.

Figure 4-32

SHAP dependence plot

Recipe 4-12. LIME Explainer for the Catboost Model and Tabular Data

Problem

You want to generate explainability at a local level in a focused manner rather than at a global level.

Solution

The solution to this problem is to use the LIME library. LIME is a model-agnostic technique: instead of opening up the trained model, it fits a simple surrogate model on perturbed samples around a single record, thereby localizing the problem and explaining the model at a local level.

How It Works

Let’s take a look at the following example. LIME requires a NumPy array as input to the tabular explainer; hence, the Pandas dataframe needs to be transformed into an array. A conceptual sketch of what LIME does under the hood is shown first.
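The sketch perturbs the record of interest, weights the perturbed samples by their proximity to it, and fits a weighted linear surrogate whose coefficients act as the local explanation. It is not the library's internal code; model and X refer to the catboost regressor and automobile features from the previous recipe.

import numpy as np
from sklearn.linear_model import Ridge

def lime_like_explanation(predict_fn, X, row_index, n_samples=1000, kernel_width=None):
    """Fit a locally weighted linear surrogate around one record (conceptual sketch)."""
    rng = np.random.RandomState(0)
    X_arr = np.array(X)
    x0 = X_arr[row_index]
    scale = X_arr.std(axis=0) + 1e-8
    # perturb the record with gaussian noise scaled to each feature
    samples = x0 + rng.normal(size=(n_samples, len(x0))) * scale
    preds = predict_fn(samples)
    # weight perturbed samples by proximity to the original record
    dist = np.sqrt((((samples - x0) / scale) ** 2).sum(axis=1))
    kernel_width = kernel_width or 0.75 * np.sqrt(len(x0))
    weights = np.exp(-(dist ** 2) / (kernel_width ** 2))
    # the surrogate's coefficients are the local explanation
    surrogate = Ridge(alpha=1.0).fit(samples, preds, sample_weight=weights)
    return dict(zip(X.columns, surrogate.coef_))

# hypothetical usage: local_weights = lime_like_explanation(model.predict, X, 60)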
!pip install lime
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X),
                                                   mode='regression',
                                                  feature_names=X.columns,
                                                  class_names=['price'],
                                                  verbose=True)
We are using the transformed automobile price data prepared earlier in this chapter.
explainer.feature_selection
# asking for explanation for LIME model
i = 60
exp = explainer.explain_instance(np.array(X)[i],
                                 model.predict,
                                 num_features=14
                                )
model.predict(X)[60]
X[60:61]
Intercept 2.412781377314505
Prediction_local [26.44019841]
Right: 18.91681746836109
exp.show_in_notebook(show_table=True)

A LIME local explanation bar chart and table. The predicted value is 16.50. A table with 2 columns of feature and value has 14 rows. The minimum and maximum bar values in the graph are Transmission_Manual, negative 2.91, and powerBhp, 19.96, respectively.

Figure 4-33

Local explanation for the 60th record from the dataset

[('powerBhp > 138.10', 11.685972887206468), ('Age <= 4.00', 5.069171125183003), ('engineCC > 1984.00', 3.2307037317922287), ('0.00 < Transmission_Manual <= 1.00', -2.175314285519644), ('Odometer <= 34000.00', 2.0903883419638976), ('OwnerType_Fourth +ACY- Above <= 0.00', 1.99286243362804), ('Location_Hyderabad <= 0.00', -1.4395857770864107), ('mileage <= 15.30', 1.016369130009493), ('0.00 < FuelType_Diesel <= 1.00', 0.8477072936504322), ('Location_Kolkata <= 0.00', 0.6908993069146472), ('FuelType_Petrol <= 0.00', 0.654629068871846), ('Location_Bangalore <= 0.00', -0.47395963805113284), ('FuelType_Electric <= 0.00', 0.4285429019735695), ('Location_Delhi <= 0.00', 0.40903051200940277)]

Recipe 4-13. ELI5 Explainer for Tabular Data

Problem

You want to use the ELI5 library for generating explanations of a linear regression model.

Solution

ELI5 is a Python package that helps to debug a machine learning model and explain the predictions. It provides support for all machine learning models supported by the scikit-learn library.

How It Works

Let’s take a look at the following example:
pip install eli5
import eli5
eli5.show_weights(model,
                 feature_names=list(X.columns))

Weight | Feature
0.4385 | powerBhp
0.2572 | Age
0.0976 | engineCC
0.0556 | Odometer
0.0489 | Mileage
0.0396 | Transmission_Manual
0.0167 | FuelType_Petrol
0.0165 | FuelType_Diesel
0.0104 | Location_Hyderabad
0.0043 | Location_Coimbatore
0.0043 | Location_Kolkata
0.0035 | Location_Kochi
0.0025 | Location_Bangalore
0.0021 | Location_Mumbai
0.0014 | Location_Delhi
0.0006 | OwnerType_Third
0.0003 | OwnerType_Second
0.0000 | FuelType_Electric
0 | OwnerType_Fourth +ACY- Above
0 | Location_Pune

eli5.explain_weights(model, feature_names=list(X.columns))

Weight | Feature
0.4385 | powerBhp
0.2572 | Age
0.0976 | engineCC
0.0556 | Odometer
0.0489 | Mileage
0.0396 | Transmission_Manual
0.0167 | FuelType_Petrol
0.0165 | FuelType_Diesel
0.0104 | Location_Hyderabad
0.0043 | Location_Coimbatore
0.0043 | Location_Kolkata
0.0035 | Location_Kochi
0.0025 | Location_Bangalore
0.0021 | Location_Mumbai
0.0014 | Location_Delhi
0.0006 | OwnerType_Third
0.0003 | OwnerType_Second
0.0000 | FuelType_Electric
0 | OwnerType_Fourth +ACY- Above
0 | Location_Pune

from eli5.sklearn import PermutationImportance
perm = PermutationImportance(model)
perm.fit(X, y)
eli5.show_weights(perm,feature_names=list(X.columns))

Weight | Feature
0.6743 ± 0.0242 | powerBhp
0.2880 ± 0.0230 | Age
0.1188 ± 0.0068 | engineCC
0.0577 ± 0.0028 | Transmission_Manual
0.0457 ± 0.0048 | Odometer
0.0354 ± 0.0053 | mileage
0.0134 ± 0.0018 | Location_Hyderabad
0.0082 ± 0.0022 | FuelType_Petrol
0.0066 ± 0.0013 | FuelType_Diesel
0.0042 ± 0.0010 | Location_Kochi
0.0029 ± 0.0006 | Location_Kolkata
0.0023 ± 0.0010 | Location_Coimbatore
0.0017 ± 0.0002 | Location_Bangalore
0.0014 ± 0.0005 | Location_Mumbai
0.0014 ± 0.0006 | Location_Delhi
0.0007 ± 0.0001 | OwnerType_Third
0.0002 ± 0.0000 | OwnerType_Second
0.0000 ± 0.0000 | FuelType_Electric
0 ± 0.0000 | Location_Chennai
0 ± 0.0000 | FuelType_LPG

For linear models, the results table also includes a BIAS entry, which can be interpreted as the intercept term. The other features are listed in descending order of importance based on their weight. The show_weights function provides a global interpretation of the model, and the show_prediction function provides a local interpretation for a single record from the training set, as shown in the sketch that follows.
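This is a minimal, hedged example for a single record; it uses the scikit-learn random forest (rforest) from Recipe 4-10, since eli5's prediction-level support for other model libraries varies, and the record index is arbitrary.

import eli5
# local explanation for one record; feature_names keeps the output readable
eli5.show_prediction(rforest, X.iloc[10], feature_names=list(X.columns))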

Recipe 4-14. How the Permutation Model in ELI5 Works

Problem

You want to make sense of the ELI5 permutation library.

Solution

The solution to this problem is to use a dataset and a trained model.

How It Works

The permutation model in the ELI5 library works only for global interpretation. First it takes the trained model and computes a baseline error on the dataset. Then it shuffles the values of one feature, scores the model again on the shuffled data, and computes the new error. It compares the error before and after shuffling: a feature is considered important if the error increase after shuffling is large, and unimportant if it is small. The result displays the average importance of each feature and its standard deviation across multiple shuffles. A minimal reproduction of this procedure is sketched below.
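The sketch mirrors the mechanism described above rather than ELI5's internal code; model, X, and y refer to the regression objects from the earlier recipes.

import numpy as np
from sklearn.metrics import mean_squared_error

def manual_permutation_importance(model, X, y, n_repeats=5, random_state=0):
    rng = np.random.RandomState(random_state)
    baseline = mean_squared_error(y, model.predict(X))
    results = {}
    for col in X.columns:
        deltas = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            X_shuffled[col] = rng.permutation(X_shuffled[col].values)  # break the feature's link with y
            deltas.append(mean_squared_error(y, model.predict(X_shuffled)) - baseline)
        results[col] = (np.mean(deltas), np.std(deltas))  # mean error increase and its spread
    return results

# features with a large mean error increase are important; near-zero increases mean little contribution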

Recipe 4-15. Global Explanation for Ensemble Classification Models

Problem

You want to explain the predictions generated from a classification model using ensemble models.

Solution

Classification models predict the probabilities of a binary or multinomial target variable. In this particular recipe, we are using a churn classification dataset with two outcomes: whether the customer is likely to churn or not. Let's use ensemble models such as the explainable boosting machine classifier, the extreme gradient boosting classifier, the random forest classifier, and the catboost classifier.

How It Works

Let’s take a look at the following example. The key is to get the SHAP explanation object, which contains the base values, the SHAP values, and the data. Using the SHAP values, we can create various explanations with graphs and figures; per-record SHAP values give local explanations, and aggregating them gives the global view.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import tree, metrics, model_selection, preprocessing
from sklearn.metrics import confusion_matrix, classification_report
df_train = pd.read_csv('https://raw.githubusercontent.com/pradmishra1/PublicDatasets/main/ChurnData_test.csv')
from sklearn.preprocessing import LabelEncoder
tras = LabelEncoder()
df_train['area_code_tr'] = tras.fit_transform(df_train['area_code'])
df_train.columns
del df_train['area_code']
df_train.columns
df_train['target_churn_dum'] = pd.get_dummies(df_train.churn,prefix='churn',drop_first=True)
df_train.columns
del df_train['international_plan']
del df_train['voice_mail_plan']
del df_train['churn']
df_train.info()
del df_train['Unnamed: 0']
df_train.columns
from sklearn.model_selection import train_test_split
X = df_train[['account_length', 'number_vmail_messages', 'total_day_minutes',
       'total_day_calls', 'total_day_charge', 'total_eve_minutes',
       'total_eve_calls', 'total_eve_charge', 'total_night_minutes',
       'total_night_calls', 'total_night_charge', 'total_intl_minutes',
       'total_intl_calls', 'total_intl_charge',
       'number_customer_service_calls', 'area_code_tr']]
Y = df_train['target_churn_dum']
import pandas as pd
from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
xtrain,xtest,ytrain,ytest=train_test_split(X,Y,test_size=0.20,stratify=Y)
ebm = ExplainableBoostingClassifier(random_state=12)
ebm.fit(xtrain, ytrain)
ebm_global = ebm.explain_global()
show(ebm_global)
ebm_local = ebm.explain_local(xtest[:5], ytest[:5])
show(ebm_local)
print("training accuracy:", ebm.score(xtrain,ytrain)) #training accuracy
print("test accuracy:",ebm.score(xtest,ytest)) # test accuracy
show(ebm_global)
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
from interpret import show
X100 = X[:100]
# explain the GAM model with SHAP
explainer_ebm = shap.Explainer(ebm.predict, X100)
shap_values_ebm = explainer_ebm(X100)
ebm_global = ebm.explain_global()
show(ebm_global)
import numpy as np
pd.DataFrame(np.round(shap_values_ebm.values,2)).head(2)

Recipe 4-16. Partial Dependency Plot for a Nonlinear Classifier

Problem

You want to show feature associations with the class probabilities using a nonlinear classifier.

Solution

The class probabilities in this example are related to predicting the probability of churn. The SHAP value for a feature can be plotted against the feature value to show a scatter chart that displays the correlation, positive or negative, and the strength of associations.

How It Works

Let’s take a look at the following example:
# make a standard partial dependence plot with a single SHAP value overlaid
sample_ind = 20
fig,ax = shap.partial_dependence_plot(
    "number_customer_service_calls", ebm.predict, X100, model_expected_value=True,
    feature_expected_value=True, show=False, ice=False,
    shap_values=shap_values_ebm[sample_ind:sample_ind+1,:]
)

A SHAP partial dependence plot of the number of customer service calls. The line moves horizontally from (0, 0.05), rises vertically at (3.5, 0.05), and ends at (5, 0.5). A point is indicated at (1, 0.4). The lines for E customer calls and E f of x intersect at (1.5, 0.6). The highest bar is at (1, 0.23). All values are estimated.

Figure 4-34

Partial dependence plot of the number of customer service calls with a SHAP value overlaid

# make a standard partial dependence plot with a single SHAP value overlaid
sample_ind = 20
fig,ax = shap.partial_dependence_plot(
    "number_vmail_messages", ebm.predict, X100, model_expected_value=True,
    feature_expected_value=True, show=False, ice=False,
    shap_values=shap_values_ebm[sample_ind:sample_ind+1,:]
)

A SHAP partial dependence plot of the number of voicemail messages. The line moves straight from (negative 2.5, 0.07) and falls vertically at (4, 0.07), moves straight up to (43, 0.02), and ends at (43, 0.04). A point is indicated at (35, 0.35). The lines for E customer calls and E f of x intersect at (8, 0.05). The highest bar is at (0, 0.58). All values are estimated.

Figure 4-35

Number of voicemail messages and SHAP values

Recipe 4-17. Global Feature Importance from the Nonlinear Classifier

Problem

You want to get the global feature importance for the nonlinear (EBM) classification model.

Solution

The solution to this problem is to use the explainer's log-odds output.

How It Works

Let’s take a look at the following example:
shap.plots.scatter(shap_values_ebm)

16 scatterplots of SHAP values of features. The clusters of points are mostly straight at approximately 0.0 SHAP value. A few points move up to 0.6 and 0.8 SHAP values. Each plot has a histogram in the background.

Figure 4-36

All features SHAP values plotted together

# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.waterfall(shap_values_ebm[sample_ind], max_display=14)

A SHAP waterfall plot. The maximum value is 0 for features total international charge, night calls, and account length. The minimum value is negative 0.02 for customer service calls. The lines for f of x and E f of x are at 0 and 0.05 respectively. All values are estimated.

Figure 4-37

Local explanation for record 20

The interpretation goes like this: when we change the value of a feature by one unit, the model produces two odds, the base odds and the odds with the incremented feature value, and we look at how the ratio of the two changes with every increase or decrease in the feature. From the global feature importance, three features stand out: the number of customer service calls, the total day minutes, and the number of voicemail messages. A small numeric example of the odds interpretation follows.
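With hypothetical numbers: a contribution of 0.7 on the log-odds scale roughly doubles the odds of churn, and a contribution of -0.7 roughly halves them.

import numpy as np

log_odds_contribution = 0.7                      # hypothetical contribution on the log-odds scale
print(round(np.exp(log_odds_contribution), 2))   # 2.01 -> the odds of churn roughly double
print(round(np.exp(-log_odds_contribution), 2))  # 0.5  -> the odds of churn roughly halve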
# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.beeswarm(shap_values_ebm, max_display=14)

A SHAP bee swarm plot. The high feature values of the number of customer service calls and voicemail messages have maximum and minimum values of almost 0.79 and negative 0.4, respectively.

Figure 4-38

SHAP values from EBM model on model predictions

Recipe 4-18. XGBoost Model Explanation

Problem

You want to explain an extreme gradient boosting model, which is a sequential boosting model.

Solution

The model explanation can be done using SHAP; however, one of the limitations of SHAP is that using the full data to create global and local explanations can be slow. We take a subset of the data if a smaller machine is allocated and the full dataset if the machine configuration supports it.

How It Works

Let’s take a look at the following example:
# train XGBoost model
import xgboost
model = xgboost.XGBClassifier(n_estimators=100, max_depth=2).fit(X, Y)
# compute SHAP values
explainer = shap.Explainer(model, X)
shap_values = explainer(X)
# make a standard partial dependence plot with a single SHAP value overlaid
sample_ind = 18
fig,ax = shap.partial_dependence_plot(
    "account_length", model.predict, X, model_expected_value=True,
    feature_expected_value=True, show=False, ice=False,
    shap_values=shap_values[sample_ind:sample_ind+1,:]
)

A SHAP partial dependence plot of account length. The line moves straight between (0, 0) and (250, 0). A point is indicated at (0, 11). The lines for E account length and E f of x intersect at (99, 0). All values are estimated.

Figure 4-39

Partial dependency plot for 18th record from training dataset

import numpy as np
pd.DataFrame(np.round(shap_values.values,2)).head(2)
# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.waterfall(shap_values[sample_ind], max_display=14)

A SHAP waterfall plot. The features, the number of customer service calls, and voicemail messages have the maximum and the minimum values of 2.01 and negative 0.54, respectively. The lines for f of x and E f of x are at negative 1.352 and negative 2.239 respectively.

Figure 4-40

Local explanation for 18th record

# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.scatter(shap_values[:,"account_length"])

A scatterplot has SHAP values versus account length. The horizontal clusters of points rise between approximately (0, negative 0.1) and (200, 0.3).

Figure 4-41

Distribution of account length versus SHAP values

# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.scatter(shap_values[:,"number_vmail_messages"])

A scatterplot has SHAP values versus the number of voicemail messages. The points are mostly between 15 and 45 on the X-axis and negative 1.0 and negative 0.25 on the Y-axis. Another cluster is between (0, 0.0) and (0, 0.5). All values are estimated.

Figure 4-42

Distribution of number of voicemail messages versus its SHAP values

# the waterfall_plot shows how we get from explainer.expected_value to model.predict(X)[sample_ind]
shap.plots.beeswarm(shap_values, max_display=14)

A SHAP bee swarm plot. The high feature values of the total day minutes and the number of voicemail messages have the maximum and the minimum SHAP values of almost 4 and negative 2, respectively.

Figure 4-43

SHAP value impact on model output

shap.plots.bar(shap_values)

A horizontal bar graph of features versus the mean SHAP values has 10 bars. The features, total day minutes, and total night calls have maximum and minimum values of 0.49 and 0.06, respectively.

Figure 4-44

Absolute average SHAP values shows importance of features

shap.plots.heatmap(shap_values[:5000])

A SHAP heatmap plots features versus instances. A gradient scale has SHAP values between negative 1.98 to 1.98. The total day minutes feature has the highest density of SHAP values. A fluctuating line labeled f of x moves almost straight at the top with a peak at 1600.

Figure 4-45

Distribution of density of all features with their SHAP values

shap.plots.scatter(shap_values[:,"total_day_minutes"])

A scatterplot of SHAP value versus total day minutes. The clusters of points move straight between (50, 0) and (150, 0), decline slightly, and then fewer points rise till (300, 4). All values are estimated.

Figure 4-46

Distribution of feature total day minutes with SHAP values

shap.plots.scatter(shap_values[:,"total_day_minutes"], color=shap_values[:,"account_length"])

A scatterplot of SHAP values and account length versus total day minutes. A gradient scale for account length has values between 40 and 160. The points of various shades move straight between (50, 0) and (150, 0), decline slightly, and then rise till (300, 4). All values are estimated.

Figure 4-47

Three-dimensional view of SHAP values

Recipe 4-19. Explain a Random Forest Classifier

Problem

You want to get faster explanations from global and local explainable libraries using a random forest classifier. A random forest builds a family of trees as estimators and combines their predictions, using the majority voting rule for classification.

Solution

The model explanation can be done using SHAP; however, one of the limitations of SHAP is that using the full data to create global and local explanations can be slow.

How It Works

Let’s take a look at the following example:
import shap
from sklearn.ensemble import RandomForestClassifier
rforest = RandomForestClassifier(n_estimators=100, max_depth=3, min_samples_split=20, random_state=0)
rforest.fit(X, Y)
# explain all the predictions in the test set
explainer = shap.TreeExplainer(rforest)
shap_values = explainer.shap_values(X)
shap.dependence_plot("account_length", shap_values[0], X)

A scatterplot of SHAP values and total day minutes versus account length. A gradient scale for total day minutes has values between 100 and 260. The points of various shades decline between (25, 0.0050) and (225, negative 0.0125). All values are estimated.

Figure 4-48

Dependence plot from SHAP

shap.partial_dependence_plot(
    "total_day_minutes", rforest.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)

A SHAP partial dependence plot of total day minutes. The line moves horizontally from (0, 0.00), rises up to (300, 0.05), and ends at (350, 0.05). The highest bar is at (175, 0). The lines for E total day minutes and E f of x intersect at (175, 0.38). All values are estimated.

Figure 4-49

Partial dependence plot of total day minutes

shap.summary_plot(shap_values, X)

A horizontally stacked bar graph of features versus mean SHAP value. The class 0 and 1 bars of the number of customer service calls are the highest at around 0.06 and the class 1 bar of area code t r is the lowest at almost 0.00.

Figure 4-50

Feature importance for two classes separately based on absolute average SHAP value

Recipe 4-20. Catboost Model Interpretation for Classification Scenario

Problem

You want to get an explanation for the catboost model–based binary classification problem.

Solution

The model explanation can be done using SHAP; however, using the full data to create global and local explanations usually takes more time. Hence, to speed up the generation of local and global explanations when millions of records are used to train a model, LIME is very useful. Catboost needs the number of iterations to be defined.

How It Works

Let’s take a look at the following example:
model = CatBoostClassifier(iterations=10, learning_rate=0.1, random_seed=12)
model.fit(X, Y, verbose=True, plot=False)
0: learn: 0.6381393    total: 10.2ms    remaining: 91.9ms
1: learn: 0.5900921    total: 20.1ms    remaining: 80.2ms
2: learn: 0.5517727    total: 29.9ms    remaining: 69.8ms
3: learn: 0.5166202    total: 39.9ms    remaining: 59.9ms
4: learn: 0.4872410    total: 49.9ms    remaining: 49.9ms
5: learn: 0.4632012    total: 60.1ms    remaining: 40ms
6: learn: 0.4414588    total: 69.8ms    remaining: 29.9ms
7: learn: 0.4222780    total: 79.6ms    remaining: 19.9ms
8: learn: 0.4073681    total: 89.5ms    remaining: 9.95ms
9: learn: 0.3915051    total: 99.5ms    remaining: 0us
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(Pool(X, Y))
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
shap.force_plot(explainer.expected_value, shap_values[91,:], X.iloc[91,:])
shap.summary_plot(shap_values, X)

A SHAP bee swarm plot for 16 features. The high feature values of customer service calls and voicemail messages have the maximum and the minimum SHAP values of almost 0.9 and negative 0.2, respectively.

Figure 4-51

SHAP value impact on model output

Recipe 4-21. Local Explanations Using LIME

Problem

You want to get faster explanations from global and local explainable libraries.

Solution

The model explanation can be done using SHAP; however, using the full data to create global and local explanations usually takes more time. Hence, to speed up the generation of local and global explanations when millions of records are used to train a model, LIME is very useful.

How It Works

Let’s take a look at the following example:
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(xtrain),
                    feature_names=list(xtrain.columns),
                    class_names=['target_churn_dum'],
                    verbose=True, mode='classification')
# this record is a no churn scenario
exp = explainer.explain_instance(X.iloc[0], model.predict_proba, num_features=16)
exp.as_list()
Intercept 0.2758028503306529
Prediction_local [0.34562036]
Right: 0.23860629814459952
[('number_customer_service_calls > 2.00', 0.06944779279619419),
 ('total_day_minutes <= 144.10', -0.026032556397868205),
 ('area_code_tr > 1.00', 0.012192087473855579),
 ('total_day_charge <= 24.50', -0.01049348495191592),
 ('total_night_charge > 10.57', 0.009208937152255816),
 ('total_eve_calls <= 88.00', 0.007763649795450518),
 ('17.12 < total_eve_charge <= 19.74', 0.006648493070415344),
 ('number_vmail_messages <= 0.00', 0.0054214568436186375),
 ('98.00 < account_length <= 126.00', 0.004192090777110732),
 ('2.81 < total_intl_charge <= 3.21', -0.004030006982470514),
 ('201.40 < total_eve_minutes <= 232.20', -0.0039743556975642405),
 ('total_night_minutes > 234.80', 0.0035628982403953778),
 ('total_night_calls <= 86.00', 0.0029612465055136334),
 ('total_day_calls > 112.00', -0.0028523783898236925),
 ('total_intl_calls <= 3.00', -0.002506612124522332),
 ('10.40 < total_intl_minutes <= 11.90', -0.0016917444417898933)]
exp.show_in_notebook(show_table=True)

A LIME local bar chart and table for a no-churn scenario. The predicted probabilities for the target and others are 0.76 and 0.24. A table with 2 columns, feature and value, has 14 rows. The minimum and maximum bar values in the chart are total day minutes at negative 0.03 and number of customer service calls at 0.07, respectively.

Figure 4-52

Local explanation for record number 1 from test set
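show_in_notebook renders the explanation inside a Jupyter notebook; outside a notebook, the same explanation can be drawn with matplotlib. The following is a minimal sketch, assuming LIME's as_pyplot_figure helper and an arbitrary output file name:

# Draw the same local explanation as a matplotlib figure and save it to disk
fig = exp.as_pyplot_figure()
fig.savefig('lime_record_0.png', bbox_inches='tight')  # file name is arbitrary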

# This is a churn scenario
exp = explainer.explain_instance(X.iloc[20], model.predict_proba, num_features=16)
exp.as_list()
Intercept 0.32979383442829424
Prediction_local [0.22940692]
Right: 0.25256892775050466
[('number_customer_service_calls <= 1.00', -0.03195279452926141),
 ('144.10 < total_day_minutes <= 181.00', -0.03105192670898253),
 ('total_intl_charge > 3.21', 0.010519683979779627),
 ('101.00 < total_eve_calls <= 114.00', -0.008871850152517477),
 ('0.00 < area_code_tr <= 1.00', -0.008355187259945206),
 ('total_intl_minutes > 11.90', 0.007391379556830906),
 ('24.50 < total_day_charge <= 30.77', -0.006975112181235882),
 ('total_night_charge <= 7.56', -0.006500647887830215),
 ('total_eve_charge <= 14.14', -0.006278552413626889),
 ('number_vmail_messages > 0.00', -0.0062185929677679875),
 ('total_night_minutes <= 167.90', -0.003079244107811434),
 ('4.00 < total_intl_calls <= 5.00', -0.0026984920221149998),
 ('total_day_calls > 112.00', -0.0024708590253414045),
 ('total_eve_minutes <= 166.40', -0.002156339757484174),
 ('98.00 < account_length <= 126.00', -0.0013292154399683106),
 ('86.00 < total_night_calls <= 99.00', -0.00035916152353229)]
exp.show_in_notebook(show_table=True)

A LIME local bar chart and table for a churn scenario. The predicted probabilities for the target and others are 0.75 and 0.25. A table with 2 columns, feature and value, has 14 rows. The minimum and maximum bar values in the chart are total day minutes at negative 0.03 and total international charge at 0.01, respectively.

Figure 4-53

Local explanation for record number 20 from the test set

In a similar fashion, explanations can be generated for any other record from the training sample or the test sample, as sketched below.
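A minimal sketch of that idea, looping over a few arbitrarily chosen records and collecting their LIME explanations as plain Python lists (the record indices and num_features value are illustrative only):

# Collect LIME explanations for a handful of records
local_explanations = {}
for idx in [0, 20, 50]:          # arbitrary record indices
    exp = explainer.explain_instance(X.iloc[idx], model.predict_proba,
                                     num_features=16)
    local_explanations[idx] = exp.as_list()

# The list is sorted by absolute weight, so the first entry is the strongest contributor
for idx, contributions in local_explanations.items():
    print(f"Record {idx}: top feature -> {contributions[0]}")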

Recipe 4-22. Model Explanations Using ELI5

Problem

You want to get model explanations using the ELI5 library.

Solution

ELI5 provides two functions, show_weights and show_prediction, to generate model explanations.

How It Works

Let’s take a look at the following example:
import eli5

# Global feature importance weights from the trained model
eli5.show_weights(model,
                 feature_names=list(X.columns))

Weight      Feature
0.3703      total_day_minutes
0.2426      number_customer_service_calls
0.1181      total_day_charge
0.0466      total_eve_charge
0.0427      number_vmail_messages
0.0305      total_eve_minutes
0.0264      total_eve_calls
0.0258      total_intl_minutes
0.0190      total_night_minutes
0.0180      total_night_charge
0.0139      total_intl_charge
0.0133      area_code_tr
0.0121      total_day_calls
0.0110      total_intl_calls
0.0077      total_night_calls
0.0019      account_length

eli5.explain_weights(model, feature_names=list(X.columns))

The output is identical to the table produced by show_weights above.

from eli5.sklearn import PermutationImportance

# Permutation importance: shuffle each feature and measure the drop in score
perm = PermutationImportance(model)
perm.fit(X, Y)
eli5.show_weights(perm, feature_names=list(X.columns))

Weight              Feature
0.0352 ± 0.0051     total_day_minutes
0.0250 ± 0.0006     total_day_charge
0.0121 ± 0.0024     number_vmail_messages
0.0110 ± 0.0051     total_eve_charge
0.0052 ± 0.0048     total_night_minutes
0.0028 ± 0.0025     total_night_charge
0.0023 ± 0.0009     total_eve_calls
0.0022 ± 0.0012     number_customer_service_calls
0.0022 ± 0.0018     total_eve_minutes
0.0019 ± 0.0012     total_night_calls
0.0018 ± 0.0015     total_day_calls
0.0017 ± 0.0019     total_intl_minutes
0.0011 ± 0.0016     area_code_tr
0.0008 ± 0.0012     total_intl_charge
0.0005 ± 0.0018     total_intl_calls
-0.0010 ± 0.0018    account_length
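The solution also mentions show_prediction, which explains a single row rather than the whole model. Per-prediction support in ELI5 differs across model libraries, so the following is only a sketch using a hypothetical scikit-learn random forest (rf_model) fit on the same data, not the CatBoost model above:

from sklearn.ensemble import RandomForestClassifier
import eli5

# Hypothetical scikit-learn model used only to illustrate show_prediction
rf_model = RandomForestClassifier(n_estimators=100, random_state=12)
rf_model.fit(X, Y)

# Feature-level contributions to the prediction for the first record
eli5.show_prediction(rf_model, X.iloc[0], feature_names=list(X.columns))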

Recipe 4-23. Multiclass Classification Model Explanation

Problem

You want to get model explanations for multiclass classification problems.

Solution

The expectation for multiclass classification is first to build a robust model, handling categorical features if any, and then to explain its predictions. In a binary classification problem we can get the class probabilities, and most ensemble models can also produce feature importances corresponding to each class. Here is an example of a CatBoost model used to generate the feature importance corresponding to each class in a multiclass classification problem.

How It Works

Let’s take a look at the following example. We are going to use a dataset from the UCI ML repository. The URL to access the dataset is given in the following script:
import pandas as pd
df_red = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',sep=';')
df_white = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv',sep=';')
features = ['fixed_acidity','volatile_acidity','citric_acid','residual_sugar',
            'chlorides','free_sulfur_dioxide','total_sulfur_dioxide','density',
            'pH','sulphates','alcohol','quality']
df = pd.concat([df_red,df_white],axis=0)
df.columns = features
df.quality = pd.Categorical(df.quality)
df.head()
y = df.pop('quality')
X = df
from catboost import CatBoostClassifier, Pool
import shap
shap.initjs()

# Multiclass CatBoost model on the combined wine-quality data
model = CatBoostClassifier(loss_function='MultiClass',
                           iterations=300,
                           learning_rate=0.1,
                           random_seed=123)
model.fit(X, y, verbose=False, plot=False)
explainer = shap.TreeExplainer(model)
# For a multiclass model, shap_values is a list with one array per class
shap_values = explainer.shap_values(Pool(X, y))
set(y)
{3, 4, 5, 6, 7, 8, 9}
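Because the quality labels run from 3 to 9, it helps to confirm which label each element of shap_values explains. A minimal sketch, assuming the fitted CatBoostClassifier exposes its class order via the classes_ attribute:

# Map each shap_values index to the quality label it explains
for idx, label in enumerate(model.classes_):
    print(f"shap_values[{idx}] -> quality = {label}")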
shap.summary_plot(shap_values[0], X)

A SHAP bee swarm plot for 11 features. The high feature value of free sulfur dioxide and alcohol have the maximum and minimum SHAP values of almost 2.0 and negative 0.75, respectively.

Figure 4-54

SHAP value impact with respect to class 0 from the target variable

shap.summary_plot(shap_values[1], X)

A SHAP bee swarm plot for 11 features. The low feature value of free sulfur dioxide has a maximum SHAP value of almost 2.0. The high feature value of alcohol has a minimum SHAP value of around negative 1.5.

Figure 4-55

SHAP summary plot for class 1 from the target variable

shap.summary_plot(shap_values[2], X)

A SHAP bee swarm plot for 11 features. The low and the high feature values of alcohol have the maximum and the minimum SHAP values of almost 1.0 and negative 2.5, respectively.

Figure 4-56

SHAP summary plot for class 2 from the target variable

Conclusion

In this chapter, we discussed ensemble model explanations. The models covered were the explainable boosting regressor, the explainable boosting classifier, the extreme gradient boosting regressor and classifier, the random forest regressor and classifier, and the CatBoost classifier and regressor. The graphs and charts may sometimes look similar, but they differ for two reasons. First, the data points available to SHAP for plotting depend on the sample size selected to generate the explanations. Second, the sample models were trained with few iterations and basic hyperparameters; on a machine with a higher configuration, full hyperparameter tuning can be performed, and better SHAP values can be produced.

In the next chapter, we will cover explainability for natural language tasks such as text classification and sentiment analysis and explain their predictions.
