© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
P. Mishra, Explainable AI Recipes, https://doi.org/10.1007/978-1-4842-9029-3_2

2. Explainability for Linear Supervised Models

Pradeepta Mishra
Bangalore, Karnataka, India

A supervised learning model is a model that is used to train an algorithm to map input data to output data. A supervised learning model can be of two types: regression or classification. In a regression scenario, the output variable is numerical, whereas with classification, the output variable is binary or multinomial. A binary output variable has two outcomes, such as true and false, accept and reject, yes and no, etc. In the case of multinomial output variables, the outcome can be more than two, such as high, medium, and low. In this chapter, we are going to use explainability libraries to explain a regression model and a classification model, both trained as linear models.

In the classical predictive modeling scenario, a function is identified up front, and the input data is fit to that predetermined function to produce the output. In a modern predictive modeling scenario, the input data and the output are both shown to a group of candidate functions, and the machine identifies the function that best approximates the output for a given set of inputs. There is a need to explain the output of machine learning and deep learning models when performing regression and classification tasks. Linear regression and linear classification models are simpler to explain.

The goal of this chapter is to introduce various explainability libraries and techniques for linear models, such as feature importance, partial dependency plots, and local interpretation.

Recipe 2-1. SHAP Values for a Regression Model on All Numerical Input Variables

Problem

You want to explain a regression model built on all the numeric features of a dataset.

Solution

A regression model on all the numeric features is trained, and then the trained model will be passed through SHAP to generate global explanations and local explanations.

How It Works

Let's take a look at the following script. The Shapley value is also referred to as the SHAP value, and it is used to explain the model. It attributes the model's prediction fairly among the input features using ideas from cooperative game theory: the input features from the dataset are treated as the players in the game, and the model's function is treated as the rules of the game. The Shapley value of a feature is computed based on the following steps:
  1. SHAP requires model retraining on all feature subsets; hence, it usually takes time if the explanation has to be generated for larger datasets.
  2. Identify a feature set from a list of features (let's say there are 15 features, and we can select a subset with 5 features).
  3. For any particular feature, two models using the subset of features will be created, one with the feature and another without the feature.
  4. Then the prediction differences will be computed.
  5. The differences in prediction are computed for all possible subsets of features.
  6. The weighted average of all possible differences is used to populate the feature importance.
If the weight of a feature is 0.000, then we can conclude that the feature is not important and does not contribute to the model. If it is not equal to 0.000, then we can conclude that the feature plays a role in the prediction process.
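The following is a minimal sketch of the with/without-feature idea in steps 3 and 4, shown on a small synthetic dataset. The toy data, the column names, and the single chosen subset are illustrative assumptions; this is not the SHAP library's exact algorithm, which averages such differences over all subsets with appropriate weights.
# minimal sketch of steps 3 and 4: marginal contribution of one feature
# for one record and one feature subset (toy data, illustrative only)
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['f1', 'f2', 'f3'])
y_toy = 3 * X_toy['f1'] + 0.5 * X_toy['f2'] + rng.normal(size=200)

feature = 'f1'
subset = ['f2', 'f3']   # a subset of the remaining features
# one model trained with the feature and another without it
model_with = LinearRegression().fit(X_toy[subset + [feature]], y_toy)
model_without = LinearRegression().fit(X_toy[subset], y_toy)
# prediction difference for record 0 = marginal contribution of f1 for this subset
record = X_toy.iloc[[0]]
diff = (model_with.predict(record[subset + [feature]]) - model_without.predict(record[subset]))[0]
print('Marginal contribution of', feature, 'for record 0:', round(diff, 4))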

We are going to use a dataset from the UCI machine learning repository. The URL to access the dataset is as follows:

https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction

The objective is to predict the appliances’ energy use in Wh, using the features from sensors. There are 27 features in the dataset, and here we are trying to understand what features are important in predicting the energy usage. See Table 2-1.
Table 2-1 Feature Description from the Energy Prediction Dataset

Feature Name | Description | Unit
Appliances | Energy use | In Wh
lights | Energy use of light fixtures in the house | In Wh
T1 | Temperature in kitchen area | In Celsius
RH_1 | Humidity in kitchen area | In %
T2 | Temperature in living room area | In Celsius
RH_2 | Humidity in living room area | In %
T3 | Temperature in laundry room area | In Celsius
RH_3 | Humidity in laundry room area | In %
T4 | Temperature in office room | In Celsius
RH_4 | Humidity in office room | In %
T5 | Temperature in bathroom | In Celsius
RH_5 | Humidity in bathroom | In %
T6 | Temperature outside the building (north side) | In Celsius
RH_6 | Humidity outside the building (north side) | In %
T7 | Temperature in ironing room | In Celsius
RH_7 | Humidity in ironing room | In %
T8 | Temperature in teenager room 2 | In Celsius
RH_8 | Humidity in teenager room 2 | In %
T9 | Temperature in parents room | In Celsius
RH_9 | Humidity in parents room | In %
T_out | Temperature outside (from the Chievres weather station) | In Celsius
Press_mm_hg | Pressure (from the Chievres weather station) | In mm Hg
RH_out | Humidity outside (from the Chievres weather station) | In %
Windspeed | Wind speed (from the Chievres weather station) | In m/s
Visibility | Visibility (from the Chievres weather station) | In km
Tdewpoint | Dew point temperature (from the Chievres weather station) | In °C
rv1 | Random variable 1 | Nondimensional
rv2 | Random variable 2 | Nondimensional

import pandas as pd
# load the energy prediction dataset from the UCI repository
df_lin_reg = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv')
# the date column is not used as a model feature
del df_lin_reg['date']
df_lin_reg.info()
df_lin_reg.columns
Index(['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'rv1', 'rv2'], dtype='object')
#y is the dependent variable, that we need to predict
y = df_lin_reg.pop('Appliances')
# X is the set of input features
X = df_lin_reg
import pandas as pd
import shap
import sklearn
# a simple linear model initialized
model = sklearn.linear_model.LinearRegression()
# linear regression model trained
model.fit(X, y)
print("Model coefficients: ")
for i in range(X.shape[1]):
    print(X.columns[i], "=", model.coef_[i].round(5))
Model coefficients:
lights = 1.98971
T1 = -0.60374
RH_1 = 15.15362
T2 = -17.70602
RH_2 = -13.48062
T3 = 25.4064
RH_3 = 4.92457
T4 = -3.46525
RH_4 = -0.17891
T5 = -0.02784
RH_5 = 0.14096
T6 = 7.12616
RH_6 = 0.28795
T7 = 1.79463
RH_7 = -1.54968
T8 = 8.14656
RH_8 = -4.66968
T9 = -15.87243
RH_9 = -0.90102
T_out = -10.22819
Press_mm_hg = 0.13986
RH_out = -1.06375
Windspeed = 1.70364
Visibility = 0.15368
Tdewpoint = 5.0488
rv1 = -0.02078
rv2 = -0.02078
# compute the SHAP values for the linear model
explainer = shap.Explainer(model.predict, X)
# SHAP value calculation
shap_values = explainer(X)
Permutation explainer: 19736it [16:15, 20.08it/s]

This part of the script takes time, as it is a computationally intensive process. The explainer function computes permutations, which means taking feature subsets and generating the difference in predictions with and without a feature present. For faster calculation, we can reduce the sample size to a smaller set, let's say 1,000 or 2,000 records. In the previous script, we are using the entire population of 19,735 records to calculate the SHAP values. This part of the script can also be improved by applying Python multiprocessing, which is beyond the scope of this chapter.
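A minimal sketch of the subsampling idea follows; the background sample of 1,000 rows and the slice of 2,000 records are arbitrary choices, not values prescribed by the SHAP library.
# sketch: use a smaller background sample and explain only a slice of the data
import shap
X_background = shap.sample(X, 1000)                # random subsample as background data
explainer_fast = shap.Explainer(model.predict, X_background)
shap_values_fast = explainer_fast(X.iloc[:2000])   # explain the first 2,000 records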

The SHAP value for a specific feature i is just the difference between the expected model output and the partial dependence plot at the feature's value x_i. One of the fundamental properties of Shapley values is that they always sum up to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models, this means that the SHAP values of all the input features will always sum up to the difference between the baseline (expected) model output and the current model output for the prediction being explained.
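This additivity property can be checked directly on the explanation object computed earlier; the following is a small sketch for the first record.
# sanity check: base value + sum of SHAP values should reproduce the prediction
import numpy as np
reconstructed = shap_values.base_values[0] + shap_values.values[0].sum()
print(np.round(reconstructed, 3), np.round(model.predict(X)[0], 3))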

The SHAP explanation object has three components: (a) the SHAP value for each feature, (b) the base value, and (c) the original training data. As there are 27 features, we can expect 27 SHAP values per record.
import numpy as np
pd.DataFrame(np.round(shap_values.values,3)).head(3)

The output is a table with 3 rows and 27 columns: one SHAP value per feature for the first three records.

# average prediction value is called as the base value
pd.DataFrame(np.round(shap_values.base_values,3)).head(3)

        0
0  97.484
1  97.494
2  97.494

pd.DataFrame(np.round(shap_values.data,3)).head(3)

The output is a table with 3 rows and 27 columns: the original feature values for the first three records.

Recipe 2-2. SHAP Partial Dependency Plot for a Regression Model

Problem

You want to get a partial dependency plot from SHAP.

Solution

The solution to this problem is to use the partial dependence plot method (shap.partial_dependence_plot) from the SHAP library.

How It Works

Let’s take a look at the following example. There are two ways to get the partial dependency plot, one with a particular data point superimposed and the other without any reference to the data point. See Figure 2-1.
# make a standard partial dependence plot for lights on predicted output for row number 20 from the training dataset.
sample_ind = 20
shap.partial_dependence_plot(
    "lights", model.predict, X, model_expected_value=True,
    feature_expected_value=True, ice=False,
    shap_values=shap_values[sample_ind:sample_ind+1,:]
)


Figure 2-1

Correlation between feature light and predicted output of the model

The partial dependency plot is a way to explain individual predictions and generate local interpretations for a sample selected from the dataset; in this case, the 20th record from the training dataset is selected. Figure 2-1 shows the partial dependency plot superimposed with the 20th record in red.
shap.partial_dependence_plot(
    "lights", model.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)


Figure 2-2

Partial dependency plot between lights and predicted outcome from the model

# the waterfall_plot shows how we get from shap_values.base_values to model.predict(X)[sample_ind]
shap.plots.waterfall(shap_values[sample_ind], max_display=14)


Figure 2-3

Local interpretation for record number 20

The local interpretation for record number 20 from the training dataset is displayed in Figure 2-3. The predicted output for the 20th record is 140 Wh. The most influential features impacting the 20th record's prediction are RH_1, the humidity in the kitchen area in percent, and RH_2, the humidity in the living room area. At the bottom of Figure 2-3, the remaining 14 features, which are not very important for the 20th record's predicted value, are collapsed into a single row.
X[20:21]
model.predict(X[20:21])
array([140.26911466])

Recipe 2-3. SHAP Feature Importance for Regression Model with All Numerical Input Variables

Problem

You want to calculate the feature importance using the SHAP values.

Solution

The solution to this problem is to use SHAP absolute values from the model.

How It Works

Let's take a look at the following example. SHAP values can be used to show the global importance of features, that is, which features carry the most weight in predicting the output.
#computing shap importance values for the linear model
import numpy as np
feature_names = shap_values.feature_names
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
vals = np.abs(shap_df.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
print(shap_importance)
       col_name  feature_importance_vals
2          RH_1                49.530061
19        T_out                43.828847
4          RH_2                42.911069
5            T3                41.671587
11           T6                34.653893
3            T2                31.097282
17           T9                26.607721
16         RH_8                19.920029
24    Tdewpoint                17.443688
21       RH_out                13.044643
6          RH_3                13.042064
15           T8                12.803450
0        lights                11.907603
12         RH_6                 7.806188
14         RH_7                 6.578015
7            T4                 5.866801
22    Windspeed                 3.361895
13           T7                 3.182072
18         RH_9                 3.041144
23   Visibility                 1.385616
10         RH_5                 0.855398
20  Press_mm_hg                 0.823456
1            T1                 0.765753
8          RH_4                 0.642723
25          rv1                 0.260885
26          rv2                 0.260885
9            T5                 0.041905

The feature importance values are not scaled; hence, the values from all features will not total 100.
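If a percentage view is preferred, the mean absolute SHAP values can be rescaled so that they total 100. This is only a presentational choice; the new column name below is arbitrary.
# rescale the importance values so that they sum to 100
shap_importance['importance_pct'] = (100 * shap_importance['feature_importance_vals']
                                     / shap_importance['feature_importance_vals'].sum())
print(shap_importance.head(10))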

The beeswarm chart in Figure 2-4 shows the impact of SHAP values on model output. The blue dot shows a low feature value, and a red dot shows a high feature value. Each dot indicates one data point from the dataset. The beeswarm plot shows the distribution of feature values against the SHAP values.
shap.plots.beeswarm(shap_values)


Figure 2-4

Impact on model output

Recipe 2-4. SHAP Values for a Regression Model on All Mixed Input Variables

Problem

You want to estimate SHAP values when categorical variables are introduced along with the numerical variables, giving a mixed set of input features.

Solution

The solution is that the mixed input variables that have numeric features as well as categorical or binary features can be modeled together. As the number of features increases, the time to compute all the permutations will also increase.

How It Works

We are going to use an automobile public dataset with some modifications. The objective is to predict the price of a vehicle given the features such as make, location, age, etc. It is a regression problem that we are going to solve using a mix of numeric and categorical features.
df = pd.read_csv('https://raw.githubusercontent.com/pradmishra1/PublicDatasets/main/automobile.csv')
df.head(3)
df.columns
Index(['Price', 'Make', 'Location', 'Age', 'Odometer', 'FuelType', 'Transmission', 'OwnerType', 'Mileage', 'EngineCC', 'PowerBhp'], dtype='object')
We cannot use string-based or categorical features in the model directly, as matrix multiplication is not possible on string features; hence, the string-based features need to be transformed into dummy variables, or binary features with 0 and 1 flags. The transformation step is skipped here because many data scientists already know how to do this data transformation. We are importing another, already transformed, dataset directly.
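The following is a hedged sketch of what that transformation might look like using pd.get_dummies. The exact steps behind Automobile_transformed.csv are not shown in this recipe, so the choice of columns to encode, the removal of Make, and the drop_first setting are assumptions based on the transformed column names.
# sketch of the dummy-variable transformation (assumed, not the exact preparation script)
df_dummy = pd.get_dummies(
    df.drop(columns=['Make']),                       # Make appears to be excluded downstream
    columns=['Location', 'FuelType', 'Transmission', 'OwnerType'],
    drop_first=True                                  # drop one level per category to avoid redundancy
)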
df_t = pd.read_csv('https://raw.githubusercontent.com/pradmishra1/PublicDatasets/main/Automobile_transformed.csv')
del df_t['Unnamed: 0']
df_t.head(3)
df_t.columns
Index(['Price', 'Age', 'Odometer', 'mileage', 'engineCC', 'powerBhp', 'Location_Bangalore', 'Location_Chennai', 'Location_Coimbatore', 'Location_Delhi', 'Location_Hyderabad', 'Location_Jaipur', 'Location_Kochi', 'Location_Kolkata', 'Location_Mumbai', 'Location_Pune', 'FuelType_Diesel', 'FuelType_Electric', 'FuelType_LPG', 'FuelType_Petrol', 'Transmission_Manual', 'OwnerType_Fourth +ACY- Above', 'OwnerType_Second', 'OwnerType_Third'], dtype='object')
#y is the dependent variable, that we need to predict
y = df_t.pop('Price')
# X is the set of input features
X = df_t
import pandas as pd
import shap
import sklearn
# a simple linear model initialized
model = sklearn.linear_model.LinearRegression()
# linear regression model trained
model.fit(X, y)
print("Model coefficients: ")
for i in range(X.shape[1]):
    print(X.columns[i], "=", model.coef_[i].round(5))
Model coefficients:
Age = -0.92281
Odometer = 0.0
mileage = -0.07923
engineCC = -4e-05
powerBhp = 0.1356
Location_Bangalore = 2.00658
Location_Chennai = 0.94944
Location_Coimbatore = 2.23592
Location_Delhi = -0.29837
Location_Hyderabad = 1.8771
Location_Jaipur = 0.8738
Location_Kochi = 0.03311
Location_Kolkata = -0.86024
Location_Mumbai = -0.81593
Location_Pune = 0.33843
FuelType_Diesel = -1.2545
FuelType_Electric = 7.03139
FuelType_LPG = 0.79077
FuelType_Petrol = -2.8691
Transmission_Manual = -2.92415
OwnerType_Fourth +ACY- Above = 1.7104
OwnerType_Second = -0.55923
OwnerType_Third = 0.76687
To compute the SHAP values, we can use the explainer function with the training dataset X and model predict function. The SHAP value calculation happens using a permutation approach; it took 5 minutes.
# compute the SHAP values for the linear model
explainer = shap.Explainer(model.predict, X)
# SHAP value calculation
shap_values = explainer(X)
Permutation explainer: 6020it [05:14, 18.59it/s]
import numpy as np
pd.DataFrame(np.round(shap_values.values,3)).head(3)

The output is a table with 3 rows and 23 columns: one SHAP value per feature for the first three records.

# average prediction value is called as the base value
pd.DataFrame(np.round(shap_values.base_values,3)).head(3)
 

        0
0  11.933
1  11.933
2  11.933

pd.DataFrame(np.round(shap_values.data,3)).head(3)

The output is a table with 3 rows and 23 columns: the original feature values for the first three records.

Recipe 2-5. SHAP Partial Dependency Plot for Regression Model for Mixed Input

Problem

You want to plot the partial dependency plot and interpret the graph for numeric and categorical dummy variables.

Solution

The partial dependency plot shows the correlation between a feature and the predicted output of the target variable. There are two ways we can showcase the results: one plots a feature against the expected value of the prediction function, and the other superimposes a data point on the partial dependency plot.

How It Works

Let’s take a look at the following example (see Figure 2-5):
shap.partial_dependence_plot(
    "powerBhp", model.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)


Figure 2-5

Partial dependency plot for powerBhp and predicted price of the vehicle

The linear blue line shows the positive correlation between the price and the powerBhp. The powerBhp is a strong feature: the higher the bhp, the higher the price of the car. This is a continuous or numeric feature; let's look at the binary or dummy features. There are two dummy features indicating whether the car is registered in the Bangalore location or the Kolkata location. See Figure 2-6.
shap.partial_dependence_plot(
    "Location_Bangalore", model.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)


Figure 2-6

Dummy variable Bangalore location versus SHAP value

If the location of the car is Bangalore, then the price would be higher, and vice versa. See Figure 2-7.
shap.partial_dependence_plot(
    "Location_Kolkata", model.predict, X, ice=False,
    model_expected_value=True, feature_expected_value=True
)


Figure 2-7

Dummy variable Location_Kolkata versus SHAP value

If the location is Kolkata, then the price is expected to be lower. The reason for the difference between the two locations lies in the data that was used to train the model. The previous three figures show the global importance of a feature versus the prediction function. As an example, only two features are taken into consideration; we can plot all features one by one, as sketched next, to get more understanding about the predictions.
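The following is a small sketch of looping over a handful of features; the feature list is an arbitrary selection.
# generate the same plot for several features in a loop
for col in ['powerBhp', 'Age', 'Location_Bangalore', 'Location_Kolkata']:
    shap.partial_dependence_plot(
        col, model.predict, X, ice=False,
        model_expected_value=True, feature_expected_value=True
    )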

Now let’s look at a sample data point superimposed on a partial dependence plot to display local explanations. See Figure 2-8.
# make a standard partial dependence plot for powerBhp on the predicted output
sample_ind = 20 #20th record from the dataset
shap.partial_dependence_plot(
    "powerBhp", model.predict, X, model_expected_value=True,
    feature_expected_value=True, ice=False,
    shap_values=shap_values[sample_ind:sample_ind+1,:]
)


Figure 2-8

Power bhp versus prediction function

The vertical dotted line shows the average powerBhp, and the horizontal dotted line shows the average predicted value by the model. The small blue bar dropping from the black dot reflects the placement of record number 20 from the dataset. Local interpretation means that for any sample record from the dataset, we should be able to explain the predictions. Figure 2-9 shows the importance of features corresponding to each record in the dataset.
# the waterfall_plot shows how we get from shap_values.base_values to model.predict(X)[sample_ind]
shap.plots.waterfall(shap_values[sample_ind], max_display=14)


Figure 2-9

Local interpretation of the 20th record and corresponding feature importance

For the 20th record, the predicted price is 22.542; the powerBhp stands out as the most important feature, and manual transmission is the second most important feature.
X[20:21]
model.predict(X[20:21])
array([22.54213017])

Recipe 2-6. SHAP Feature Importance for a Regression Model with All Mixed Input Variables

Problem

You want to get the global feature importance from SHAP values using mixed-input feature data.

Solution

The solution to this problem is to take the mean absolute SHAP values and sort them in descending order.

How It Works

Let’s take a look at the following example:
#computing shap importance values for the linear model
import numpy as np
# feature names from the training data
feature_names = shap_values.feature_names
#combining the shap values with feature names
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
#taking the absolute shap values
vals = np.abs(shap_df.values).mean(0)
#creating a dataframe view
shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
#sorting the importance values
shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
print(shap_importance)
col_name  feature_importance_vals
4                       powerBhp                 6.057831
0                            Age                 2.338342
18               FuelType_Petrol                 1.406920
19           Transmission_Manual                 1.249077
15               FuelType_Diesel                 0.618288
7            Location_Coimbatore                 0.430233
9             Location_Hyderabad                 0.401118
2                        mileage                 0.270872
13               Location_Mumbai                 0.227442
5             Location_Bangalore                 0.154706
21              OwnerType_Second                 0.154429
6               Location_Chennai                 0.133476
10               Location_Jaipur                 0.127807
12              Location_Kolkata                 0.111829
14                 Location_Pune                 0.051082
8                 Location_Delhi                 0.049372
22               OwnerType_Third                 0.021778
3                       engineCC                 0.020145
1                       Odometer                 0.009602
11                Location_Kochi                 0.007474
20  OwnerType_Fourth +ACY- Above                 0.002557
16             FuelType_Electric                 0.002336
17                  FuelType_LPG                 0.001314

At a high level, for the linear model used to predict the price of the automobiles, the features listed previously are important, the highest being powerBhp, the age of the car, the petrol fuel type, and the manual transmission type. The previous tabular output shows global feature importance.

Recipe 2-7. SHAP Strength for Mixed Features on the Predicted Output for Regression Models

Problem

You want to know the impact of a feature on the model function.

Solution

The solution to this problem is to use a beeswarm plot that displays the blue and red points.

How It Works

Let's take a look at the following example (see Figure 2-10). From the beeswarm plot, there is a positive relationship between powerBhp and the SHAP value; however, there is a negative correlation between the age of a car and the price of the car. As the powerBhp value increases from low to high, the SHAP value increases, and vice versa. The trend is the opposite for the age feature.
shap.plots.beeswarm(shap_values)


Figure 2-10

The SHAP value impact on the model output

Recipe 2-8. SHAP Values for a Regression Model on Scaled Data

Problem

You want to know whether getting SHAP values on scaled data is better than getting them on the unscaled numerical data.

Solution

The solution to this problem is to use a numerical dataset and generate local and global explanations after applying the standard scaler to the data.

How It Works

Let’s take a look at the following script:
import pandas as pd
df_lin_reg = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv')
del df_lin_reg['date']
#y is the dependent variable, that we need to predict
y = df_lin_reg.pop('Appliances')
# X is the set of input features
X = df_lin_reg
import pandas as pd
import shap
import sklearn
#create standardized features
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(X)
#transform the dataset
X_std = scaler.transform(X)
# a simple linear model initialized
model = sklearn.linear_model.LinearRegression()
# linear regression model trained
model.fit(X_std, y)
print("Model coefficients: ")
for i in range(X.shape[1]):
    print(X.columns[i], "=", model.coef_[i].round(5))
Model coefficients:
lights = 15.7899
T1 = -0.96962
RH_1 = 60.29926
T2 = -38.82785
RH_2 = -54.8622
T3 = 50.96675
RH_3 = 16.02699
T4 = -7.07893
RH_4 = -0.77668
T5 = -0.05136
RH_5 = 1.27172
T6 = 43.3997
RH_6 = 8.96929
T7 = 3.78656
RH_7 = -7.92521
T8 = 15.93559
RH_8 = -24.39546
T9 = -31.97757
RH_9 = -3.74049
T_out = -54.38609
Press_mm_hg = 1.03483
RH_out = -15.85058
Windspeed = 4.17588
Visibility = 1.81258
Tdewpoint = 21.17741
rv1 = -0.30118
rv2 = -0.30118
# compute the SHAP values for the linear model
explainer = shap.Explainer(model.predict, X_std)
# SHAP value calculation
shap_values = explainer(X_std)
Permutation explainer: 19736it [08:53, 36.22it/s]
It is faster to get results from the SHAP explainer when using the standardized data. The SHAP values change slightly, but there are no major differences.
 

              Permutation explainer   Time
Unscaled data 19736it                 15:22, 21.23it/s
Scaled data   19736it                 08:53, 36.22it/s
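If the coefficients need to be read in the original feature units, they can be related back by dividing each standardized coefficient by that feature's standard deviation. The following is a minimal sketch using the scaler fitted earlier.
# convert coefficients of the standardized model back to the original feature units
import numpy as np
unscaled_coef = model.coef_ / scaler.scale_
for name, coef in zip(X.columns, np.round(unscaled_coef, 5)):
    print(name, '=', coef)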

#computing shap importance values for the linear model
import numpy as np
# feature names from the training data
feature_names = X.columns
#combining the shap values with feature names
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
#taking the absolute shap values
vals = np.abs(shap_df.values).mean(0)
#creating a dataframe view
shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
#sorting the importance values
shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
print(shap_importance)
       col_name  feature_importance_vals
2          RH_1                49.530061
19        T_out                43.828847
4          RH_2                42.911069
5            T3                41.671587
11           T6                34.653893
3            T2                31.097282
17           T9                26.607721
16         RH_8                19.920029
24    Tdewpoint                17.443688
21       RH_out                13.044643
6          RH_3                13.042064
15           T8                12.803450
0        lights                11.907603
12         RH_6                 7.806188
14         RH_7                 6.578015
7            T4                 5.866801
22    Windspeed                 3.361895
13           T7                 3.182072
18         RH_9                 3.041144
23   Visibility                 1.385616
10         RH_5                 0.855398
20  Press_mm_hg                 0.823456
1            T1                 0.765753
8          RH_4                 0.642723
25          rv1                 0.260885
26          rv2                 0.260885
9            T5                 0.041905

Recipe 2-9. LIME Explainer for Tabular Data

Problem

You want to know how to generate explainability at a local level in a focused manner rather than at a global level.

Solution

The solution to this problem is to use the LIME library. LIME is a model-agnostic technique: it fits a simple local surrogate model on perturbed samples around the instance of interest, using the trained model's predictions as labels. LIME localizes a problem and explains the model at that local level.

How It Works

Let’s take a look at the following example. LIME requires a numpy array as an input to the tabular explainer; hence, the Pandas dataframe needs to be transformed into an array.
!pip install lime
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X),
                                                   mode='regression',
                                                   feature_names=X.columns,
                                                   class_names=['price'],
                                                   verbose=True)
We are using the energy prediction data from earlier in this chapter.
explainer.feature_selection   # the feature selection strategy used by the explainer
# asking for an explanation from the LIME model for the 60th record
i = 60
exp = explainer.explain_instance(np.array(X)[i],
                                 model.predict,
                                 num_features=14
                                )
model.predict(X)[60]
X[60:61]
Intercept -142.75931081140854
Prediction_local [-492.87528974]
Right: -585.148657732673
exp.show_in_notebook(show_table=True)


Figure 2-11

Local explanation for the 60th record from the dataset

exp.as_list()
[('RH_6 > 83.23', 464.95860873125986), ('RH_1 > 43.07', 444.5520820612734), ('RH_2 > 43.26', -373.10130212185885), ('RH_out > 91.67', -318.85242557316906), ('RH_8 > 46.54', -268.93915670002696), ('lights <= 0.00', -250.2220287090558), ('T3 <= 20.79', -167.06955734678837), ('3.67 < T_out <= 6.92', 131.73980385122888), ('3.63 < T6 <= 7.30', -103.65788170866274), ('T9 <= 18.00', 93.3237211878042), ('RH_7 > 39.00', -79.9838215229673), ('RH_3 > 41.76', 78.2163751694391), ('T8 <= 20.79', -45.00198774806178), ('18.79 < T2 <= 20.00', 43.92159150217912)]

Recipe 2-10. ELI5 Explainer for Tabular Data

Problem

You want to use the ELI5 library for generating explanations of a linear regression model.

Solution

ELI5 is a Python package that helps to debug a machine learning model and explain the predictions. It provides support for all machine learning models supported by the scikit-learn library.

How It Works

Let’s take a look at the following script:
pip install eli5
import eli5
eli5.show_weights(model,
                 feature_names=list(X.columns))
y top features

Weight      Feature
+97.695     <BIAS>
+60.299     RH_1
+50.967     T3
+43.400     T6
+21.177     Tdewpoint
+16.027     RH_3
+15.936     T8
+15.790     lights
+8.969      RH_6
+4.176      Windspeed
+3.787      T7
… 3 more positive …
… 5 more negative …
-3.740      RH_9
-7.079      T4
-7.925      RH_7
-15.851     RH_out
-24.395     RH_8
-31.978     T9
-38.828     T2
-54.386     T_out
-54.862     RH_2

eli5.explain_weights(model, feature_names=list(X.columns))
eli5.explain_prediction(model,X.iloc[60])
from eli5.sklearn import PermutationImportance
# a simple linear model initialized
model = sklearn.linear_model.LinearRegression()
# linear regression model trained
model.fit(X, y)
perm = PermutationImportance(model)
perm.fit(X, y)
eli5.show_weights(perm,feature_names=list(X.columns))

The results table has a BIAS value as a feature. This can be interpreted as the intercept term of the linear regression model. The other features are listed in descending order of importance based on their weights. The show_weights function provides a global interpretation of the model, and the explain_prediction function provides a local interpretation by taking a single record from the training set into account.

Recipe 2-11. How the Permutation Model in ELI5 Works

Problem

You want to make sense of the ELI5 permutation library.

Solution

The solution to this problem is to use a dataset and a trained model.

How It Works

The permutation model in the ELI5 library works only for global interpretation. First, it takes a baseline linear regression model trained on the dataset and computes its error. Then it shuffles the values of one feature, scores the trained model again, and computes the new error. It compares the error before and after shuffling: a feature can be considered important if the error increases substantially after shuffling and unimportant if the increase is small. The result displays the average importance of each feature and its standard deviation over multiple shuffle steps.
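The following is a minimal sketch of that shuffle-and-score idea applied to the regression model trained earlier; the function name, the error metric, and the number of repeats are illustrative choices, not ELI5's exact implementation.
# sketch of permutation importance: shuffle a feature, re-score, compare the error
import numpy as np
from sklearn.metrics import mean_squared_error

def permutation_importance_sketch(model, X, y, n_repeats=3, seed=0):
    rng = np.random.RandomState(seed)
    base_error = mean_squared_error(y, model.predict(X))
    results = {}
    for col in X.columns:
        deltas = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            X_shuffled[col] = rng.permutation(X_shuffled[col].values)
            deltas.append(mean_squared_error(y, model.predict(X_shuffled)) - base_error)
        # a large increase in error after shuffling means the feature is important
        results[col] = (np.mean(deltas), np.std(deltas))
    return results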

Recipe 2-12. Global Explanation for Logistic Regression Models

Problem

You want to explain the predictions generated from a logistic regression model.

Solution

The logistic regression model is also known as a classification model, as we model the probabilities of a binary or a multinomial classification variable. In this particular recipe, we are using a churn classification dataset that has two outcomes: whether the customer is likely to churn or not.

How It Works

Let's take a look at the following example. The key is to get the SHAP values, which return the base values, the SHAP values themselves, and the data. Using the SHAP values, we can create various explanations using graphs and figures. The SHAP values are computed per record and can be aggregated into global explanations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import shap
%matplotlib inline
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import confusion_matrix, classification_report
df_train = pd.read_csv('https://raw.githubusercontent.com/pradmishra1/PublicDatasets/main/ChurnData_test.csv')
from sklearn.preprocessing import LabelEncoder
tras = LabelEncoder()
df_train['area_code_tr'] = tras.fit_transform(df_train['area_code'])
df_train.columns
del df_train['area_code']
df_train.columns
df_train['target_churn_dum'] = pd.get_dummies(df_train.churn,prefix='churn',drop_first=True)
df_train.columns
del df_train['international_plan']
del df_train['voice_mail_plan']
del df_train['churn']
df_train.info()
del df_train['Unnamed: 0']
df_train.columns
from sklearn.model_selection import train_test_split
df_train.columns
X = df_train[['account_length', 'number_vmail_messages', 'total_day_minutes',
       'total_day_calls', 'total_day_charge', 'total_eve_minutes',
       'total_eve_calls', 'total_eve_charge', 'total_night_minutes',
       'total_night_calls', 'total_night_charge', 'total_intl_minutes',
       'total_intl_calls', 'total_intl_charge',
       'number_customer_service_calls', 'area_code_tr']]
Y = df_train['target_churn_dum']
xtrain,xtest,ytrain,ytest=train_test_split(X,Y,test_size=0.20,stratify=Y)
log_model = LogisticRegression()
log_model.fit(xtrain,ytrain)
print("training accuracy:", log_model.score(xtrain,ytrain)) #training accuracy
print("test accuracy:",log_model.score(xtest,ytest)) # test accuracy
# Provide Probability as Output
def model_churn_proba(x):
    return log_model.predict_proba(x)[:,1]
# Provide Log Odds as Output
def model_churn_log_odds(x):
    p = log_model.predict_log_proba(x)
    return p[:,1] - p[:,0]
# compute the SHAP values for the linear model
background_churn = shap.maskers.Independent(X, max_samples=2000)
explainer = shap.Explainer(log_model, background_churn,feature_names=list(X.columns))
shap_values_churn = explainer(X)
shap_values_churn
.values =
array([[-5.68387743e-03,  2.59884057e-01, -1.12707664e+00, ...,
         1.70015539e-04,  6.35113804e-01, -5.98927431e-03],
       [-9.26328584e-02,  2.59884057e-01,  4.31613190e-01, ...,
        -4.82342680e-04, -7.11876922e-01, -5.98927431e-03],
       [-1.05143764e-02, -8.06452301e-01,  1.15736857e+00, ...,
         2.05960486e-03, -2.62880014e-01,  5.88245015e-03],
       ...,
       [ 9.09261014e-02,  2.59884057e-01, -4.15611799e-01, ...,
         1.99211953e-03, -2.62880014e-01, -5.34120777e-05],
       [-2.50058732e-02,  2.59884057e-01,  7.63911460e-02, ...,
        -1.08971068e-03, -7.11876922e-01, -5.98927431e-03],
       [ 3.05448646e-02, -9.90303397e-01, -5.29936135e-01, ...,
        -6.17313346e-04, -7.11876922e-01, -5.34120777e-05]])
.base_values =
array([-2.18079251, -2.18079251, -2.18079251, ..., -2.18079251,
       -2.18079251, -2.18079251])
.data =
array([[101.  ,   0.  ,  70.9 , ...,   2.86,   3.  ,   2.  ],
       [137.  ,   0.  , 223.6 , ...,   2.57,   0.  ,   2.  ],
       [103.  ,  29.  , 294.7 , ...,   3.7 ,   1.  ,   0.  ],
       ...,
       [ 61.  ,   0.  , 140.6 , ...,   3.67,   1.  ,   1.  ],
       [109.  ,   0.  , 188.8 , ...,   2.3 ,   0.  ,   2.  ],
       [ 86.  ,  34.  , 129.4 , ...,   2.51,   0.  ,   1.  ]])
shap_values = pd.DataFrame(shap_values_churn.values)
shap_values.columns = list(X.columns)
shap_values

The output is a table of SHAP values with one column per input feature (account_length, number_vmail_messages, total_day_minutes, total_day_calls, total_day_charge, total_eve_minutes, total_eve_calls, total_eve_charge, and so on).

# compute the SHAP values for the linear model
explainer_log_odds = shap.Explainer(log_model, background_churn,feature_names=list(X.columns))
shap_values_churn_log_odds = explainer_log_odds(X)
shap_values_churn_log_odds

Recipe 2-13. Partial Dependency Plot for a Classifier

Problem

You want to show feature associations with the class probabilities.

Solution

The class probabilities in this example are related to predicting the probability of churn. The SHAP value for a feature can be plotted against the feature value to show a scatter chart that displays the correlation (positive or negative) and strength of associations.

How It Works

Let’s take a look at the following script:
shap.plots.scatter(shap_values_churn[:,'account_length'])
Figure 2-12 shows the relationship between the account_length variable and its SHAP values.


Figure 2-12

Account length and SHAP value of account length

# make a standard partial dependence plot
sample_ind = 25
fig,ax = shap.partial_dependence_plot(
    "number_vmail_messages", model_churn_proba, X, model_expected_value=True,
    feature_expected_value=True, show=False,ice=False)
Figure 2-13 shows the partial dependence of the predicted churn probability on the number_vmail_messages feature.


Figure 2-13

Number of voicemail messages and their shap values

shap.plots.bar(shap_values_churn_log_odds)


Figure 2-14

Mean absolute shap values of all features

Recipe 2-14. Global Feature Importance from the Classifier

Problem

You want to get the global feature importance for the logistic regression model.

Solution

The solution to this problem is to use a bar plot, a beeswarm plot, and a heat map.

How It Works

Let’s take a look at the following script (see Figure 2-15 and Figure 2-16):
shap.plots.beeswarm(shap_values_churn_log_odds)


Figure 2-15

SHAP value impact on the model output

shap.plots.heatmap(shap_values_churn_log_odds[:1000])


Figure 2-16

Heat map for SHAP value and positive and negative feature contributions

temp_df = pd.DataFrame()
temp_df['Feature Name'] = pd.Series(X.columns)
temp_df['Coefficients'] = pd.Series(log_model.coef_.flatten())
temp_df.sort_values(by='Coefficients',ascending=False)

The interpretation goes like this: when we increase the value of a feature by one unit, the odds produced by the model change by a factor of exp(coefficient); that is, we look at the ratio of the odds after and before the change. From the global feature importance, three features stand out: the number of customer service calls, the total day minutes, and the number of voicemail messages.
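The odds ratios can be computed directly from the fitted coefficients, as in the following small sketch; the dataframe name is an arbitrary choice.
# exp(coefficient) is the multiplicative change in the odds of churn per unit increase
odds_ratios = pd.DataFrame({'Feature Name': X.columns,
                            'Odds Ratio': np.exp(log_model.coef_.flatten())})
odds_ratios.sort_values(by='Odds Ratio', ascending=False)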

Recipe 2-15. Local Explanations Using LIME

Problem

You want to get explanations faster, at both the global and the local level.

Solution

Model explanation can be done using SHAP; however, one of the limitations of SHAP is that creating global and local explanations on the full data is computationally expensive, and even when we decide to use the full data, it usually takes a long time. Hence, LIME is very useful to speed up the process of generating local and global explanations in a scenario where millions of records are being used to train a model.

How It Works

Let’s take a look at the following script:
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(xtrain),
                    feature_names=list(xtrain.columns),
                    class_names=['target_churn_dum'],
                    verbose=True, mode='classification')
# this record is a no churn scenario
exp = explainer.explain_instance(xtest.iloc[0], log_model.predict_proba, num_features=16)
exp.as_list()
Intercept -0.005325152786766457
Prediction_local [0.38147987]
Right: 0.32177492114146566
X does not have valid feature names, but LogisticRegression was fitted with feature names
[('number_customer_service_calls > 2.00', 0.1530891322197175),
 ('total_day_minutes > 213.80', 0.11114575899827552),
 ('number_vmail_messages <= 0.00', 0.09610037835765535),
 ('total_intl_calls <= 3.00', 0.03177016778340472),
 ('total_day_calls <= 86.00', 0.029375047698073507),
 ('99.00 < total_night_calls <= 113.00', -0.023964881054121437),
 ('account_length > 126.00', -0.015756474385902122),
 ('88.00 < total_eve_calls <= 101.00', 0.008756083756550214),
 ('total_intl_minutes <= 8.60', -0.007205495334049559),
 ('200.00 < total_eve_minutes <= 232.00', 0.004122691218360631),
 ('total_intl_charge <= 2.32', -0.0013747713519713068),
 ('total_day_charge > 36.35', 0.0010811737941700244),
 ('200.20 < total_night_minutes <= 234.80', -0.00013400510199346275),
 ('0.00 < area_code_tr <= 1.00', -8.127174069198377e-05),
 ('9.01 < total_night_charge <= 10.57', -6.668417986225894e-05),
 ('17.00 < total_eve_charge <= 19.72', -5.18320207196282e-05)]
pd.DataFrame(exp.as_list())

                                          0         1
0     number_customer_service_calls > 2.00  0.153089
1                total_day_minutes > 213.80  0.111146
2             number_vmail_messages <= 0.00  0.096100
3                  total_intl_calls <= 3.00  0.031770
4                  total_day_calls <= 86.00  0.029375
5      99.00 < total_night_calls <= 113.00 -0.023965
6                   account_length > 126.00 -0.015756
7        88.00 < total_eve_calls <= 101.00   0.008756
8                total_intl_minutes <= 8.60 -0.007205
9     200.00 < total_eve_minutes <= 232.00   0.004123
10                total_intl_charge <= 2.32 -0.001375
11                 total_day_charge > 36.35  0.001081
12  200.20 < total_night_minutes <= 234.80 -0.000134
13             0.00 < area_code_tr <= 1.00 -0.000081
14      9.01 < total_night_charge <= 10.57 -0.000067
15        17.00 < total_eve_charge <= 19.72 -0.000052

exp.show_in_notebook(show_table=True)


Figure 2-17

Local explanation for record number 1

# This is a churn scenario
exp = explainer.explain_instance(xtest.iloc[20], log_model.predict_proba, num_features=16)
exp.as_list()
Intercept -0.02171544428872446
Prediction_local [0.44363396]
Right: 0.4309152994720991
X does not have valid feature names, but LogisticRegression was fitted with feature names
[('number_customer_service_calls > 2.00', 0.15255665525554568),
 ('total_day_minutes > 213.80', 0.11572355524257688),
 ('number_vmail_messages <= 0.00', 0.09656802173637159),
 ('total_night_calls <= 86.00', 0.07347814323553245),
 ('total_day_calls <= 86.00', 0.03143722302975322),
 ('total_eve_minutes <= 166.20', -0.016279347282555784),
 ('88.00 < total_eve_calls <= 101.00', 0.01202796623602075),
 ('4.00 < total_intl_calls <= 5.00', -0.008862308197327355),
 ('72.00 < account_length <= 98.00', 0.008095316213066618),
 ('total_intl_minutes > 12.00', 0.004036225959225672),
 ('200.20 < total_night_minutes <= 234.80', 0.0031930707578459207),
 ('total_intl_charge > 3.24', -0.0025561403383019586),
 ('total_day_charge > 36.35', -0.0021799602467677667),
 ('9.01 < total_night_charge <= 10.57', -0.001598247181850764),
 ('total_eve_charge <= 14.13', -0.001066803177182677),
 ('area_code_tr > 1.00', 0.0007760299764712853)]

In a similar fashion, explanations can be generated for different data points from the training sample as well as the test sample, as sketched next.
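The following is a minimal sketch of looping over a few test records; the chosen indices are arbitrary examples.
# repeat the local explanation for several test records
for idx in [0, 5, 20]:
    exp = explainer.explain_instance(xtest.iloc[idx], log_model.predict_proba,
                                     num_features=16)
    print('Record', idx, exp.as_list()[:3])   # top three contributing rules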

Recipe 2-16. Model Explanations Using ELI5

Problem

You want to get model explanations using the ELI5 library.

Solution

ELI5 provides two functions to show weights and make predictions to generate model explanations.

How It Works

Let’s take a look at the following script:
eli5.show_weights(log_model,
                 feature_names=list(xtrain.columns))
y=1 top features

Weight      Feature
+0.449      number_customer_service_calls
+0.010      total_day_minutes
+0.009      total_intl_minutes
+0.002      total_intl_charge
+0.002      total_eve_minutes
+0.001      total_day_charge
+0.000      total_eve_charge
-0.000      total_night_charge
-0.001      total_night_minutes
-0.002      account_length
-0.006      area_code_tr
-0.008      total_day_calls
-0.017      total_eve_calls
-0.017      total_night_calls
-0.034      <BIAS>
-0.037      number_vmail_messages
-0.087      total_intl_calls

eli5.explain_weights(log_model, feature_names=list(xtrain.columns))
The output of explain_weights is identical to the table produced by show_weights shown previously.

eli5.explain_prediction(log_model,xtrain.iloc[60])
y=0 (probability 0.788, score -1.310) top features

Contribution  Feature
+2.458        total_night_calls
+1.289        total_eve_calls
+0.698        total_day_calls
+0.304        account_length
+0.174        total_intl_calls
+0.127        total_night_minutes
+0.034        <BIAS>
+0.006        area_code_tr
+0.002        total_night_charge
-0.004        total_intl_charge
-0.005        total_eve_charge
-0.057        total_intl_minutes
-0.064        total_day_charge
-0.304        total_eve_minutes
-0.449        number_customer_service_calls
-2.899        total_day_minutes

from eli5.sklearn import PermutationImportance
perm = PermutationImportance(log_model)
perm.fit(xtest, ytest)
eli5.show_weights(perm,feature_names=list(xtrain.columns))

Weight             Feature
0.0066 ± 0.0139    number_customer_service_calls
0.0066 ± 0.0024    number_vmail_messages
0.0030 ± 0.0085    total_eve_calls
0.0030 ± 0.0085    total_day_minutes
0.0006 ± 0.0088    total_day_calls
0 ± 0.0000         area_code_tr
0 ± 0.0000         total_intl_charge
0 ± 0.0000         total_night_charge
0 ± 0.0000         total_eve_charge
-0.0012 ± 0.0048   total_intl_calls
-0.0012 ± 0.0029   total_intl_minutes
-0.0024 ± 0.0096   account_length
-0.0024 ± 0.0024   total_day_charge
-0.0036 ± 0.0045   total_night_minutes
-0.0042 ± 0.0061   total_eve_minutes
-0.0048 ± 0.0072   total_night_calls

Conclusion

In this chapter, we covered how to interpret linear supervised models for regression and classification. Linear models are simpler to interpret at a global level, meaning at the feature importance level, but harder to explain at the level of individual predictions. In this chapter, we looked at local interpretation of individual samples using the SHAP, ELI5, and LIME libraries.

In the next chapter, we will cover local and global interpretations for nonlinear models. Nonlinear models capture nonlinearity present in the data and can therefore be more complex to interpret; hence, we need a set of frameworks to explain them.

References

  1. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.