A supervised learning model trains an algorithm to map input data to output data. Supervised learning models come in two types: regression and classification. In a regression scenario, the output variable is numerical, whereas in classification, the output variable is binary or multinomial. A binary output variable has two outcomes, such as true/false, accept/reject, or yes/no. A multinomial output variable can have more than two outcomes, such as high, medium, and low. In this chapter, we are going to use explainability libraries to explain a regression model and a classification model while training linear models.

In the classical predictive modeling scenario, the function is predetermined, and the input data is fit to that function to produce the output. In the modern predictive modeling scenario, both the input data and the output are shown to a group of candidate functions, and the machine identifies the function that best approximates the output for a given set of inputs. There is a need to explain the output of machine learning and deep learning models when performing regression and classification tasks. Linear regression and linear classification models are simpler to explain.

The goal of this chapter is to introduce various explainability techniques for linear models, such as feature importance, partial dependency plots, and local interpretation.

## Recipe 2-1. SHAP Values for a Regression Model on All Numerical Input Variables

### Problem

You want to explain a regression model built on all the numeric features of a dataset.

### Solution

A regression model on all the numeric features is trained, and then the trained model will be passed through SHAP to generate global explanations and local explanations.

### How It Works

1. SHAP requires model retraining on all feature subsets; hence, it usually takes time if explanations have to be generated for larger datasets.
2. Identify a feature subset from the list of features (say there are 15 features; we can select a subset with 5 features).
3. For any particular feature, two models are created using the subset: one with the feature and another without it.
4. The difference between the two predictions is computed.
5. The prediction differences are computed for all possible subsets of features.
6. The weighted average of all the differences gives the feature importance.

If the weight of a feature is 0.000, we can conclude that the feature is not important and contributes nothing to the model. If it is not 0.000, we can conclude that the feature plays a role in the prediction process.
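The steps above can be sketched end to end for a tiny model. This is an illustrative, brute-force implementation of the Shapley computation (not the optimized SHAP library), retraining a model per feature subset on synthetic data:

```python
# Brute-force Shapley values via subset enumeration, as in steps 1-6 above.
# Synthetic data and the 4-feature setup are assumptions for illustration.
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)
x_row = X[0]  # the instance to explain

def model_prediction(subset, row):
    """Train on a feature subset and predict one row; the empty subset
    falls back to the mean of y (the baseline)."""
    if not subset:
        return y.mean()
    idx = list(subset)
    m = LinearRegression().fit(X[:, idx], y)
    return m.predict(row[idx].reshape(1, -1))[0]

def shapley_value(j, n_features=4):
    others = [k for k in range(n_features) if k != j]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            with_j = model_prediction(subset + (j,), x_row)     # step 3
            without_j = model_prediction(subset, x_row)
            # Shapley weight for a subset of this size (step 6)
            w = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            total += w * (with_j - without_j)                   # steps 4-5
    return total

phi = [shapley_value(j) for j in range(4)]
```

By the Shapley efficiency property, the values in `phi` sum to the difference between the full-model prediction and the baseline.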

We are going to use a dataset from the UCI machine learning repository. The URL to access the dataset is as follows:

https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction

Feature Description from the Energy Prediction Dataset

| Feature Name | Description | Unit |
|---|---|---|
| Appliances | Energy use | In Wh |
| Lights | Energy use of light fixtures in the house | In Wh |
| T1 | Temperature in kitchen area | In Celsius |
| RH_1 | Humidity in kitchen area | In % |
| T2 | Temperature in living room area | In Celsius |
| RH_2 | Humidity in living room area | In % |
| T3 | Temperature in laundry room area | In Celsius |
| RH_3 | Humidity in laundry room area | In % |
| T4 | Temperature in office room | In Celsius |
| RH_4 | Humidity in office room | In % |
| T5 | Temperature in bathroom | In Celsius |
| RH_5 | Humidity in bathroom | In % |
| T6 | Temperature outside the building (north side) | In Celsius |
| RH_6 | Humidity outside the building (north side) | In % |
| T7 | Temperature in ironing room | In Celsius |
| RH_7 | Humidity in ironing room | In % |
| T8 | Temperature in teenager room 2 | In Celsius |
| RH_8 | Humidity in teenager room 2 | In % |
| T9 | Temperature in parents room | In Celsius |
| RH_9 | Humidity in parents room | In % |
| T_out | Temperature outside (from the Chievres weather station) | In Celsius |
| Press_mm_hg | Pressure (from the Chievres weather station) | In mm Hg |
| RH_out | Humidity outside (from the Chievres weather station) | In % |
| Windspeed | Wind speed (from the Chievres weather station) | In m/s |
| Visibility | Visibility (from the Chievres weather station) | In km |
| Tdewpoint | Dew point temperature (from the Chievres weather station) | In Celsius |
| rv1 | Random variable 1 | Nondimensional |
| rv2 | Random variable 2 | Nondimensional |

This part of the script takes time, as it is a computationally intensive process. The explainer function calculates permutations, which means taking a feature set and generating the prediction difference between the presence of a feature and the absence of the same feature. For faster calculation, we can reduce the sample size to a smaller set, say 1,000 or 2,000 records. In the previous script, we are using the entire population of 19,735 records to calculate the SHAP values. This part of the script can be improved by applying Python multiprocessing, which is beyond the scope of this chapter.

The SHAP value for a specific feature *i* is just the difference between the expected model output and the partial dependence plot at the feature's value *x_i*. One of the fundamental properties of Shapley values is that they always sum up to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models, this means that the SHAP values of all the input features will always sum up to the difference between the baseline (expected) model output and the current model output for the prediction being explained.
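For a linear model this additivity property can be checked directly, since (assuming independent features) the SHAP value of feature *j* has the closed form `coef_[j] * (x_j - mean(x_j))`. A minimal check on synthetic data:

```python
# Verify SHAP additivity for a linear model: baseline + sum(phi) equals
# the model's prediction for the explained row. Synthetic data assumed.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0
model = LinearRegression().fit(X, y)

x = X[0]                                       # the prediction being explained
phi = model.coef_ * (x - X.mean(axis=0))       # per-feature SHAP values
baseline = model.predict(X).mean()             # expected model output
reconstructed = baseline + phi.sum()           # should equal the prediction
```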

## Recipe 2-2. SHAP Partial Dependency Plot for a Regression Model

### Problem

You want to get a partial dependency plot from SHAP.

### Solution

The solution to this problem is to use the partial dependency method (partial_dependence_plot) from the model.

### How It Works

The 20th record is selected from the training dataset. Figure 2-1 shows the partial dependency plot superimposed with the 20th record in red.

The predicted value for the 20th record is 140 Wh. The most influential features impacting the 20th record are RH_1, the humidity in the kitchen area in percentage, and RH_2, the humidity in the living room area. At the bottom of Figure 2-3, there are 14 features that are not very important for the 20th record's predicted value.
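The partial dependence computation itself can be sketched by hand: clamp one feature to each value on a grid, predict for every row, and average. A minimal illustration on synthetic stand-in data:

```python
# Manual partial dependence: vary one feature over a grid while holding
# the rest of each row fixed, then average the predictions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
model = LinearRegression().fit(X, y)

def partial_dependence(model, X, j, grid):
    pd_vals = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                      # clamp feature j to the grid value
        pd_vals.append(model.predict(Xv).mean())
    return np.array(pd_vals)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pd_curve = partial_dependence(model, X, 0, grid)
```

For a linear model, the resulting curve is a straight line whose slope is the feature's coefficient.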

## Recipe 2-3. SHAP Feature Importance for Regression Model with All Numerical Input Variables

### Problem

You want to calculate the feature importance using the SHAP values.

### Solution

The solution to this problem is to use SHAP absolute values from the model.

### How It Works

The feature importance values are not scaled; hence, the sum of the values across all features will not total 100.
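A minimal sketch of computing this tabular importance as the mean absolute SHAP value per feature, using the linear-model closed form on synthetic data:

```python
# Global importance as mean |SHAP| per feature; values are in the target's
# units and do not sum to 100. Synthetic data assumed for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)
model = LinearRegression().fit(X, y)

phi = model.coef_ * (X - X.mean(axis=0))   # SHAP values, one row per record
importance = np.abs(phi).mean(axis=0)      # mean |SHAP| per feature
ranking = np.argsort(importance)[::-1]     # features in descending importance
```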

## Recipe 2-4. SHAP Values for a Regression Model on All Mixed Input Variables

### Problem

How do you estimate SHAP values when you introduce categorical variables along with numerical variables, giving a mixed set of input features?

### Solution

The solution is that the mixed input variables that have numeric features as well as categorical or binary features can be modeled together. As the number of features increases, the time to compute all the permutations will also increase.
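A hedged sketch of preparing such mixed inputs: one-hot encode the categorical columns so they can be modeled alongside the numeric ones. The column names (powerBhp, age, fuel) are illustrative stand-ins for the automobile data:

```python
# One-hot encoding a categorical feature so numeric and categorical
# variables can be modeled together. Column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "powerBhp": rng.uniform(50, 300, 400),
    "age": rng.integers(1, 15, 400),
    "fuel": rng.choice(["Petrol", "Diesel"], 400),   # categorical column
})
df["price"] = 0.1 * df["powerBhp"] - 0.5 * df["age"] + (df["fuel"] == "Diesel") * 2.0

# Dummy-encode the categorical column (drop_first avoids collinearity)
X = pd.get_dummies(df.drop(columns="price"), drop_first=True)
model = LinearRegression().fit(X, df["price"])
```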

### How It Works


## Recipe 2-5. SHAP Partial Dependency Plot for Regression Model for Mixed Input

### Problem

You want to plot the partial dependency plot and interpret the graph for numeric and categorical dummy variables.

### Solution

The partial dependency plot shows the correlation between the feature and the predicted output of the target variables. There are two ways we can showcase the results, one with a feature and expected value of the prediction function and the other with superimposing a data point on the partial dependency plot.

### How It Works

If the location is Kolkata, then the price is expected to be lower. The reason for the difference between the two locations lies in the data used to train the model. The previous three figures show the global importance of a feature versus the prediction function. As an example, only two features are taken into consideration; we can use all the features one by one and display many graphs to gain more understanding of the predictions.

For the selected record, the predicted price is 22.542; powerBhp stands out as the most important feature, and manual transmission is the second most important feature.

## Recipe 2-6. SHAP Feature Importance for a Regression Model with All Mixed Input Variables

### Problem

You want to get the global feature importance from SHAP values using mixed-input feature data.

### Solution

The solution to this problem is to use absolute values and sort them in descending order.

### How It Works

At a high level, for the linear model used to predict the price of automobiles, the features listed previously are important, the highest being powerBhp, followed by the age of the car, petrol fuel type, manual transmission type, etc. The tabular output shows global feature importance.

## Recipe 2-7. SHAP Strength for Mixed Features on the Predicted Output for Regression Models

### Problem

You want to know the impact of a feature on the model function.

### Solution

The solution to this problem is to use a beeswarm plot, which displays points in blue and red.

### How It Works

## Recipe 2-8. SHAP Values for a Regression Model on Scaled Data

### Problem

You don’t know whether getting SHAP values on scaled data is better than the unscaled numerical data.

### Solution

The solution to this problem is to use a numerical dataset and generate local and global explanations after applying the standard scaler to the data.

### How It Works

| Permutation explainer | Iterations | Time |
|---|---|---|
| Unscaled data | 19,736 | 15:22 (21.23 it/s) |
| Scaled data | 19,736 | 08:53 (36.22 it/s) |
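A sketch of the scaling step itself. Note that for a linear regression the fitted predictions are identical on scaled and unscaled data; only the coefficient magnitudes (and, as the timings above suggest, the explainer run time) change:

```python
# Standard-scaling the inputs before training; predictions are unchanged
# for a linear model. Synthetic data assumed for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
X = rng.normal(loc=20.0, scale=5.0, size=(500, 4))
y = X @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(scale=0.1, size=500)

raw = LinearRegression().fit(X, y)

scaler = StandardScaler()
Xs = scaler.fit_transform(X)           # zero mean, unit variance per column
scaled = LinearRegression().fit(Xs, y)
```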

## Recipe 2-9. LIME Explainer for Tabular Data

### Problem

You want to know how to generate explainability at a local level in a focused manner rather than at a global level.

### Solution

The solution to this problem is to use the LIME library. LIME is a model-agnostic technique; it fits a local surrogate model around the instance being explained while running the explainer. LIME localizes the problem and explains the model at a local level.

### How It Works

## Recipe 2-10. ELI5 Explainer for Tabular Data

### Problem

You want to use the ELI5 library for generating explanations of a linear regression model.

### Solution

ELI5 is a Python package that helps to debug a machine learning model and explain the predictions. It provides support for all machine learning models supported by the scikit-learn library.

### How It Works

**y** top features

| Weight | Feature |
|---|---|
| | <BIAS> |
| | RH_1 |
| | T3 |
| | T6 |
| | Tdewpoint |
| | RH_3 |
| | T8 |
| | Lights |
| | RH_6 |
| | Windspeed |
| | T7 |
| | RH_9 |
| | T4 |
| | RH_7 |
| | RH_out |
| | RH_8 |
| | T9 |
| | T2 |
| | T_out |
| | RH_2 |

The results table has a BIAS value as a feature. This can be interpreted as the intercept term of the linear regression model. The other features are listed in descending order of importance based on their weights. The show_weights function provides a global interpretation of the model, and the show_prediction function provides a local interpretation by taking into account a record from the training set.

## Recipe 2-11. How the Permutation Model in ELI5 Works

### Problem

You want to make sense of the ELI5 permutation library.

### Solution

The solution to this problem is to use a dataset and a trained model.

### How It Works

The permutation model in the ELI5 library works only for global interpretation. First, it takes a baseline linear regression model trained on the training dataset and computes its error. Then it shuffles the values of one feature, recomputes the predictions, and computes the error again. It compares the error before and after shuffling. A feature can be considered important if the error delta after shuffling is high and unimportant if the error delta is low. The result displays the average importance of each feature and its standard deviation across multiple shuffle iterations.
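The shuffle-and-score procedure described above can be sketched by hand:

```python
# Manual permutation importance: mean/std increase in MSE when one
# feature's values are shuffled. Synthetic data assumed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(19)
X = rng.normal(size=(800, 3))
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=800)
model = LinearRegression().fit(X, y)

def permutation_importance(model, X, y, j, n_repeats=10, seed=0):
    """Average error delta across n_repeats shuffles of feature j."""
    shuffler = np.random.default_rng(seed)
    base_error = mean_squared_error(y, model.predict(X))
    deltas = []
    for _ in range(n_repeats):
        Xp = X.copy()
        shuffler.shuffle(Xp[:, j])       # break the feature/target link
        deltas.append(mean_squared_error(y, model.predict(Xp)) - base_error)
    return float(np.mean(deltas)), float(np.std(deltas))

scores = [permutation_importance(model, X, y, j) for j in range(3)]
```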

## Recipe 2-12. Global Explanation for Logistic Regression Models

### Problem

You want to explain the predictions generated from a logistic regression model.

### Solution

The logistic regression model is also known as a classification model as we model the probabilities from either a binary classification or a multinomial classification variable. In this particular recipe, we are using a churn classification dataset that has two outcomes: whether the customer is likely to churn or not.

### How It Works
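A hedged sketch of fitting such a churn classifier; the synthetic data and column meanings are illustrative stand-ins, not the book's dataset:

```python
# Logistic regression on synthetic churn-like data: churn probability rises
# with the number of service calls and with heavy daytime usage.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(23)
n = 1000
service_calls = rng.poisson(1.5, n)
day_minutes = rng.normal(180, 50, n)
X = np.column_stack([service_calls, day_minutes])

logit = 0.8 * (service_calls - 1.5) + 0.01 * (day_minutes - 180) - 1.0
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)   # column 1 is the churn probability
```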

## Recipe 2-13. Partial Dependency Plot for a Classifier

### Problem

You want to show feature associations with the class probabilities.

### Solution

The class probabilities in this example are related to predicting the probability of churn. The SHAP value for a feature can be plotted against the feature value to show a scatter chart that displays the correlation (positive or negative) and strength of associations.

### How It Works

## Recipe 2-14. Global Feature Importance from the Classifier

### Problem

You want to get the global feature importance for the logistic regression model.

### Solution

The solution to this problem is to use a bar plot and beeswarm plot and heat map.

### How It Works

The interpretation goes like this: when we change the value of a feature by one unit, the model equation produces two odds, the base odds and the odds at the incremented feature value. We are looking at the ratio of those odds, which tells us how the odds change with every increase or decrease in the value of a feature. From the global feature importance, there are three important features: the number of customer service calls, the total minutes for the day, and the number of voicemail messages.
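The odds-ratio interpretation can be made concrete: exponentiating a logistic regression coefficient gives the multiplicative change in the odds for a one-unit increase in that feature. A small sketch on synthetic data:

```python
# Odds ratios from logistic regression coefficients. Synthetic data with
# one positive and one negative true effect, assumed for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(29)
X = rng.normal(size=(1500, 2))
logit = 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.uniform(size=1500) < 1 / (1 + np.exp(-logit))).astype(int)

clf = LogisticRegression(C=1e6).fit(X, y)   # weak regularization
odds_ratios = np.exp(clf.coef_[0])
# odds_ratios[0] > 1: feature 0 raises the odds of the positive class
# odds_ratios[1] < 1: feature 1 lowers them
```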

## Recipe 2-15. Local Explanations Using LIME

### Problem

You want to get faster explanations from both global and local explainable libraries.

### Solution

The model explanation can be done using SHAP; however, one of the limitations of SHAP is that we cannot use the full dataset to create global and local explanations, and even if we do, it usually takes a long time. Hence, LIME is very useful to speed up the process of generating local explanations in a scenario where millions of records have been used to train a model.

### How It Works

| Feature | Weight |
|---|---|
| number_customer_service_calls > 2.00 | 0.153089 |
| total_day_minutes > 213.80 | 0.111146 |
| number_vmail_messages <= 0.00 | 0.096100 |
| total_intl_calls <= 3.00 | 0.031770 |
| total_day_calls <= 86.00 | 0.029375 |
| 99.00 < total_night_calls <= 113.00 | -0.023965 |
| account_length > 126.00 | -0.015756 |
| 88.00 < total_eve_calls <= 101.00 | 0.008756 |
| total_intl_minutes <= 8.60 | -0.007205 |
| 200.00 < total_eve_minutes <= 232.00 | 0.004123 |
| total_intl_charge <= 2.32 | -0.001375 |
| total_day_charge > 36.35 | 0.001081 |
| 200.20 < total_night_minutes <= 234.80 | -0.000134 |
| 0.00 < area_code_tr <= 1.00 | -0.000081 |
| 9.01 < total_night_charge <= 10.57 | -0.000067 |

In a similar fashion, the graphs can be generated for different data points from the training sample as well as the test sample.

## Recipe 2-16. Model Explanations Using ELI5

### Problem

You want to get model explanations using the ELI5 library.

### Solution

ELI5 provides two functions to show weights and make predictions to generate model explanations.

### How It Works

**y=1** top features

| Weight | Feature |
|---|---|
| +0.449 | number_customer_service_calls |
| +0.010 | total_day_minutes |
| +0.009 | total_intl_minutes |
| +0.002 | total_intl_charge |
| +0.002 | total_eve_minutes |
| +0.001 | total_day_charge |
| +0.000 | total_eve_charge |
| -0.000 | total_night_charge |
| -0.001 | total_night_minutes |
| -0.002 | account_length |
| -0.006 | area_code_tr |
| -0.008 | total_day_calls |
| -0.017 | total_eve_calls |
| -0.017 | total_night_calls |
| -0.034 | <BIAS> |
| -0.037 | number_vmail_messages |
| -0.087 | total_intl_calls |


**y=0** (probability **0.788**, score **-1.310**) top features

| Contribution | Feature |
|---|---|
| +2.458 | total_night_calls |
| +1.289 | total_eve_calls |
| +0.698 | total_day_calls |
| +0.304 | account_length |
| +0.174 | total_intl_calls |
| +0.127 | total_night_minutes |
| +0.034 | <BIAS> |
| +0.006 | area_code_tr |
| +0.002 | total_night_charge |
| -0.004 | total_intl_charge |
| -0.005 | total_eve_charge |
| -0.057 | total_intl_minutes |
| -0.064 | total_day_charge |
| -0.304 | total_eve_minutes |
| -0.449 | number_customer_service_calls |
| -2.899 | total_day_minutes |

| Weight | Feature |
|---|---|
| 0.0066 ± 0.0139 | number_customer_service_calls |
| 0.0066 ± 0.0024 | number_vmail_messages |
| 0.0030 ± 0.0085 | total_eve_calls |
| 0.0030 ± 0.0085 | total_day_minutes |
| 0.0006 ± 0.0088 | total_day_calls |
| 0 ± 0.0000 | area_code_tr |
| 0 ± 0.0000 | total_intl_charge |
| 0 ± 0.0000 | total_night_charge |
| 0 ± 0.0000 | total_eve_charge |
| -0.0012 ± 0.0048 | total_intl_calls |
| -0.0012 ± 0.0029 | total_intl_minutes |
| -0.0024 ± 0.0096 | account_length |
| -0.0024 ± 0.0024 | total_day_charge |
| -0.0036 ± 0.0045 | total_night_minutes |
| -0.0042 ± 0.0061 | total_eve_minutes |
| -0.0048 ± 0.0072 | total_night_calls |

## Conclusion

In this chapter, we covered how to interpret linear supervised models for regression and classification. Linear models are simpler to interpret at a global level, that is, at the feature importance level, but harder to explain at the local level. In this chapter, we looked at local interpretation for samples using the SHAP, ELI5, and LIME libraries.

In the next chapter, we will cover the local and global interpretations for nonlinear models. The nonlinear models cover nonlinearity existing in data and thereby can be complex to interpret. Hence, we need a set of frameworks to explain the nonlinearity in a model.

## References

1. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.