Ensemble models are effective when individual models fail to balance bias and variance on a training dataset. In an ensemble, the predictions of several models are aggregated to generate the final prediction. For supervised regression problems, many models are generated, and the average of all their predictions is taken as the final prediction. Similarly, for supervised classification problems, multiple models are trained, and each model generates a classification; the final prediction is decided by the majority voting rule. Because of this aggregation, ensemble models are harder to explain to end users. That is why we need frameworks that can explain the ensemble models.
Ensemble means a grouping of model predictions. There are three types of ensemble models: bagging, boosting, and stacking. Bagging stands for bootstrap aggregation: bootstrap samples of the training data (and, in models such as random forest, subsets of the features) are drawn, a model is trained and predictions are generated on each sample, the process is repeated a number of times, and the predictions are averaged to produce the final prediction. Random forest is one of the most important and popular bagging models.
Boosting is a sequential method for boosting the predictive power of the model. It starts with a base classifier trained on the data to predict and classify the output. In the next step, the cases the model gets wrong receive more attention: they are given higher weight (or their residual errors are modeled), and the next model is trained to correct them. This process continues as long as there is scope to improve and boost the accuracy further. When the accuracy cannot be boosted any further, the iterations stop, and the final accuracy is reported.
Stacking is a process of generating predictions from different sets of models and combining those predictions through a meta-model that is trained on them.
The goal of this chapter is to introduce various explainability techniques and libraries for ensemble models, covering feature importance, partial dependency plots, and local and global interpretation of the models.
Recipe 4-1. Explainable Boosting Machine Interpretation
Problem
You want to explain the explainable boosting machine (EBM) as an ensemble model and generate global and local interpretations for it.
Solution
EBMs are tree-based, cyclic, gradient boosting–based generalized additive models (GAMs) with automatic interaction detection. EBMs are interpretable even though boosting models are typically black box by nature. We need an additional library known as interpret-core.
How It Works
1. SHAP requires model retraining on all feature subsets; hence, usually it takes time if the explanation has to be generated for larger datasets.
2. Identify a feature set from a list of features (let’s say there are 15 features; we can select a subset with 5 features).
3. For any particular feature, two models using the subset of features will be created, one with the feature and another without the feature.
4. The prediction differences will be computed.
5. The differences in prediction are computed for all possible subsets of features.
6. The weighted average value of all possible differences is used to populate the feature importance.
If the weight of a feature is 0.000, then we can conclude that the feature is not important and has not contributed to the model. If it is not equal to 0.000, then we can conclude that the feature has a role to play in the prediction process.
We are going to use a dataset from the UCI machine learning repository. The URL to access the dataset is as follows:
https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction
Feature Description from the Energy Prediction Dataset
Feature Name | Description | Unit |
---|---|---|
Appliances | Energy use | In Wh |
Lights | Energy use of light fixtures in the house | In Wh |
T1 | Temperature in kitchen area | In Celsius |
RH_1 | Humidity in kitchen area | In % |
T2 | Temperature in living room area | In Celsius |
RH_2 | Humidity in living room area | In % |
T3 | Temperature in laundry room area | In Celsius |
RH_3 | Humidity in laundry room area | In % |
T4 | Temperature in office room | In Celsius |
RH_4 | Humidity in office room | In % |
T5 | Temperature in bathroom | In Celsius |
RH_5 | Humidity in bathroom | In % |
T6 | Temperature outside the building (north side) | In Celsius |
RH_6 | Humidity outside the building (north side) | In % |
T7 | Temperature in ironing room | In Celsius |
RH_7 | Humidity in ironing room | In % |
T8 | Temperature in teenager room 2 | In Celsius |
RH_8 | Humidity in teenager room 2 | In % |
T9 | Temperature in parents room | In Celsius |
RH_9 | Humidity in parents room | In % |
T_out | Temperature outside (from the Chievres weather station) | In Celsius |
Press_mm_hg | Pressure (from the Chievres weather station) | In mm Hg |
RH_out | Humidity outside (from the Chievres weather station) | In % |
Windspeed | Wind speed (from the Chievres weather station) | In m/s |
Visibility | Visibility (from the Chievres weather station) | In km |
Tdewpoint | Dew point temperature (from the Chievres weather station) | In °C |
rv1 | Random variable 1 | Nondimensional |
rv2 | Random variable 2 | Nondimensional |
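Using the dataset described above, the following is a minimal sketch of this recipe: it trains an explainable boosting regressor and produces global and local explanations with the interpret library. The CSV file name, the dropped date column, and the train/test split settings are assumptions on my part; the Appliances target follows the feature table.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingRegressor
from interpret import show

# Load the UCI appliances energy prediction data (file name assumed).
df = pd.read_csv("energydata_complete.csv")
y = df["Appliances"]                          # target: appliance energy use in Wh
X = df.drop(columns=["Appliances", "date"])   # keep the numeric sensor features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the explainable boosting machine (a cyclic gradient-boosted GAM).
ebm = ExplainableBoostingRegressor(random_state=42)
ebm.fit(X_train, y_train)

# Global explanation: per-feature importance and shape functions.
show(ebm.explain_global(name="EBM global explanation"))

# Local explanation: per-record contributions for a few test rows.
show(ebm.explain_local(X_test[:5], y_test[:5], name="EBM local explanation"))
```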
Recipe 4-2. Partial Dependency Plot for Tree Regression Models
Problem
You want to get a partial dependency plot from a boosting model.
Solution
The solution to this problem is to use a partial dependency plot from the model using SHAP.
How It Works
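A minimal sketch of how the SHAP partial dependency plot could be produced for a tree regression model follows. The gradient boosting regressor, the file name, and the choice of the T1 feature are illustrative assumptions, and shap.partial_dependence_plot requires a reasonably recent SHAP release.

```python
import shap
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("energydata_complete.csv")   # file name assumed
X = df.drop(columns=["Appliances", "date"])
y = df["Appliances"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# Partial dependency of the prediction on the kitchen temperature (T1):
# the x-axis is the feature value, the y-axis the expected model output.
shap.partial_dependence_plot(
    "T1", model.predict, X_test,
    ice=False, model_expected_value=True, feature_expected_value=True)
```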
Recipe 4-3. Explain an Extreme Gradient Boosting Model with All Numerical Input Variables
Problem
You want to explain the extreme gradient boosting–based regressor.
Solution
The XGB regressor can be explained using the SHAP library; we can populate the global and local interpretations.
How It Works
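The following sketch walks through the typical SHAP workflow for an XGBoost regressor trained on the all-numeric energy features; the file name and hyperparameter defaults are assumptions.

```python
import shap
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("energydata_complete.csv")    # file name assumed
X = df.drop(columns=["Appliances", "date"])    # all-numeric inputs
y = df["Appliances"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBRegressor(random_state=42).fit(X_train, y_train)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global interpretation: mean |SHAP value| per feature and the beeswarm view.
shap.summary_plot(shap_values, X_test, plot_type="bar")
shap.summary_plot(shap_values, X_test)

# Local interpretation: force plot for a single prediction.
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0, :], X_test.iloc[0, :])
```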
Recipe 4-4. Explain a Random Forest Regressor with Global and Local Interpretations
Problem
Random forest (RF) is a bagging approach for creating ensemble models. Because the final prediction is produced by many trees, it is difficult to interpret which tree drove a particular prediction. You want to generate global and local interpretations for a random forest regressor.
Solution
We are going to use the tree explainer from the SHAP library.
How It Works
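A minimal sketch using the newer SHAP Explanation API with a random forest regressor is shown below; the forest size, the 200-row explanation sample, and the file name are assumptions chosen to keep the runtime modest.

```python
import shap
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("energydata_complete.csv")    # file name assumed
X = df.drop(columns=["Appliances", "date"])
y = df["Appliances"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Keep the forest modest so SHAP computation stays fast on a small machine.
rf = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=42)
rf.fit(X_train, y_train)

# The generic Explainer dispatches to the tree explainer for a forest.
explainer = shap.Explainer(rf)
sv = explainer(X_test.iloc[:200])          # SHAP values for a small sample

shap.plots.beeswarm(sv)                    # global interpretation
shap.plots.waterfall(sv[0])                # local interpretation of one record
```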
Recipe 4-5. Explain the Catboost Regressor with Global and Local Interpretations
Problem
Catboost is another boosting model that speeds up the training process by explicitly declaring the categorical features. If there are no categorical features, then the model is trained on the numeric features alone. You want to explain the global and local interpretations from the catboost regression model.
Solution
We are going to use the tree explainer from the SHAP library and the catboost library.
How It Works
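Here is a hedged sketch for the catboost regressor on the same (all-numeric) energy data; the file name, the iteration count, and the depth are assumptions.

```python
import shap
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("energydata_complete.csv")    # file name assumed
X = df.drop(columns=["Appliances", "date"])
y = df["Appliances"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No categorical features here, so CatBoost trains on the numeric columns only.
model = CatBoostRegressor(iterations=500, depth=6, verbose=False, random_state=42)
model.fit(X_train, y_train)

# SHAP's tree explainer supports CatBoost models directly.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test, plot_type="bar")   # global importance
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0, :], X_test.iloc[0, :])  # local
```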
Recipe 4-6. Explain the EBM Classifier with Global and Local Interpretations
Problem
An EBM (explainable boosting machine) can also be used as a classifier. You want to explain the global and local interpretations from the EBM classifier model.
Solution
We are going to use the interpret library, which provides built-in global and local explanation functions for EBM models.
How It Works
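A minimal sketch for the EBM classifier follows; a binary churn dataset is assumed here, and the CSV file name and the churn target column are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

# A binary churn dataset is assumed; the file and target column names are placeholders.
df = pd.read_csv("churn.csv")
y = df["churn"]
X = df.drop(columns=["churn"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

ebm = ExplainableBoostingClassifier(random_state=42)
ebm.fit(X_train, y_train)

# Global interpretation: overall feature importance and per-feature shape functions.
show(ebm.explain_global(name="EBM classifier - global"))

# Local interpretation: contribution of each feature to individual predictions.
show(ebm.explain_local(X_test[:5], y_test[:5], name="EBM classifier - local"))
```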
Recipe 4-7. SHAP Partial Dependency Plot for Regression Models with Mixed Input
Problem
You want to plot the partial dependency plot and interpret the graph for numeric and categorical dummy variables.
Solution
The partial dependency plot shows the relationship between a feature and the predicted output of the target variable. There are two ways we can showcase the results: one plots a feature against the expected value of the prediction function, and the other superimposes a data point on the partial dependency plot.
How It Works
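The sketch below produces the two SHAP partial dependency views for the used-car price model, one for a numeric feature and one for a categorical dummy variable. The file name, the Price target column, and the use of pd.get_dummies are assumptions; powerBhp and Location_Kolkata come from the feature importance tables later in this chapter.

```python
import shap
import pandas as pd
from xgboost import XGBRegressor

# Used-car price data (file and target column names assumed).
df = pd.read_csv("used_cars.csv")
X = pd.get_dummies(df.drop(columns=["Price"]))     # numeric + dummy variables
y = df["Price"]

model = XGBRegressor(random_state=42).fit(X, y)

# Partial dependency for a numeric feature and for a categorical dummy variable.
shap.partial_dependence_plot("powerBhp", model.predict, X,
                             ice=False, model_expected_value=True)
shap.partial_dependence_plot("Location_Kolkata", model.predict, X,
                             ice=False, model_expected_value=True)
```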
If the location is Kolkata, the predicted price is expected to stay the same; the dummy variable has no impact on the price.
Recipe 4-8. SHAP Feature Importance for Tree Regression Models with Mixed Input Variables
Problem
You want to get the global feature importance from SHAP values using mixed input feature data.
Solution
The solution to this problem is to use the absolute SHAP values, sort them in descending order, and populate them in a waterfall chart, a beeswarm chart, a scatter plot, etc.
How It Works
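A minimal sketch of computing global SHAP feature importance for the mixed-input car price model is given below; the file name, the Price target, and the XGBoost regressor are assumptions, while the sorting by mean absolute SHAP value follows the solution above.

```python
import shap
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

df = pd.read_csv("used_cars.csv")                  # file name assumed
X = pd.get_dummies(df.drop(columns=["Price"]))     # mixed inputs, dummy encoded
y = df["Price"]

model = XGBRegressor(random_state=42).fit(X, y)
explainer = shap.Explainer(model)
sv = explainer(X)

# Global feature importance = mean of the absolute SHAP values, sorted descending.
importance = (pd.Series(np.abs(sv.values).mean(axis=0), index=X.columns)
                .sort_values(ascending=False))
print(importance)

shap.plots.bar(sv)        # bar chart of mean |SHAP value|
shap.plots.beeswarm(sv)   # distribution of SHAP values per feature
```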
At a high level, for the tree-based nonlinear model used to predict the price of the automobiles, the preceding features are important. The most important are powerBhp, the age of the car, the petrol fuel type, and the manual transmission type. The preceding tabular output shows the global feature importance.
Recipe 4-9. Explaining the XGBoost Model
Problem
You want to generate explainability for an XGBoost model for regression.
Solution
The XGBoost regressor is trained with 100 trees and a max depth parameter of 3 using a dataset that contains both numerical and categorical features. The total number of features is 23; an ideal dataset for XGBoost would have more than 50 features. However, that requires more computation time.
How It Works
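A sketch of training the XGBoost regressor with the hyperparameters named in the solution and explaining it with SHAP follows; the file name and target column are assumptions.

```python
import shap
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("used_cars.csv")                  # file name assumed
X = pd.get_dummies(df.drop(columns=["Price"]))     # mixed features after encoding
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees with a shallow depth of 3, as described in the solution.
model = XGBRegressor(n_estimators=100, max_depth=3, random_state=42)
model.fit(X_train, y_train)

explainer = shap.Explainer(model)
sv = explainer(X_test)

shap.plots.bar(sv)                        # global importance
shap.plots.scatter(sv[:, "powerBhp"])     # SHAP values against one feature's values
shap.plots.waterfall(sv[0])               # local explanation of the first test record
```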
Recipe 4-10. Random Forest Regressor for Mixed Data Types
Problem
You want to generate explainability for a random forest model using numeric as well as categorical features.
Solution
Random forest is useful when we have more features, say, more than 50; however, for this recipe, it is applied to 23 features. We could pick a larger dataset, but that would require more computation and may take more time to train. So, be cognizant of the model configuration when the model is being trained on a smaller machine.
How It Works
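The sketch below one-hot encodes the categorical columns, trains a modest random forest, and explains a sample of predictions with SHAP; the file name, target column, and sample size are assumptions made to keep runtime small.

```python
import shap
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("used_cars.csv")                  # file name assumed
# One-hot encode the categorical columns so the forest sees only numeric inputs.
X = pd.get_dummies(df.drop(columns=["Price"]))
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A modest configuration keeps training and SHAP computation manageable
# on a smaller machine.
rf = RandomForestRegressor(n_estimators=200, max_depth=10, n_jobs=-1,
                           random_state=42)
rf.fit(X_train, y_train)

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test.iloc[:500])   # sample to limit runtime
shap.summary_plot(shap_values, X_test.iloc[:500])
```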
Recipe 4-11. Explaining the Catboost Model
Problem
You want to generate explainability for a dataset where most of the features are categorical, using a boosting model that handles categorical variables directly.
Solution
The catboost model is known to work well when we have more categorical variables than numeric variables. Hence, we can use the catboost regressor.
How It Works
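Here is a hedged sketch that declares the categorical columns to CatBoost explicitly instead of dummy encoding them; the file name, the target column, and the hyperparameters are assumptions.

```python
import pandas as pd
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split

df = pd.read_csv("used_cars.csv")                  # file name assumed
y = df["Price"]
X = df.drop(columns=["Price"])

# CatBoost handles categorical columns natively when they are declared explicitly.
cat_features = [c for c in X.columns if X[c].dtype == "object"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = CatBoostRegressor(iterations=1000, learning_rate=0.05, depth=6,
                          verbose=False, random_state=42)
model.fit(X_train, y_train, cat_features=cat_features,
          eval_set=(X_test, y_test))

# Built-in feature importance; SHAP's tree explainer can also be used as in Recipe 4-5.
pool = Pool(X_test, y_test, cat_features=cat_features)
print(model.get_feature_importance(pool, prettified=True))
```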
Recipe 4-12. LIME Explainer for the Catboost Model and Tabular Data
Problem
You want to generate explainability at a local level in a focused manner rather than at a global level.
Solution
The solution to this problem is to use the LIME library. LIME is a model-agnostic technique: it perturbs the input and fits a local surrogate model around the prediction being explained while running the explainer. LIME localizes a problem and explains the model at a local level.
How It Works
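A minimal sketch of LIME applied to a catboost regressor on the car price data follows; the file name, target column, and number of features shown are assumptions, and the categorical columns are dummy encoded so LIME can work on a numeric array.

```python
import pandas as pd
from catboost import CatBoostRegressor
from lime.lime_tabular import LimeTabularExplainer
from sklearn.model_selection import train_test_split

df = pd.read_csv("used_cars.csv")                    # file name assumed
X = pd.get_dummies(df.drop(columns=["Price"]))       # LIME works on numeric arrays
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = CatBoostRegressor(iterations=500, verbose=False, random_state=42)
model.fit(X_train, y_train)

# LIME perturbs the record, scores the perturbations with the trained model,
# and fits a local surrogate model around that single prediction.
explainer = LimeTabularExplainer(
    X_train.values, feature_names=list(X_train.columns), mode="regression")

exp = explainer.explain_instance(X_test.values[0], model.predict, num_features=10)
print(exp.as_list())        # or exp.show_in_notebook() inside Jupyter
```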
Recipe 4-13. ELI5 Explainer for Tabular Data
Problem
You want to use the ELI5 library for generating explanations of a linear regression model.
Solution
ELI5 is a Python package that helps to debug a machine learning model and explain the predictions. It provides support for all machine learning models supported by the scikit-learn library.
How It Works
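The following sketch shows the two ELI5 calls that produce outputs like the weight tables below; the file name, target column, and choice of an XGBoost regressor are assumptions, and show_weights/show_prediction render as HTML inside Jupyter.

```python
import eli5
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("used_cars.csv")                    # file name assumed
X = pd.get_dummies(df.drop(columns=["Price"]))
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBRegressor(random_state=42).fit(X_train, y_train)

# Global interpretation: weights of every feature, as in the tables below.
eli5.show_weights(model, feature_names=list(X.columns))

# Local interpretation: per-feature contributions (including BIAS) for one record.
eli5.show_prediction(model, X_test.iloc[0], feature_names=list(X.columns),
                     show_feature_values=True)
```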
Weight | Feature |
---|---|
0.4385 | powerBhp |
0.2572 | Age |
0.0976 | engineCC |
0.0556 | Odometer |
0.0489 | Mileage |
0.0396 | Transmission_Manual |
0.0167 | FuelType_Petrol |
0.0165 | FuelType_Diesel |
0.0104 | Location_Hyderabad |
0.0043 | Location_Coimbatore |
0.0043 | Location_Kolkata |
0.0035 | Location_Kochi |
0.0025 | Location_Bangalore |
0.0021 | Location_Mumbai |
0.0014 | Location_Delhi |
0.0006 | OwnerType_Third |
0.0003 | OwnerType_Second |
0.0000 | FuelType_Electric |
0 | OwnerType_Fourth & Above
0 | Location_Pune |
Weight | Feature |
---|---|
0.6743 ± 0.0242 | powerBhp |
0.2880 ± 0.0230 | Age |
0.1188 ± 0.0068 | engineCC |
0.0577 ± 0.0028 | Transmission_Manual |
0.0457 ± 0.0048 | Odometer |
0.0354 ± 0.0053 | mileage |
0.0134 ± 0.0018 | Location_Hyderabad |
0.0082 ± 0.0022 | FuelType_Petrol |
0.0066 ± 0.0013 | FuelType_Diesel |
0.0042 ± 0.0010 | Location_Kochi |
0.0029 ± 0.0006 | Location_Kolkata |
0.0023 ± 0.0010 | Location_Coimbatore |
0.0017 ± 0.0002 | Location_Bangalore |
0.0014 ± 0.0005 | Location_Mumbai |
0.0014 ± 0.0006 | Location_Delhi |
0.0007 ± 0.0001 | OwnerType_Third |
0.0002 ± 0.0000 | OwnerType_Second |
0.0000 ± 0.0000 | FuelType_Electric |
0 ± 0.0000 | Location_Chennai |
0 ± 0.0000 | FuelType_LPG |
The results table has a BIAS value as a feature. This can be interpreted as the intercept term of a linear regression model. The other features are listed in descending order of importance based on their weight. The show_weights function provides a global interpretation of the model, and the show_prediction function provides a local interpretation by taking into account a record from the training set.
Recipe 4-14. How the Permutation Model in ELI5 Works
Problem
You want to make sense of the permutation importance module of the ELI5 library.
Solution
The solution to this problem is to use the PermutationImportance class from ELI5 together with a trained model and a held-out dataset.
How It Works
The permutation model in the ELI5 library works only for global interpretation. First it takes a trained baseline model and computes its error on the dataset. Then it shuffles the values of one feature, scores the model on the shuffled data (the model itself is not retrained), and recomputes the error. It compares the error before and after shuffling: a feature can be considered important if the error increases sharply after shuffling, and unimportant if the error barely changes. The result displays the average importance of the features and the standard deviation over multiple shuffle iterations.
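A minimal sketch of this procedure is shown below; the file name, target column, and the random forest used as the baseline model are assumptions, while PermutationImportance and show_weights are the actual ELI5 calls.

```python
import eli5
import pandas as pd
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("used_cars.csv")                    # file name assumed
X = pd.get_dummies(df.drop(columns=["Price"]))
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature several times on held-out data and measure how much the
# model's score degrades; the model itself is not retrained.
perm = PermutationImportance(model, random_state=42, n_iter=5).fit(X_test, y_test)

# Mean importance +/- standard deviation across the shuffles.
eli5.show_weights(perm, feature_names=list(X.columns))
```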
Recipe 4-15. Global Explanation for Ensemble Classification Models
Problem
You want to explain the predictions generated from a classification model using ensemble models.
Solution
A classification model predicts the probabilities of a binary or multinomial target variable. In this particular recipe, we are using a churn classification dataset that has two outcomes: whether the customer is likely to churn or not. Let's use ensemble models such as the explainable boosting machine classifier, the extreme gradient boosting classifier, the random forest classifier, and the catboost classifier.
How It Works
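A hedged sketch of training several ensemble classifiers on the churn data and producing a global SHAP explanation for one of them is shown below; the file name, the 0/1 churn target column, and the hyperparameters are assumptions, and the catboost classifier is covered separately in Recipe 4-20.

```python
import shap
import pandas as pd
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")                        # file name assumed
y = df["churn"]                                      # target assumed encoded as 0/1
X = pd.get_dummies(df.drop(columns=["churn"]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

models = {
    "ebm": ExplainableBoostingClassifier(random_state=42),
    "xgb": XGBClassifier(n_estimators=200, max_depth=3, random_state=42),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))

# Global SHAP explanation for one of the tree ensembles (the XGBoost classifier).
explainer = shap.TreeExplainer(models["xgb"])
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```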
Recipe 4-16. Partial Dependency Plot for a Nonlinear Classifier
Problem
You want to show feature associations with the class probabilities using a nonlinear classifier.
Solution
The class probabilities in this example are related to predicting the probability of churn. The SHAP value for a feature can be plotted against the feature value to show a scatter chart that displays the correlation, positive or negative, and the strength of associations.
How It Works
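A short sketch of the SHAP dependence plot for the churn classifier follows; the file name, the churn target, and the choice of total_day_minutes (taken from the feature tables later in this chapter) are assumptions.

```python
import shap
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")                        # file name assumed
y = df["churn"]                                      # target assumed encoded as 0/1
X = pd.get_dummies(df.drop(columns=["churn"]))
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = XGBClassifier(n_estimators=200, max_depth=3, random_state=42)
model.fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Scatter of SHAP values against feature values: an upward trend means the feature
# pushes the predicted churn probability higher as its value increases.
shap.dependence_plot("total_day_minutes", shap_values, X_test)
```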
Recipe 4-17. Global Feature Importance from the Nonlinear Classifier
Problem
You want to get the global feature importance for the decision tree classification model.
Solution
The solution to this problem is to use the SHAP explainer with log-odds output.
How It Works
Recipe 4-18. XGBoost Model Explanation
Problem
You want to explain an extreme gradient boosting model, which is a sequential boosting model.
Solution
The model explanation can be done using SHAP; however, one of the limitations of SHAP is that we cannot use the full data to create global and local explanations. We will take a subset of the data if a smaller machine is allocated and the full dataset if the machine configuration supports it.
How It Works
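The sketch below demonstrates the subsetting idea with shap.sample on an XGBoost classifier; the file name, the churn target, the subsample size, and the hyperparameters are assumptions.

```python
import shap
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")                        # file name assumed
y = df["churn"]                                      # target assumed encoded as 0/1
X = pd.get_dummies(df.drop(columns=["churn"]))
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = XGBClassifier(n_estimators=300, max_depth=4, random_state=42)
model.fit(X_train, y_train)

# On a small machine, explain a random subsample instead of the full dataset.
X_sample = shap.sample(X_test, 500, random_state=42)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sample)

shap.summary_plot(shap_values, X_sample)             # global
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0, :], X_sample.iloc[0, :])  # local
```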
Recipe 4-19. Explain a Random Forest Classifier
Problem
You want to get faster explanations from global and local explainable libraries using a random forest classifier. A random forest classifier creates a family of trees as estimators and combines their predictions using the majority voting rule.
Solution
The model explanation can be done using SHAP; however, one of the limitations of SHAP is that we cannot use the full data to create global and local explanations.
How It Works
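A sketch for the random forest classifier follows; note that, for scikit-learn classifiers, SHAP may return one array per class, so the positive (churn) class is selected explicitly. The file name, target column, and sample size are assumptions.

```python
import shap
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")                        # file name assumed
y = df["churn"]
X = pd.get_dummies(df.drop(columns=["churn"]))
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, max_depth=8, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

explainer = shap.TreeExplainer(rf)
X_sample = X_test.iloc[:500]                         # sample to limit runtime
shap_values = explainer.shap_values(X_sample)

# Older SHAP releases return a list with one array per class (index 1 = churn);
# newer releases may return a single 3-D array instead.
if isinstance(shap_values, list):
    shap_values_churn = shap_values[1]
else:
    shap_values_churn = shap_values[:, :, 1]

shap.summary_plot(shap_values_churn, X_sample)       # global view for the churn class
```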
Recipe 4-20. Catboost Model Interpretation for Classification Scenario
Problem
You want to get an explanation for the catboost model–based binary classification problem.
Solution
The model explanation can be done using SHAP; however, one of the limitations of SHAP is that we cannot use the full data to create global and local explanations. Even if we decide to use the full data, it usually takes more time. Hence, to speed up the process of generating local and global explanations when millions of records are used to train a model, LIME is very useful. Catboost needs the number of iterations to be defined.
How It Works
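A hedged sketch for the catboost classifier is shown below; it uses CatBoost's native SHAP value computation, and the file name, target column, and hyperparameters are assumptions.

```python
import shap
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")                         # file name assumed
y = df["churn"]                                       # target column name assumed
X = df.drop(columns=["churn"])
cat_features = [c for c in X.columns if X[c].dtype == "object"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# CatBoost needs the number of boosting iterations to be defined.
model = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.05,
                           verbose=False, random_state=42)
model.fit(X_train, y_train, cat_features=cat_features)

# CatBoost can compute SHAP values natively; the last column holds the expected value.
test_pool = Pool(X_test, y_test, cat_features=cat_features)
shap_values = model.get_feature_importance(test_pool, type="ShapValues")

shap.summary_plot(shap_values[:, :-1], X_test)        # global interpretation
```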
Recipe 4-21. Local Explanations Using LIME
Problem
You want to get faster explanations from global and local explainable libraries.
Solution
The model explanation can be done using SHAP; however, one of the limitations of SHAP is that we cannot use the full data to create global and local explanations. Even if we decide to use the full data, it usually takes more time. Hence, to speed up the process of generating local and global explanations when millions of records are used to train a model, LIME is very useful.
How It Works
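A minimal sketch of a LIME local explanation for the churn classifier follows; the file name, target column, class names, and the random forest used as the underlying model are assumptions.

```python
import pandas as pd
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")                        # file name assumed
y = df["churn"]
X = pd.get_dummies(df.drop(columns=["churn"]))
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["no churn", "churn"],               # labels assumed
    mode="classification")

# Explain a single test record; LIME needs the probability function for classifiers.
exp = explainer.explain_instance(X_test.values[0], model.predict_proba, num_features=10)
exp.show_in_notebook(show_table=True)
```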
In a similar fashion, the graphs can be generated for different records from the training sample as well as the test sample.
Recipe 4-22. Model Explanations Using ELI5
Problem
You want to get model explanations using the ELI5 library.
Solution
ELI5 provides two functions, show_weights and show_prediction, to generate model explanations.
How It Works
Weight | Feature |
---|---|
0.3703 | total_day_minutes |
0.2426 | number_customer_service_calls |
0.1181 | total_day_charge |
0.0466 | total_eve_charge |
0.0427 | number_vmail_messages |
0.0305 | total_eve_minutes |
0.0264 | total_eve_calls |
0.0258 | total_intl_minutes |
0.0190 | total_night_minutes |
0.0180 | total_night_charge |
0.0139 | total_intl_charge |
0.0133 | area_code_tr |
0.0121 | total_day_calls |
0.0110 | total_intl_calls |
0.0077 | total_night_calls |
0.0019 | account_length |
Weight | Feature |
---|---|
0.0352 ± 0.0051 | total_day_minutes |
0.0250 ± 0.0006 | total_day_charge |
0.0121 ± 0.0024 | number_vmail_messages |
0.0110 ± 0.0051 | total_eve_charge |
0.0052 ± 0.0048 | total_night_minutes |
0.0028 ± 0.0025 | total_night_charge |
0.0023 ± 0.0009 | total_eve_calls |
0.0022 ± 0.0012 | number_customer_service_calls |
0.0022 ± 0.0018 | total_eve_minutes |
0.0019 ± 0.0012 | total_night_calls |
0.0018 ± 0.0015 | total_day_calls |
0.0017 ± 0.0019 | total_intl_minutes |
0.0011 ± 0.0016 | area_code_tr |
0.0008 ± 0.0012 | total_intl_charge |
0.0005 ± 0.0018 | total_intl_calls |
-0.0010 ± 0.0018 | account_length |
Recipe 4-23. Multiclass Classification Model Explanation
Problem
You want to get model explanations for multiclass classification problems.
Solution
The expectation for multiclass classification is to first build a robust model with categorical features, if any, and then explain the predictions. In a binary classification problem, we can get the probabilities, and sometimes we can get the feature importance corresponding to each class, from all kinds of ensemble models. Here is an example of a catboost model that can be used to generate the feature importance corresponding to each class in a multiclass classification problem.
How It Works
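A hedged sketch of per-class feature importance from a multiclass catboost model follows; the file name, the multiclass target column, and the hyperparameters are placeholders, while the per-class SHAP array shape comes from CatBoost's ShapValues output.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split

# A dataset with a multiclass target is assumed; the file and column names are placeholders.
df = pd.read_csv("multiclass_data.csv")
y = df["target"]                                     # three or more classes
X = df.drop(columns=["target"])
cat_features = [c for c in X.columns if X[c].dtype == "object"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = CatBoostClassifier(iterations=500, loss_function="MultiClass",
                           verbose=False, random_state=42)
model.fit(X_train, y_train, cat_features=cat_features)

# For multiclass models CatBoost returns SHAP values with shape
# (n_samples, n_classes, n_features + 1); the last column is the expected value.
test_pool = Pool(X_test, y_test, cat_features=cat_features)
shap_values = model.get_feature_importance(test_pool, type="ShapValues")

# Mean |SHAP value| per feature, reported separately for each class.
for k, class_label in enumerate(model.classes_):
    importance = pd.Series(np.abs(shap_values[:, k, :-1]).mean(axis=0),
                           index=X.columns).sort_values(ascending=False)
    print(f"Class {class_label}:\n{importance.head(10)}\n")
```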
Conclusion
In this chapter, we discussed ensemble model explanations. The models we covered were the explainable boosting regressor and classifier, the extreme gradient boosting regressor and classifier, the random forest regressor and classifier, and the catboost regressor and classifier. The graphs and charts may sometimes look similar, but they are different, for two reasons. First, the data points from SHAP that are available for plotting depend on the sample size selected to generate explanations. Second, the sample models are trained with fewer iterations and with basic hyperparameters; with a higher-configuration machine, full hyperparameter tuning can happen, and better SHAP values can be produced.
In the next chapter, we will cover the explainability for natural language–based tasks such as text classification and sentiment analysis and explain the predictions.