Interpretable Machine Learning (IML) with the Pima Indians and Titanic datasets#

Source: Parul Pandey on Kaggle

Source: interpretable-ml-book

In his book ‘Interpretable Machine Learning’, Christoph Molnar captures the essence of ML interpretability with this example: imagine you are a data scientist and, in your free time, you try to predict where your friends will go on vacation this summer based on their Facebook and Twitter data. If the predictions turn out to be accurate, your friends might be impressed and consider you a magician who can see the future.

If the predictions are wrong, no one is harmed except your reputation as a “data scientist”. Now suppose it isn’t just a fun project and there is money at stake: say you want to invest in properties where your friends are likely to holiday. What happens if the model’s predictions go awry? You lose money. As long as a model has no significant impact, its interpretability doesn’t matter much; but when a model’s predictions carry financial or social consequences, interpretability becomes essential.

Importance of interpretability#

A question people often ask is: why aren’t we content with the model’s results alone, and why are we so keen to know why a particular decision was made? A lot of this has to do with the impact a model can have in the real world. A model that merely recommends movies has far less impact than one that predicts the outcome of a drug.

The big picture of explainable machine learning.

Interpretability brings several benefits, such as helping to debug models, guiding feature engineering, and building trust in a model’s predictions.

Model Explainability techniques#

Let’s discuss a few techniques that help extract such insights from a model. In the order in which they are discussed, they are:

ELI5 library

Partial Dependence Plots

SHAP Values

Advanced Uses of SHAP Values

SHAP Summary Plots

LIME

Objective#

The objective of this notebook is to use these techniques to explain how different features affect the prediction of whether or not a patient has diabetes, based on certain diagnostic measurements.

# Loading necessary libraries

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from matplotlib import pyplot as plt
%matplotlib inline

import lightgbm as lgb
# Reading in the data
df = pd.read_csv('./data/diabetes.csv')
df.head()
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
# Creating the feature matrix and target column, and splitting the dataset into train and test sets.

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

1. Permutation Importance using the ELI5 library#

What features does a model think are important? Which features have a greater impact on the model’s predictions than others? This concept is called feature importance, and permutation importance is a widely used technique for calculating it. It helps us see when our model produces counterintuitive results, and it helps show others when our model is working as we’d hope.

Permutation importance works for many scikit-learn estimators. The idea is simple: randomly permute (shuffle) a single column in the validation dataset, leaving all the other columns intact. A feature is considered “important” if shuffling it causes a large drop in the model’s accuracy (i.e., an increase in error). Conversely, a feature is considered “unimportant” if shuffling its values doesn’t affect the model’s accuracy.

  • Permutation importance is useful for debugging, understanding your model, and communicating a high-level overview of your model.

  • Permutation importance is calculated after a model has been fitted. So, let’s train and fit a RandomForestClassifier model, denoted my_model, on the training data.

  • Permutation importance is calculated here with the ELI5 library. ELI5 is a Python library that lets you visualize and debug various machine learning models through a unified API. It has built-in support for several ML frameworks and provides a way to explain black-box models.

# Training and fitting a Random Forest Model
my_model = RandomForestClassifier(n_estimators=100,
                                  random_state=0).fit(X_train, y_train)
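To make the idea concrete before calling a library, here is a minimal hand-rolled sketch: score the fitted model on the untouched validation set, shuffle one column, and see how much the accuracy drops. This is only an illustration of the concept; the ELI5 call below repeats the same procedure systematically for every feature.

# Illustrative sketch of the permutation idea for a single feature
baseline_acc = my_model.score(X_test, y_test)            # accuracy on untouched data

X_shuffled = X_test.copy()
X_shuffled['Glucose'] = np.random.RandomState(1).permutation(X_shuffled['Glucose'].values)
shuffled_acc = my_model.score(X_shuffled, y_test)        # accuracy after shuffling 'Glucose'

print(f"Accuracy drop from shuffling 'Glucose': {baseline_acc - shuffled_acc:.3f}")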
# Calculating and Displaying importance using the eli5 library
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model, random_state=1).fit(X_test,y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[6], line 2
----> 2 import eli5

ImportError: cannot import name 'if_delegate_has_method' from 'sklearn.utils.metaestimators'

(The eli5 release installed in this build is not compatible with recent scikit-learn versions, which removed if_delegate_has_method, so the importance table is not rendered here.)
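If eli5 cannot be imported in your environment (as above), the same permutation importance can be computed with scikit-learn’s own sklearn.inspection.permutation_importance. This is a fallback sketch, not the ELI5 API, so the printed table looks slightly different from eli5.show_weights.

# Fallback: permutation importance with scikit-learn itself
from sklearn.inspection import permutation_importance

result = permutation_importance(my_model, X_test, y_test, n_repeats=10, random_state=1)

# Sort features by mean importance, mirroring the layout of eli5.show_weights
for idx in result.importances_mean.argsort()[::-1]:
    print(f"{X_test.columns[idx]:<25} "
          f"{result.importances_mean[idx]:.4f} ± {result.importances_std[idx]:.4f}")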

Interpretation

  • The features at the top are the most important and those at the bottom the least. For this example, glucose level is the most important feature in deciding whether a person will have diabetes, which also makes sense.

  • The number after the ± measures how performance varied from one reshuffling to the next.

  • Some weights are negative. This happens when predictions on the shuffled data turn out to be slightly more accurate than on the real data, which occurs by chance and is typical of features with little or no true importance.

2. Partial Dependence Plots#

The partial dependence plot (PDP or PD plot) shows the marginal effect that one or two features have on the predicted outcome of a machine learning model (J. H. Friedman, 2001). PDPs show how a feature affects predictions, and they can show the relationship between the target and the selected features via 1D or 2D plots.

# training and fitting a Decision Tree
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                 'BMI', 'DiabetesPedigreeFunction', 'Age']
# Let's plot the decision tree
# (source: https://www.kaggle.com/willkoehrsen/visualize-a-decision-tree-w-python-scikit-learn)
# Since there are a lot of attributes, it is difficult to make sense of the decision tree graph in this notebook;
# it is advisable to export it as a PNG and view it separately.

from sklearn import tree
import graphviz

tree_graph = tree.export_graphviz(tree_model, out_file=None, feature_names=feature_names,filled = True)
graphviz.Source(tree_graph)
(Rendered decision tree graph for tree_model)

The library used here for plotting PDPs is the Python partial dependence plot toolbox, or simply PDPbox. PDPs are also calculated after a model has been fitted. Our dataset has many features, such as Glucose, Blood Pressure, and Age. We start by considering a single row.

We proceed by fitting our model and calculating the probability of a person having diabetes, which is our target variable. Next, we choose a feature and repeatedly alter its value: for instance, we compute the prediction when the person’s insulin level is 50, 100, 150, and so on. Averaging these predictions over the rows and plotting them against the feature’s value gives the partial dependence curve. (https://pdpbox.readthedocs.io/en/latest/)

# Note: this uses the pdpbox 0.2.x API (pdp_isolate / pdp_plot); newer releases renamed these functions.
from pdpbox import pdp

# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=X_test, model_features=feature_names, feature='Glucose')

# plot it
pdp.pdp_plot(pdp_goals, 'Glucose')
plt.show()
(With the pdpbox version installed in this build, this cell raises an error because the 0.2-style pdp_isolate / pdp_plot API is no longer available, so the plots are not rendered here.)
# Create the data that we will plot
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=X_test, model_features=feature_names, feature='Insulin')

# plot it
pdp.pdp_plot(pdp_goals, 'Insulin')
plt.show()
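If the installed pdpbox no longer exposes pdp_isolate / pdp_plot, an equivalent one-feature partial dependence plot can be drawn with scikit-learn’s own PartialDependenceDisplay. This is a sketch of a substitute, not the PDPbox API, so the styling differs from the plots described below.

# Fallback: 1D partial dependence plots with scikit-learn
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(tree_model, X_test, features=['Glucose', 'Insulin'])
plt.show()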

Interpretation

  • The y-axis represents the change in prediction relative to the prediction at the baseline (leftmost) value.

  • The blue area denotes the confidence interval.

  • For the ‘Glucose’ graph, we observe that the probability of a person having diabetes increases steeply as the glucose level goes beyond 140, and the probability then remains high.

We can also visualize the partial dependence of two features at once using 2D Partial plots.

features_to_plot = ['Glucose','Insulin']
inter1  =  pdp.pdp_interact(model=tree_model, dataset=X_test, model_features=feature_names, features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour', plot_pdp=True)
plt.show()
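The two-feature interaction can likewise be approximated with scikit-learn by passing a feature pair; again, this is a hedged substitute for the PDPbox contour plot above, not the pdpbox API itself.

# Fallback: 2D partial dependence (interaction) plot with scikit-learn
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(tree_model, X_test, features=[('Glucose', 'Insulin')])
plt.show()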

3. SHAP Values#

SHAP, which stands for SHapley Additive exPlanations, helps break down a prediction to show the impact of each feature. It is based on Shapley values, a technique from game theory for determining how much each player in a collaborative game has contributed to its success. Normally, getting the trade-off between accuracy and interpretability right is a difficult balancing act, but SHAP values can deliver both.

SHAP values interpret the impact of having a certain value for a given feature in comparison to the prediction we’d make if that feature took some baseline value.

SHAP values show how much a given feature changed our prediction, compared to the prediction we would make if that feature took some baseline value. Say we want to know how the prediction changes when the insulin level is 150 instead of some fixed baseline value. If we can answer this for one feature, we can repeat it for every feature, and the contributions add up as follows:

sum(SHAP values for all features) = prediction_for_this_row - prediction_at_baseline_values

row_to_show = 10
data_for_prediction = X_test.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)

tree_model.predict_proba(data_for_prediction_array)
array([[0., 1.]])

SHAP values are calculated with the shap library, which can be installed easily from PyPI or conda.

import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(tree_model)

# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)
(Interactive force plot; the JavaScript visualization is omitted in this static rendering. Run shap.initjs() in a trusted notebook to see it.)
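As a quick sanity check on the additive relationship stated above, the base value plus the SHAP values for this row should reproduce the model’s predicted probability. A minimal sketch, assuming the list-per-class output of shap.TreeExplainer used in this notebook:

# Base value + sum of SHAP values should (approximately) equal the predicted probability of class 1
# (assumes shap_values is a list with one array per class, as in older shap releases)
reconstructed = explainer.expected_value[1] + shap_values[1].sum()
print(reconstructed, tree_model.predict_proba(data_for_prediction_array)[0, 1])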

Interpretation

The explanation above shows each feature contributing to push the model output from the base value (the average model output over the training dataset we passed) towards the actual prediction. Features pushing the prediction higher are shown in red; those pushing it lower are shown in blue.

  • The base_value here is 0.3576 while our predicted value is 1.

  • Glucose = 158 has the biggest impact on increasing the prediction, while

  • the Age feature has the biggest effect in decreasing the prediction.

4. Advanced Uses of SHAP Values#

Aggregating many SHAP values can provide even more detailed insights into the model.

  • SHAP Summary Plots

To get an overview of which features are most important for a model, we can plot the SHAP values of every feature for every sample. The summary plot shows which features are most important, as well as their range of effects over the dataset.

import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(tree_model)

# calculate shap values. This is what we will plot.
# Calculate shap_values for all of val_X rather than a single row, to have more data for plot.
shap_values = explainer.shap_values(X_test)

# Make plot. shap_values[1] selects the SHAP values for the positive class (Outcome = 1).
shap.summary_plot(shap_values[1], X_test)
(SHAP summary plot for the positive class)

For every dot:

  • Vertical location shows what feature it is depicting

  • Color shows whether that feature was high or low for that row of the dataset

  • Horizontal location shows whether the effect of that value caused a higher or lower prediction.

The point in the upper left depicts a person whose glucose level is low, which reduces the predicted probability of diabetes by about 0.4.
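For a more compact, importance-style view, the same SHAP values can be aggregated into a bar chart of mean absolute impact per feature. A small optional sketch reusing the shap_values computed above:

# Mean absolute SHAP value per feature, shown as a bar chart
shap.summary_plot(shap_values[1], X_test, plot_type="bar")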

5. LIME: Locally Interpretable Model-Agnostic Explanations#

The LIME framework was introduced in the paper “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. The paper describes an algorithm that can explain the predictions of an ML model (classification or regression) in a faithful way. LIME essentially approximates the predictions of a black-box model locally with an interpretable surrogate model.

LIME attempts to play the role of the ‘explainer’, explaining predictions for each data sample.

To make it more understandable, I shall be using the Titanic dataset for this part of the notebook.

import lime
from lime.lime_tabular import LimeTabularExplainer

# Reading in the data
df_titanic = pd.read_csv('./data/titanic/train.csv')
df_titanic.drop('PassengerId',axis=1,inplace=True)

# Filling the missing values
df_titanic.fillna(0,inplace=True)

# label encoding categorical data
le = LabelEncoder()

df_titanic['Pclass_le'] = le.fit_transform(df_titanic['Pclass'])
df_titanic['SibSp_le'] = le.fit_transform(df_titanic['SibSp'])
df_titanic['Sex_le'] = le.fit_transform(df_titanic['Sex'])

cat_col = ['Pclass_le', 'Sex_le','SibSp_le', 'Parch','Fare']

# create validation set
X_train,X_test,y_train,y_test = train_test_split(df_titanic[cat_col],df_titanic[['Survived']],test_size=0.3)


# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test)


# training the lightgbm model
lgb_params = {
    'task': 'train',
    'boosting_type': 'goss',
    'objective': 'binary',
    'metric': {'l2', 'auc'},
    'num_leaves': 50,
    'learning_rate': 0.1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'verbose': -1,
    'num_iteration': 100,
    'num_threads': 7,
    'max_depth': 12,
    'min_data_in_leaf': 100,
    'alpha': 0.5}

# Newer LightGBM versions dropped the early_stopping_rounds keyword;
# early stopping is now supplied as a callback instead.
model = lgb.train(lgb_params, lgb_train, num_boost_round=10,
                  valid_sets=[lgb_eval],
                  callbacks=[lgb.early_stopping(stopping_rounds=5)])



# LIME requires class probabilities in case of classification example
def prob(data):
    return np.array(list(zip(1-model.predict(data),model.predict(data))))
    
explainer = lime.lime_tabular.LimeTabularExplainer(
    df_titanic[model.feature_name()].astype(int).values,
    mode='classification',
    training_labels=df_titanic['Survived'],
    feature_names=model.feature_name())


# Obtain the explanations from LIME for particular values in the validation dataset
i = 1
exp = explainer.explain_instance(df_titanic.loc[i,cat_col].astype(int).values, prob, num_features=5)

exp.show_in_notebook(show_table=True)

For the first observation, LIME explains why the particular probability was assigned. The classifier predicts that this person’s probability of surviving is 64%. Features shown in blue contribute negatively to the positive class (survival), and those shown in orange contribute positively.
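The explanation can also be read off programmatically; a small sketch using the explanation object’s as_list method, which returns (feature condition, weight) pairs for the explained instance:

# Inspect the LIME explanation as (feature condition, weight) pairs
for feature_rule, weight in exp.as_list():
    print(f"{feature_rule:<25} {weight:+.3f}")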

Conclusion#

Machine learning doesn’t have to be a black box anymore. What use is a good model if we cannot explain its results to others? Interpretability is as important as building the model itself. To achieve wider acceptance, it is crucial that machine learning systems can provide satisfactory explanations for their decisions. As Albert Einstein said, “If you can’t explain it simply, you don’t understand it well enough.”

References#