Understanding Shapley Values and SHAP

Understanding Shapley Values and SHAP

SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. In other words, it is a model-agnostic method that considers the model as a black box and only access the input and output of the model. It tries to explain the output of the model for a specific instance by computing the contribution of each component of the input to the output. The components can be features/columns in a tabular dataset, a set of pixels in an image, a word/token in a text document, etc. In this article, we first introduce the concept of Shapley values and then explain how SHAP uses Shapley values to explain the predictions of machine learning models.

Intuition

To be honest, “Shapley Values” is a term that I have been hearing a lot in the literature, but I was afraid to dive into it because it sounded too complex and there wasn’t a straightforward explanation of what it is and how it works. Recently, my supervisor shared me an article from Towards Data Science that explained Shapley values comprehensively and I thought it was time to understand what Shapley values are, once and for all! However, the article was too long (around 120 pages in a PDF format) and I think many people would be discouraged to read it. So, I decided to write a blog post to summarize the key concepts which I learned from until now and share a more straightforward explanation of Shapley values. My explanation may not be as comprehensive (or even accurate) as the original article, but I will try to make it as simple as possible to allow everyone to easily understand this great concept.

Shapley Values

To understand how Shapley values can explain the predictions of a machine learning model, we first need to understand the concept of Shapley values which come from cooperative game theory. This mathematical concept introduced by Lloyd Shapley in 1953 (and he was awarded the Nobel Prize in Economic Sciences in 2012 for this work) is used to fairly distribute the total gains of a coalition of players in a cooperative game among the players. What do we mean by “coalition of players” and “cooperative game”? Let’s break it down with an example.

Example

Imagine that you gathered some of your friends to record an album together. Some of them are good at playing an instrument, some in singing, some in mixing, some may be helpful in providing recording equipment, marketing, etc. Each of them has a different role in the process of recording and releasing the album. When the album is released, it will generate some revenue. Let’s say that the album is a hit and it generates €100,000 in revenue. Now, the question is: how should we distribute this revenue among the friends who contributed to the album? Yes, you can distribute it equally among all friends, but is it fair? Should the friend who played the drums get the same share as the friend who provided the recording equipment? Each friend had an impact on the success of the album, but their contributions are different. This is where Shapley values come into play.

Definitions

Imagine we have $M$ playes (numbered from 1 to $M$) and let $F$ be the set of all players ($F = \{1, 2, \ldots, M\}$). A coalition is a subset of players $S \subseteq F$ which can be any combination of players. The set $F$ itself is also a coalition, which commonly referred to as the grand coalition. We can have $2^M$ possible coalitions in total (the empty set $\emptyset$ is also a coalition that has no players).

The worth of a coalition is a real number that represents the total gains that the coalition can achieve and it is denoted by $v(S)$ for a coalition $S$. Note that $v$ is a function that takes a coalition as input and returns a real number as output, and commonly referred to as the characteristic function of the game. The worth of the grand coalition is the total gains that all players can achieve together, i.e., $v(F)$. The worth of the empty set is zero, i.e., $v(\emptyset) = 0$. The worth of a coalition can be calculated in different ways depending on the game. For example, in our album example, the worth of a coalition can be the revenue generated by the album when the friends in the coalition work together.

Marginal Contribution of a Player

Now the question is how we can compute the contribution of a player to the total gains of a coalition. Suppose that we have the following table and we assigned a number to each player.

Player NamePlayer Number
$Thom$$1$
$Colin$$2$
$Jonny$$3$
$Ed$$4$
$Phil$$5$
$Nigel$$6$

Let’s say we start with the empty set and add players one by one to form a coalition until we reach the grand coalition. For example, we can start with the empty set $\emptyset$ and add $Thom$ to form the coalition $\{1\}$, then add $Colin$ to form the coalition $\{1, 2\}$, and we continue this process until we reach the grand coalition $\{1, 2, 3, 4, 5, 6\}$. The total gain increases as we add more players to the coalition. Suppose the current coalition is ${1, 2, 3}$ and we want to add player $4$ to the coalition. Now, what is the marginal contribution of player $4$ to the total gains of the coalition $\{1, 2, 3, 4\}$? We can simply calculate the total gains of the coalition $\{1, 2, 3, 4\}$ and subtract the total gains of the coalition $\{1, 2, 3\}$ from it. This difference is the contribution of player $4$.

$$ \text{Marginal Contribution of Player 4} = v(\{1, 2, 3, 4\}) - v(\{1, 2, 3\}) $$
Order of adding players to the coalition is important! Suppose $Thom$ ($1$) have started the band and recorded some songs. However, when $Colin$ ($2$) joined the band, they re-recorded some songs and their revenue increased by €10,000 (which is the contribution of $Colin$). Now, when $Jonny$ ($3$) joined the band, they re-recorded some songs again and their revenue increased by just €2,000. However, $Jonny$ feels unfair when comparing his contribution to $Colin$’s contribution. He believes that if he had joined the band before $Colin$, the revenue would have also increased by €10,000. So, we need to consider all possible orders of adding players to the coalition to calculate the contribution of a player fairly.

Expected Marginal Contribution / Shapley Value

We’ve seen how we can calculate the contribution of a player to the total gains of a specific coalition. However, as we mentioned earlier, the contribution of a player depends on the order of adding players to the coalition and we need to consider all possible orders. To do this, we compute the marginal contribution of a player for all possible permutations of players ($F$) and take the average of these contributions. This Expected Marginal Contribution is called the Shapley value of the player. As we will have $|F|!$ permutations in total, the Shapley value of player $i$, denoted by $\phi_i$, can be calculated as follows:

$$ \phi_i = \frac{1}{|F|!} \sum_{p \in P} \left( v(S \cup \{i\}) - v(S) \right) $$

In this formula, $P$ is the set of all permutations of players, and $S$ is the coalition related to the permutation $p \in P$. The following table shows how we can calculate the Shapley value of player $4$ for each permutation of players.

PermutationMarginal Contribution of Player 4
$[1, 2, 3, 4, 5, 6]$$v(\{1, 2, 3, 4\}) - v(\{1, 2, 3\})$
$[2, 1, 3, 4, 5, 6]$$v(\{1, 2, 3, 4\}) - v(\{1, 2, 3\})$
$[3, 2, 1, 4, 5, 6]$$v(\{1, 2, 3, 4\}) - v(\{1, 2, 3\})$
$\vdots$$\vdots$
$[4, 1, 2, 3, 5, 6]$$v(\{4\}) - v(\emptyset)$
$\vdots$$\vdots$
$[6, 5, 4, 3, 2, 1]$$v(\{4, 5, 6\}) - v(\{5, 6\})$
The characteristic function $v$ takes a coalition/set as input, not a permutation. This is because the worth of a coalition is independent of the order of players in the coalition. The worth of the coalition of $Thom$ and $Colin$ is the same as the coalition of $Colin$ and $Thom$.

Now, we take the average of these contributions to calculate the Shapley value of player $4$. As we can see, some permutations have the same marginal contribution since their coalitions are the same. So, why don’t we make it simpler and only calculate the distinct values of contributions and multiply them by the number of times they have been repeated in the permutations?

To do this, we need to find how many permutations can be formed from each coalition. If you look at the table above, you can see that the term $v(\{1, 2, 3, 4\}) - v(\{1, 2, 3\})$ is repeated for all permutations that the player $4$ is added after the players $1$, $2$, and $3$. So, if $S=\{1, 2, 3\}$, the player $4$ can be added after theses players in $|S|! = 3! = 6$ different ways. Furthermore, the remaining players can be added to the coalition in $(|F|-|S|-1)! = (6-3-1)! = 2!$ different ways. So, the Shapley value of player $4$ can be calculated as follows:

$$ \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \cdot (|F|-|S|-1)!}{|F|!} \left( v(S \cup \{i\}) - v(S) \right) $$

Axioms of Shapley Values

Shapley values have some provable properties that make them unique and fair. This is because of these axioms that Shapley values are mathematically strong and widely accepted. The axioms of Shapley values are as follows:

🎯 Efficiency

The sum of Shapley values of all players should be equal to the total gains of the grand coalition. In other words, if we sum the Shapley values of all players, it should be equal to the worth of the case where all players work together.

$$ \sum_{i=1}^{|F|} \phi_i = v(F) $$

⚖️ Symmetry

If two players have the same contribution to every possible coalition, their Shapley values should be the same.

$$ \text{If } v(S \cup \{i\}) = v(S \cup \{j\}) \text{ for all } S \subseteq F \setminus \{i, j\}, \text{ then } \phi_i = \phi_j $$

⭕ Dummy/Null Player

If a player has no contribution to any coalition, its Shapley value should be zero.

$$ \text{If } v(S \cup \{i\}) = v(S) \text{ for all } S \subseteq F \setminus \{i\}, \text{ then } \phi_i = 0 $$

➕ Additivity

If we have two games (or two characteristic functions) and we calculate the Shapley values of players for each game separately, the Shapley value of a player in the combined game should be the sum of the Shapley values of the player in each game. This axiom is based on the assumption that any game played are independent of each other.

$$ \phi_i(u+v) = \phi_i(u) + \phi_i(v) $$

Shapley Values in Machine Learning

Now, let’s see how Shapley values can be used to explain the predictions of machine learning models. Suppose we have a training tabular dataset with $N$ samples and $M$ features. The following table shows the features ($X_1, X_2, \ldots, X_M$) and the target variable ($y$) of the dataset.

$X_1$$X_2$$\ldots$$X_M$$y$
$x_1^{(1)}$$x_2^{(1)}$$\ldots$$x_M^{(1)}$$y^{(1)}$
$x_1^{(2)}$$x_2^{(2)}$$\ldots$$x_M^{(2)}$$y^{(2)}$
$\vdots$$\vdots$$\vdots$$\vdots$$\vdots$
$x_1^{(N)}$$x_2^{(N)}$$\ldots$$x_M^{(N)}$$y^{(N)}$

We train a model ($f$) on this dataset and we want to explain the prediction of the model for a specific sample ($x$). The model/function $f$ takes the input sample $x$ and returns the output $f(x)$. By comparing $f(x)$ with the true target value $y$, we can calculate the prediction error of the model for the sample $x$. The question is: how can we explain the prediction of the model for the sample $x$? Which features of the sample $x$ have the most impact on the prediction of the model? This is where Shapley values come into play.

We can consider the model as a cooperative game where the players are the $M$ features. However, what should be the worth of a coalition in this game? We can define it using the $f(x)$ with a slight modification. As you may remember, the characteristic function $v$ should output $0$ when the coalition is empty. But how can the model predict the output when there are no features in the coalition? Most of the ML models doesn’t support NA values, so we need to modify the characteristic function to handle this case. We can use a sample of training data (or all of them) and take the average of the predictions of the model for these samples as our best guess. Suppose we have a sample of $k \leq N$ samples, our prediction when we have no features in the coalition can be calculated as follows:

$$ f(x) = f(\text{NA}, \text{NA}, \ldots, \text{NA}) = \frac{1}{k} \sum_{i=1}^k f(x^{(i)}) $$

Now, we can define the characteristic function $v$ as follows:

$$ v(F) = v({X_1, X_2, \ldots, X_M}) = f(x) - E[f(x)] = f(x) - \frac{1}{k} \sum_{i=1}^k f(x^{(i)}) $$

$E(f(x))$ is the expected value of the model’s prediction when we have no features in the coalition. The worth of the grand coalition is the difference between the model’s prediction for the sample $x$ and the average prediction of the model for the samples in the training dataset. The worth of the empty set is zero, i.e., $v(\emptyset) = 0$ (as $f(x) = E[f(x)]$ when we have no features).

How can we apply the model to a subset of its original features (a coalition)? For example, if we have the coalition $S=\{X_{s1}, X_{s2}, \ldots, X_{sp}\}$, we need the marginal value of $f$ for these features which is called $f_S(x_S)$.

$$ f_S(x_S) = f_S(x_{s1}, x_{s2}, \ldots, x_{sp}) $$

There is two possible ways to do this:

  1. We can retrain the same type of model on the features present in the coalition $S$
  2. We can use the original model $f$ to calculate $f_S$ $\rightarrow$ Replace the features which are not available with NA values

Let’s assume that the model accepts NA values and see how the Shapley values can be calculated, then discuss how we can handle the case when the model doesn’t support NA values. The worth of the coalition $S$ can be calculated as follows:

$$ v(S) = v(\{x_{s1}, x_{s2}, \ldots, x_{sp}\}) = f_S(x_S) - E[f(x)] $$

To calculate the Shapley value of the feature $X_i$, we need to consider all possible permutations of features and calculate the marginal contribution of the feature $X_i$ for each permutation. The Shapley value of the feature $X_i$ can be calculated as follows:

$$ \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \cdot (|F|-|S|-1)!}{|F|!} \left( f_{S\cup \{i\}}(x_{S \cup \{i\}}) - E[f(x)] - (f_S(x_S) - E[f(x)] \right) $$$$ \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \cdot (|F|-|S|-1)!}{|F|!} \left( f_{S\cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right) $$
🎯 Efficiency Property

If we use the efficiency property of Shapley values, we’ll find out that the sum of Shapley values of all features should be equal to the difference between the model’s prediction for the sample $x$ and the average prediction of the model for the samples in the training dataset.

$$ \sum_{i=1}^{|F|} \phi_i = v(F) = f(x) - E[f(x)] $$

Explainer Model

As we are moving towards answering the question of how we can calculate and use Shapley values to explain the predictions of machine learning models, we need to discuss the concept of the explainer model and formulate it. The explainer ($g$) is an interpretable model that takes $|F| = M$ binary variables as a vector input ($z'$).

$$ g(z') = g(z'_1, z'_2, \ldots, z'_M) \quad z'_i \in \{0, 1\} $$

The coalition vector $z'$ represents a coalition of the available values of $x$. Those elements that are NA in the original sample $x$ are replaced with $0$ in the vector $z'$ and the rest of the elements are replaced with $0$ or $1$. For example, suppose we have five features and the second feature for a given sample is NA:

$$ x = [x_1, NA, x_3, x_4, x_5] $$

We can define $x'$ as a simplified input features to show if a feature is present or not in the sample $x$:

$$ x' = [1, 0, 1, 1, 1] $$

Furthermore, assume that there is a mapping function $h_x$ that maps the coalition vector $x'$ to the original sample $x$:

$$ h_x(x') = x$$

Now, we can define any coalition vector $z'$. For example, the following coalition vector represents the coalition $S = \{X_1, X_3, X_4\}$:

$$ z' = [1, 0, 1, 1, 0] $$$$ z = h_x(z') = [x_1, NA, x_3, x_4, NA] $$

We want the prediction of $f$ for $z$ to be as close as possible to the prediction of $g$ for $z'$:

$$ g(z') \approx f(h_x(z')) \quad \text{whenever} \enspace z' \approx x' $$

Additive Feature Attribution Method

We can classify the explaination methods based on $g$. For instance, if $g$ is a linear model, we call it Additive Feature Attribution Method. Suppose $c_i$ are some constants, then the prediction of $g$ can be calculated as follows:

$$ g(z') = \phi_0 + \phi_1 z'_1 + \phi_2 z'_2 + \ldots + \phi_M z'_M = \phi_0 + \sum_{i=1}^M \phi_i z'_i $$

We can compute the Shapley values as follows:

$$ \phi_i(f,x) = \sum_{z' \subseteq x'} \frac{|z' - 1|! \cdot (|x'|-|z'|)!}{|x'|!} \left( f_x(z') - f_x(z' \setminus \{i\}) \right) $$

Note that the $\phi_0$ is the average prediction of the model for the samples in the training dataset ($E[f(x)]$).

As model $g$ can mimic $f$ perfectly for a single prediction ($f(X)$) and is linear (therefore, interpretable), we can use it as an explainer model for $f$.

SHAP

In the above section, we assumed that the model $f$ can handle NA values and we can calculate the Shapley values directly. However, in practice most of the machine learning models don’t support NA values and we need to find a way to calculate the Shapley values without using NA values. SHAP (SHapley Additive exPlanations) is an additive feature attribution method which proposes to use a conditional probability distribution to estimate the Shapley values.

$$ f_x(z') f(h_x(z'))= E[f(z) | z_s] $$

However, how can we calculate the conditional expectations? There are different ways to do this. Let’s start with the simplest one. Suppose $x_{\bar{S}}$ denotes the part of original features which are not in the coalition $S$ and $f(x_{\bar{S}}, x_S)$ means that some of the parameters of $f$ belong to $x_S$ and the rest belong to $x_{\bar{S}}$.

$$ f_S(x_S) = E[f(x) | x_S] = E[f(x_{\bar{S}}, x_S) | x_S] = \int f(x_{\bar{S}}, x_S) P(x_{\bar{S}} | x_S) dx_{\bar{S}} $$

We assume that the features are independent, as we don’t know the distribution of the features. Therefore, we can write the above equation as follows:

$$ f_S(x_S) = \int f(x_{\bar{S}}, x_S) P(x_{\bar{S}}) dx_{\bar{S}} $$

We can approximate this integral with a sum over the a subset of the training samples.

$$ f_S(x_S) \approx E[f(x) | x_S] \approx \frac{1}{k} \sum_{i=1}^k f(x_{\bar{S}}^{(i)}, x_S) $$

Let’s see how this works with an example. Suppose the model is trained on five features ($X_1$ to $X_5$) and the coalition is $S = \{X_1, X_3, X_4\}$. So, $X_{\bar{S}} = \{X_2, X_5\}$. The feature vector $x$ is as follows:

$$ x = [x_1, \text{NA}, x_3, x_4, \text{NA}] $$

To calculate $f(X)$, we need a value for the NA values. We borrow this value from the training samples. Suppose $i$-th training sample has the following values:

$$ x^{(i)} = [x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, x_4^{(i)}, x_5^{(i)}] $$

Then, we can calculate the prediction of the model for the sample $x$ as follows:

$$ f(x) = f([x_1, x_2^{(i)}, x_3, x_4, x_5^{(i)}]) $$

We do this for all selected training samples and take the average of these predictions to calculate the Shapley values.

Sample$x_1$$x_2$$x_3$$x_4$$x_5$$f(x)$
$1$$x_1^{(1)}$$x_2^{(1)}$$x_3^{(1)}$$x_4^{(1)}$$x_5^{(1)}$$f([x_1, x_2^{(1)}, x_3, x_4, x_5^{(1)}])$
$2$$x_1^{(2)}$$x_2^{(2)}$$x_3^{(2)}$$x_4^{(2)}$$x_5^{(2)}$$f([x_1, x_2^{(2)}, x_3, x_4, x_5^{(2)}])$
$3$$x_1^{(3)}$$x_2^{(3)}$$x_3^{(3)}$$x_4^{(3)}$$x_5^{(3)}$$f([x_1, x_2^{(3)}, x_3, x_4, x_5^{(3)}])$
$\vdots$$\vdots$$\vdots$$\vdots$$\vdots$
$k$$x_1^{(k)}$$x_2^{(k)}$$x_3^{(k)}$$x_4^{(k)}$$x_5^{(k)}$$f([x_1, x_2^{(k)}, x_3, x_4, x_5^{(k)}])$
$\text{Average}$$\frac{1}{k} \sum_{i=1}^k f([x_1, x_2^{(i)}, x_3, x_4, x_5^{(i)})$

Linear SHAP (without dependence)

Suppose the model $f$ is a linear regression model which the features are independent. The prediction of the model can be calculated as follows:

$$ f(x) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_M x_M = \sum_{i=0}^M w_i x_i \quad x_0 = 1 $$

The Shapley value of the feature $X_i$ can be calculated as follows:

$$ \phi_i = c_i x_i - c_i \frac{1}{k} \sum_{j=1}^k x_i^{(j)} $$

Linear SHAP (with dependence)

We can’t use the above formula for a linear model with feature dependence. In this case, we should calculate the SHAP values using all the possible coalitions.

Kernel SHAP

To be continued… [See resources]

Tree SHAP

To be continued… [See resources]

🗂️ Resources

📰 Blog Posts

🎞️ Videos

👨‍💻 Coding