Model with conditional interaction
In this example, we show the heterogeneity of the global effects, using PDP, ALE, and RHALE, on a model with conditional interactions. We will use the following model:
where the features \(x_1, x_2, x_3\) are independent and uniformly distributed in the interval \([-1, 1]\). The model has an interaction between \(x_1\) and \(x_2\), caused by the term \(f_{1,2}(x_1, x_2) = -x_1^2 \mathbb{1}_{x_2 <0} + x_1^2 \mathbb{1}_{x_2 \geq 0}\). This means that the effect of \(x_1\) on the output \(y\) depends on the value of \(x_2\) and vice versa. Terms like this introduce heterogeneity. Each global effect method has a different formula for quantifying such heterogeneity; below, we will see how PDP, ALE, and RHALE handle it.
In contrast, \(x_3\) does not interact with any other feature, so its global effect has zero heterogeneity.
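To make the conditional interaction concrete, here is a minimal numpy sketch of just the interaction term \(f_{1,2}\) defined above; it is not the full effector.models.ConditionalInteraction model, which also includes the \(x_3\) contribution.
import numpy as np

def f_12(x1, x2):
    # f_{1,2}(x1, x2) = -x1^2 * 1[x2 < 0] + x1^2 * 1[x2 >= 0]
    return np.where(x2 < 0, -x1**2, x1**2)

print(f_12(0.5, -0.2), f_12(0.5, 0.2))  # -0.25 vs 0.25: the effect of x1 flips sign with x2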
import numpy as np
import matplotlib.pyplot as plt
import effector
np.random.seed(21)
model = effector.models.ConditionalInteraction()
dataset = effector.datasets.IndependentUniform(dim=3, low=-1, high=1)
x = dataset.generate_data(10_000)
PDP
Effector
Let's see below the PDP heterogeneity for each feature, using effector.
pdp = effector.PDP(x, model.predict, dataset.axis_limits)
pdp.fit(features="all", centering=True)
for feature in [0, 1, 2]:
    pdp.plot(feature=feature, centering=True, heterogeneity=True, y_limits=[-2, 2])
for feature in [0, 1, 2]:
    pdp.plot(feature=feature, centering=True, heterogeneity="ice", y_limits=[-2, 2])
pdp = effector.PDP(x, model.predict, dataset.axis_limits, nof_instances="all")
pdp.fit(features="all", centering=True)
heter_per_feat = []
for feature in [0, 1, 2]:
    y_mean, y_var = pdp.eval(feature=feature, xs=np.linspace(-1, 1, 100), centering=True, heterogeneity=True)
    print(f"Heterogeneity of x_{feature}: {y_var.mean():.3f}")
    heter_per_feat.append(y_var.mean())
Heterogeneity of x_0: 0.092
Heterogeneity of x_1: 0.088
Heterogeneity of x_2: 0.000
Conclusions:
- The global effect of \(x_1\) arises from heterogeneous local effects, as \(h(x_1) > 0\) for (almost) all \(x_1\). The std margins (red area of height \(\pm h(x_1)\) around the global effect) misleadingly suggest that the heterogeneity is minimized at \(x_1 = \pm\frac{1}{\sqrt{3}}\). The ICE plots provide a clearer picture; they reveal two groups of effects, \(-x_1^2 + c_1\) and \(x_1^2 + c_2\) (see the sketch after this list). The heterogeneity as a scalar is \(H_{x_1} \approx 0.09\).
- Similar to \(x_1\), the global effect of \(x_2\) arises from heterogeneous local effects. However, unlike \(x_1\), both the std margins and the ICE plots indicate a constant heterogeneity along the whole axis, i.e., \(h(x_2) \approx 0.09\) for all \(x_2\). The ICE plots further show a smooth range of local effects around the global effect, without distinct groups. The heterogeneity as a scalar is \(H_{x_2} \approx 0.09\), identical to \(H_{x_1}\). This is consistent, since heterogeneity measures the interaction between a feature and all other features; as only \(x_1\) and \(x_2\) interact, their heterogeneity values should be the same.
- \(x_3\) shows no heterogeneity; all local effects align perfectly with the global effect.
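The two groups mentioned above can be verified with plain numpy. This is a sketch that only uses model.predict (not an effector API) and assumes, as stated above, that \(x_3\) enters the model additively, so it only shifts each ICE curve by a constant.
xx = np.linspace(-1, 1, 100)
ice = []
for i in range(200):                      # a subset of instances is enough for the check
    x_i = np.tile(x[i], (xx.size, 1))
    x_i[:, 0] = xx                        # vary x_1, keep x_2^i and x_3^i fixed
    y_i = model.predict(x_i)
    ice.append(y_i - y_i.mean())          # center each ICE curve
ice = np.array(ice)
expected = np.where(x[:200, [1]] < 0, -1.0, 1.0) * (xx**2 - (xx**2).mean())
print(np.allclose(ice, expected))         # True: the sign of x_2^i selects the group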
Derivations
How did PDP (and effector) arrive at these heterogeneity functions?
For \(x_1\):
The average effect is \(f^{PDP}(x_1) = 0\). The centered ICE plots are \(-x_1^2 + \frac{1}{3}\) when \(x_2^i < 0\) and \(x_1^2 - \frac{1}{3}\) when \(x_2^i \geq 0\). Due to the square, they both create the same deviation from the average effect, \(\left(x_1^2 - \frac{1}{3}\right)^2\), which is exactly the heterogeneity function \(h(x_1)\).
The heterogeneity as a scalar is simply the mean of the heterogeneity function: \(H_{x_1} = \frac{1}{2}\int_{-1}^{1} \left(x_1^2 - \frac{1}{3}\right)^2 dx_1 = \frac{4}{45} \approx 0.09\).
For \(x_2\):
The average effect is \(f^{PDP}(x_2) = -\frac{1}{3} \mathbb{1}_{x_2 < 0} + \frac{1}{3} \mathbb{1}_{x_2 \geq 0}\) and the ICE plots are \(f^{ICE,i}(x_2) = - (x_1^i)^2 \mathbb{1}_{x_2 < 0} + (x_1^i)^2 \mathbb{1}_{x_2 \geq 0}\).
For each \(x_2\), the deviation of the \(i\)-th ICE plot from the average effect is \(\left((x_1^i)^2 - \frac{1}{3}\right)^2\), so the heterogeneity function is constant: \(h(x_2) = \mathbb{E}_{x_1}\left[\left(x_1^2 - \frac{1}{3}\right)^2\right] = \frac{4}{45}\). The heterogeneity as a scalar is simply the mean of the heterogeneity function, so \(H_{x_2} = \frac{4}{45} \approx 0.09\).
For \(x_3\), there is no heterogeneity so \(h(x_3)=0\) and \(H_{x_3} = 0\).
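As a quick numeric sanity check of these derivations (plain numpy, independent of effector), the analytic values match the scalars printed earlier (≈ 0.09).
xx = np.linspace(-1, 1, 1_000)
h1 = (xx**2 - 1/3)**2                     # h(x_1) derived above
print(h1.mean())                          # ~4/45 ~= 0.089 (cf. the printed 0.092)
print((xx**2).var())                      # h(x_2) = Var[(x_1^i)^2] ~= 0.089 (cf. the printed 0.088)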
Conclusions
PDP heterogeneity provides intuitive insights:
- The global effects of \(x_1\) and \(x_2\) arise from heterogeneous local effects, while \(x_3\) shows no heterogeneity.
- The heterogeneity of \(x_1\) and \(x_2\) is quantified at the same level (\(\approx 0.09\)), which makes sense since only these two features interact.
- However, the heterogeneity function of \(x_1\) can be misleading when presented as std margins around the centered PDP, as it falsely suggests that heterogeneity vanishes at \(x_1 = \pm\frac{1}{\sqrt{3}}\), hiding the two distinct groups of ICE curves.
def pdp_ground_truth(feature, xs):
    """Analytic PDP heterogeneity h(x) for each feature, as derived above."""
    if feature == 0:
        ff = lambda x: x**4 - 2/3*x**2 + 1/9  # (x^2 - 1/3)^2
        return ff(xs)
    elif feature == 1:
        ff = lambda x: np.ones_like(x) * 0.088  # constant, ~4/45
        return ff(xs)
    elif feature == 2:
        ff = lambda x: np.zeros_like(x)  # no heterogeneity
        return ff(xs)
# make a test
xx = np.linspace(-1, 1, 100)
for feature in [0, 1, 2]:
    pdp_mean, pdp_heter = pdp.eval(feature=feature, xs=xx, centering=True, heterogeneity=True)
    y_heter = pdp_ground_truth(feature, xx)
    np.testing.assert_allclose(pdp_heter, y_heter, atol=1e-1)
ALE
Effector
Let's see below the ALE effects and their heterogeneity for each feature, using effector.
ale = effector.ALE(x, model.predict, axis_limits=dataset.axis_limits)
ale.fit(features="all", centering=True, binning_method=effector.axis_partitioning.Fixed(nof_bins=31))
for feature in [0, 1, 2]:
    ale.plot(feature=feature, centering=True, heterogeneity=True, y_limits=[-2, 2])
ALE states that:

- Feature \(x_1\): The heterogeneity varies across all values of \(x_1\). It starts large at \(x_1 = -1\), decreases until it becomes zero at \(x_1 = 0\), and then increases again until \(x_1 = 1\). This behavior contrasts with the heterogeneity observed in the PDP, which has two zero-points at \(x_1 = \pm\frac{1}{\sqrt{3}}\).
- Feature \(x_2\): Heterogeneity is observed only around \(x_2 = 0\). This also contrasts with the PDP heterogeneity, which is constant for all values of \(x_2\).
- Feature \(x_3\): No heterogeneity is present for this feature.
ale.feature_effect["feature_1"]
{'limits': array([-1. , -0.93548387, -0.87096774, -0.80645161, -0.74193548,
-0.67741935, -0.61290323, -0.5483871 , -0.48387097, -0.41935484,
-0.35483871, -0.29032258, -0.22580645, -0.16129032, -0.09677419,
-0.03225806, 0.03225806, 0.09677419, 0.16129032, 0.22580645,
0.29032258, 0.35483871, 0.41935484, 0.48387097, 0.5483871 ,
0.61290323, 0.67741935, 0.74193548, 0.80645161, 0.87096774,
0.93548387, 1.00000001]),
'dx': array([0.06451613, 0.06451613, 0.06451613, 0.06451613, 0.06451613,
0.06451613, 0.06451613, 0.06451613, 0.06451613, 0.06451613,
0.06451613, 0.06451613, 0.06451613, 0.06451613, 0.06451613,
0.06451613, 0.06451613, 0.06451613, 0.06451613, 0.06451613,
0.06451613, 0.06451613, 0.06451613, 0.06451613, 0.06451613,
0.06451613, 0.06451613, 0.06451613, 0.06451613, 0.06451613,
0.06451614]),
'points_per_bin': array([320, 339, 318, 345, 313, 312, 331, 311, 293, 326, 309, 294, 314,
312, 335, 318, 354, 328, 312, 331, 315, 307, 341, 320, 330, 336,
312, 368, 325, 294, 337]),
'bin_effect': array([0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
9.93884099, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. ]),
'bin_variance': array([ 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
82.51292593, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. ]),
'bin_estimator_variance': array([0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.25947461, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. ]),
'alg_params': 'fixed',
'norm_const': np.float64(0.32060777402612833)}
Derivations
For \(x_1\):
The \(x_1\)-axis is divided into \(K\) equal bins, indexed by \(k = 1, \ldots, K\), with the center of the \(k\)-th bin denoted as \(c_k\). In each bin, if \(x_2 < 0\) the local effect is \(-2c_k\), and if \(x_2 \geq 0\) the local effect is \(2c_k\). This gives a variance of \(4c_k^2\) within each bin. Therefore, \(h(x_1) = 4c_k^2\), where \(k\) is the index of the bin that contains \(x_1\).
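The within-bin variance can also be checked numerically with a hand-rolled finite difference. This is a sketch that uses only model.predict; since \(x_2\) and \(x_3\) are kept fixed, any additive \(x_3\) term cancels out.
K = 31
limits = np.linspace(-1, 1, K + 1)
k = 5                                              # any bin; c_k is its center
z_l, z_r = limits[k], limits[k + 1]
c_k = (z_l + z_r) / 2
in_bin = (x[:, 0] >= z_l) & (x[:, 0] < z_r)        # instances whose x_1 falls in the bin
x_l, x_r = x[in_bin].copy(), x[in_bin].copy()
x_l[:, 0], x_r[:, 0] = z_l, z_r
local_effects = (model.predict(x_r) - model.predict(x_l)) / (z_r - z_l)   # ~= +-2*c_k
print(local_effects.var(), 4 * c_k**2)             # within-bin variance ~= 4*c_k^2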
For \(x_2\):
In all bins except the central one, the local effects are zero. In the central bin, however, the local effects are \(2(x_1^i)^2\), which introduces some heterogeneity. So:

$$
h(x_2) = \begin{cases} 0, & \text{if } x_2 \text{ is not in the central bin } (k = K/2), \\ \frac{31}{3} \approx 10, & \text{if } x_2 \text{ is in the central bin } (k = K/2). \end{cases}
$$
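The same hand computation for the central bin connects the derivation to the dictionary printed earlier: the mean of the local effects corresponds to bin_effect (≈ 9.9) and their variance to bin_variance (≈ 82.5). Again, this is a sketch that uses only model.predict.
K = 31
limits = np.linspace(-1, 1, K + 1)
z_l, z_r = limits[K // 2], limits[K // 2 + 1]      # the bin that contains x_2 = 0
in_bin = (x[:, 1] >= z_l) & (x[:, 1] < z_r)
x_l, x_r = x[in_bin].copy(), x[in_bin].copy()
x_l[:, 1], x_r[:, 1] = z_l, z_r
local_effects = (model.predict(x_r) - model.predict(x_l)) / (z_r - z_l)   # = 2*(x_1^i)^2 / dx
print(local_effects.mean(), local_effects.var())   # cf. bin_effect and bin_variance above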
For \(x_3\):
The effect is zero everywhere, so \(h(x_3) = 0\).
def ale_ground_truth(feature):
    """Analytic ALE heterogeneity (bin variance) for each feature, with K=31 fixed bins."""
    K = 31
    bin_centers = np.linspace(-1 + 1/K, 1 - 1/K, K)
    if feature == 0:
        return 4*bin_centers**2
    elif feature == 1:
        y = np.zeros_like(bin_centers)
        y[15] = np.nan  # central bin: nonzero, excluded from the test below
        return y
    elif feature == 2:
        return np.zeros_like(bin_centers)
# make a test
K = 31
bin_centers = np.linspace(-1 + 1/K, 1 - 1/K, K)
for feature in [0, 1, 2]:
    bin_var = ale.feature_effect[f"feature_{feature}"]["bin_variance"]
    gt_var = ale_ground_truth(feature)
    mask = ~np.isnan(gt_var)
    np.testing.assert_allclose(bin_var[mask], gt_var[mask], atol=1e-1)
Conclusions
Is the heterogeneity implied by the ALE plots meaningful? It is, in the sense that it faithfully reflects how the bin-local (finite-difference) effects vary: for \(x_1\) the local effects \(\pm 2c_k\) disagree in every bin except the one around \(x_1 = 0\), while for \(x_2\) they disagree only in the bin that contains the discontinuity at \(x_2 = 0\).
RHALE
Effector
Let's see below the RHALE effects for each feature, using effector.
rhale = effector.RHALE(x, model.predict, model.jacobian, axis_limits=dataset.axis_limits)
rhale.fit(features="all", centering=True)
for feature in [0, 1, 2]:
    rhale.plot(feature=feature, centering=True, heterogeneity=True, y_limits=[-2, 2])
RHALE states that:

- Feature \(x_1\): As in ALE, the heterogeneity varies across all values of \(x_1\). It starts large at \(x_1 = -1\), decreases until it becomes zero at \(x_1 = 0\), and then increases again until \(x_1 = 1\).
- Feature \(x_2\): No heterogeneity is present for this feature.
- Feature \(x_3\): No heterogeneity is present for this feature.
Derivations

RHALE uses the derivatives \(\frac{\partial f}{\partial x_j}\) as local effects, instead of the finite differences used by ALE.

For \(x_1\): \(\frac{\partial f}{\partial x_1} = -2x_1\) when \(x_2 < 0\) and \(2x_1\) when \(x_2 \geq 0\), so inside a bin centered at \(c_k\) the local effects are approximately \(\pm 2c_k\) and \(h(x_1) \approx 4c_k^2\), as in ALE.

For \(x_2\): the interaction term is piecewise constant in \(x_2\), so \(\frac{\partial f}{\partial x_2} = 0\) for (almost) every instance; all local effects are zero and \(h(x_2) = 0\). The interaction is carried entirely by the discontinuity at \(x_2 = 0\), which derivatives do not capture.

For \(x_3\): the derivative \(\frac{\partial f}{\partial x_3}\) does not depend on \(x_1\) or \(x_2\), so all instances share the same local effect at any given \(x_3\) and no heterogeneity appears.
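A quick check of this, using the analytic jacobian that was passed to RHALE above. This is a sketch; it assumes that \(\partial f / \partial x_1\) comes only from the interaction term defined at the top, and that the jacobian differentiates the indicator terms as piecewise constants.
jac = model.jacobian(x)                            # shape (N, 3): df/dx_j for each instance
print(np.abs(jac[:, 1]).max())                     # ~0: no local effect of x_2 anywhere
# df/dx_1 is -2*x_1 when x_2 < 0 and +2*x_1 when x_2 >= 0 -> heterogeneous local effects
print(np.allclose(jac[:, 0], np.where(x[:, 1] < 0, -2 * x[:, 0], 2 * x[:, 0])))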
Conclusions
Are the RHALE effects intuitive? For \(x_1\), yes: the heterogeneity mirrors the disagreement of the derivative-based local effects \(\mp 2x_1\). For \(x_2\), however, RHALE reports zero heterogeneity even though \(x_2\) interacts with \(x_1\), because the interaction is carried entirely by the discontinuity at \(x_2 = 0\), which derivative-based local effects cannot capture.