03. Aggregating Experiments

Note: the appearance of this notebook depends on the environment and screen size you're using. If the tables are clipped or the figures look off, consider opening it in Google Colab or on GitHub via the buttons below. This notebook was created in VSCode and will likely look best when run locally.

[Badge: Open in Google Colab] [Badge: View on GitHub]
In [ ]:
# Only run in Google Colab
#%pip install prob_conf_mat

Setup

In this example, we consider another common scenario. Unlike in the previous notebook, here we ran several experiments using the same model on different subsets of the same dataset (for example, through k-fold cross-validation). This gives us 5 independent folds which, taken together, should tell us something about the model's average behaviour on similar datasets.

In [1]:
fold_0 = [
    [32, 11],
    [0, 71],
]

fold_1 = [
    [33, 10],
    [1, 70],
]

fold_2 = [
    [25, 17],
    [0, 72],
]

fold_3 = [
    [42, 0],
    [24, 48],
]

fold_4 = [
    [34, 8],
    [3, 68],
]

For now we'll add the confusion matrices as before, and use the adjusted Balanced Accuracy score as our metric.

In [21]:
import prob_conf_mat as pcm

study = pcm.Study(
    seed=0,
    num_samples=10000,
    ci_probability=0.95,
)

for i, conf_mat in enumerate([fold_0, fold_1, fold_2, fold_3, fold_4]):
    study.add_experiment(
        f"fold_{i}",
        conf_mat,
        confusion_prior=1,
        prevalence_prior=1,
    )

study.add_metric(metric="ba+adjusted=True")

fig = study.plot_metric_summaries(
    metric="ba+adjusted=True",
    normalize=True,
    area_colour="xkcd:silver",
    area_alpha=1.0,
)
[Figure: distributions of adjusted balanced accuracy for each fold]
In [22]:
study.report_metric_summaries(metric="ba+adjusted=True")
Out[22]:
| Group | Experiment | Observed | Median | Mode | 95.0% HDI | MU | Skew | Kurt |
|---|---|---|---|---|---|---|---|---|
| fold_0 | fold_0 | 0.7442 | 0.7226 | 0.7282 | [0.5915, 0.8499] | 0.2584 | -0.3079 | 0.0694 |
| fold_1 | fold_1 | 0.7534 | 0.7322 | 0.7241 | [0.5976, 0.8548] | 0.2572 | -0.3055 | -0.0176 |
| fold_2 | fold_2 | 0.5952 | 0.5781 | 0.5694 | [0.4285, 0.7173] | 0.2888 | -0.1290 | -0.1465 |
| fold_3 | fold_3 | 0.6667 | 0.6412 | 0.6516 | [0.5259, 0.7573] | 0.2313 | -0.2383 | 0.0886 |
| fold_4 | fold_4 | 0.7673 | 0.7455 | 0.7678 | [0.6097, 0.8640] | 0.2543 | -0.4009 | 0.1590 |

In a standard analysis, we would simply report the average of the 5 observed values and call it a day. However, given that the dataset is small, and yields wide distributions of metric values, this would only provide part of the picture. Ideally, we'd be able to report an average distribution of metric values. Intuitively, this distribution would be much more precise (i.e., narrow), and centered close to the various averages.
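For reference, that standard analysis boils down to a single line of NumPy. This is just a sketch for comparison; the observed values are copied from the "Observed" column of the table above:

```python
import numpy as np

# Observed adjusted balanced accuracy per fold, taken from the table above
observed = np.array([0.7442, 0.7534, 0.5952, 0.6667, 0.7673])

print(f"Mean adjusted balanced accuracy: {observed.mean():.4f}")  # approx. 0.7054
print(f"Sample standard deviation: {observed.std(ddof=1):.4f}")
```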

Experiment Aggregation

This is possible using a variety of methods. We use conflation: an operation that combines the individual distributions into an aggregate distribution while minimizing the loss of information. This does require making a parametric assumption, though. If we (roughly) know the family of the metric distributions, the conflated distribution can be computed in closed form for many different distribution families. Usually, as long as the distributions aren't clumped near the metric's boundaries, the normal/Gaussian distribution is a decent first guess.
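For Gaussians, the closed form is particularly simple: the normalized product of the normal densities is again a normal density, with an inverse-variance weighted mean. Below is a minimal sketch of that computation, using illustrative means loosely based on the per-fold medians above and made-up variances; it is not necessarily the library's exact implementation:

```python
import numpy as np

def conflate_gaussians(means, variances):
    """Conflate N(mu_i, sigma_i^2) distributions: the normalized product of their
    densities is again Gaussian, with an inverse-variance weighted mean."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    precisions = 1.0 / variances
    conflated_var = 1.0 / precisions.sum()
    conflated_mean = conflated_var * (precisions * means).sum()
    return conflated_mean, conflated_var

# Illustrative numbers only: five experiments with similar means but wide variances
mean, var = conflate_gaussians(
    means=[0.72, 0.73, 0.58, 0.64, 0.75],
    variances=[0.0045, 0.0044, 0.0053, 0.0035, 0.0042],
)
print(f"Conflated mean: {mean:.4f}, std: {var**0.5:.4f}")
```

Because every experiment contributes precision, the conflated variance is always smaller than any individual variance, which is exactly the "narrower aggregate distribution" behaviour we are after.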

To get started we'll register the various confusion matrices again, but now to the same group (note the "group/experiment" structure).

In [23]:
del study

study = pcm.Study(
    seed=0,
    num_samples=10000,
    ci_probability=0.95,
)

for i, conf_mat in enumerate([fold_0, fold_1, fold_2, fold_3, fold_4]):
    study.add_experiment(
        f"foo/fold_{i}",
        conf_mat,
        confusion_prior=1,
        prevalence_prior=1,
    )

And we'll also add the same metric again, but now with the 'aggregation' keyword argument specified.

In [24]:
study.add_metric(metric="ba+adjusted=True", aggregation="gaussian")

We can now request the aggregated distribution as well.

In [25]:
fig = study.plot_experiment_aggregation(
    metric="ba+adjusted=True",
    experiment_group="foo",
    normalize=True,
)
[Figure: per-fold distributions of adjusted balanced accuracy and their conflated (Gaussian) aggregate for group 'foo']
In [26]:
study.report_aggregated_metric_summaries(metric="ba+adjusted=True")
Out[26]:
| Group | Median | Mode | HDI | MU | Kurtosis | Skew | Var. Within | Var. Between | I2 |
|---|---|---|---|---|---|---|---|---|---|
| foo | 0.6831 | 0.6774 | [0.6289, 0.7438] | 0.1149 | -0.0121 | -0.0161 | 0.0045 | 0.0000 | 0.00% |

Looking at the figure and the table, the aggregated distribution is indeed much narrower (its MU is roughly half that of the individual experiment distributions), and both the observed values and the medians of the individual distributions are spread symmetrically around the aggregated median value of $0.6831$.

The summary table now reports three additional statistics:

  1. Var. Within: the variance of the distributions within each experiment
  2. Var. Between: the variance between the different experiment distributions
  3. I2: the ratio of the between-experiment variance to the total variance (a toy computation is sketched below)

These statistics quantify the observed heterogeneity between the different experiment distributions. Currently, there is none ($I^2=0.00\%$), but we'll examine a case with substantial heterogeneity later.
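As a toy illustration of how these three quantities relate (the numbers below are made up, not taken from the study):

```python
# Illustrative only: how I^2 relates to the within- and between-experiment variances
var_within = 0.008   # typical variance of the individual experiment distributions
var_between = 0.002  # variance between the different experiment distributions

i_squared = var_between / (var_within + var_between)
print(f"I^2 = {i_squared:.2%}")  # share of the total variance due to between-experiment differences
```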

Parametric Assumptions

The parametric assumption we make can impact the shape and summary statistics of the final aggregated distribution. For example, the individual experiment distributions are all negatively skewed, yet we assumed the aggregated distribution to be Gaussian, which has no skew or excess kurtosis. If we had instead chosen a Beta distribution, would we have gotten a substantially different result?

In [27]:
study.add_metric(metric="ba+adjusted=True", aggregation="beta")

study.report_aggregated_metric_summaries(metric="ba+adjusted=True")
Out[27]:
| Group | Median | Mode | HDI | MU | Kurtosis | Skew | Var. Within | Var. Between | I2 |
|---|---|---|---|---|---|---|---|---|---|
| foo | 0.6845 | 0.6821 | [0.6257, 0.7452] | 0.1196 | -0.0794 | 0.0822 | 0.0044 | 0.0001 | 3.04% |

Not really, although the new aggregate distribution is now slightly skewed as well. Currently, we've implemented the following aggregation methods:

In [28]:
from collections import defaultdict

from prob_conf_mat.experiment_aggregation import AGGREGATION_REGISTRY

method_to_alias = defaultdict(list)
for alias, method in AGGREGATION_REGISTRY.items():
    method_to_alias[method].append(alias)

for method, aliases in method_to_alias.items():
    print(f"{method.__name__}: {aliases}")
SingletonAggregator: ['singleton', 'identity']
BetaAggregator: ['beta', 'beta_conflation']
GammaAggregator: ['gamma', 'gamma_conflation']
FEGaussianAggregator: ['fe', 'fixed_effect', 'fe_gaussian', 'gaussian', 'normal', 'fe_normal']
REGaussianAggregator: ['re', 'random_effect', 're_gaussian', 're_normal']
HistogramAggregator: ['hist', 'histogram']

Which is best depends on the metric and on the parametric assumption one is willing to make. For example, the 'histogram' method makes no parametric assumption at all; instead, it computes the conflated distribution from probability density estimates based on each distribution's histogram. It fails whenever there are regions where the different experiments' distributions do not overlap. Since that isn't the case here, we can safely use it:

In [29]:
study.add_metric(metric="ba+adjusted=True", aggregation="histogram")

study.report_aggregated_metric_summaries(metric="ba+adjusted=True")
Out[29]:
| Group | Median | Mode | HDI | MU | Kurtosis | Skew | Var. Within | Var. Between | I2 |
|---|---|---|---|---|---|---|---|---|---|
| foo | 0.6854 | 0.6835 | [0.6251, 0.7519] | 0.1267 | -0.1119 | 0.1342 | 0.0044 | 0.0001 | 2.53% |
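For intuition, histogram-based conflation roughly amounts to estimating each experiment's density on a shared set of bins, multiplying the per-bin densities, and renormalizing; if the distributions don't overlap, the product collapses to zero and the result is undefined. A minimal sketch (not the library's exact implementation):

```python
import numpy as np

def conflate_histograms(sample_sets, num_bins=100):
    """Conflate distributions non-parametrically: multiply per-bin density
    estimates across experiments, then renormalize."""
    all_samples = np.concatenate(sample_sets)
    bin_edges = np.linspace(all_samples.min(), all_samples.max(), num_bins + 1)

    # Per-experiment density estimates on the shared bins
    densities = np.stack(
        [np.histogram(samples, bins=bin_edges, density=True)[0] for samples in sample_sets]
    )

    # Conflation: product of the densities, renormalized to integrate to 1
    product = densities.prod(axis=0)
    normalizer = (product * np.diff(bin_edges)).sum()
    if normalizer == 0.0:
        raise ValueError("Distributions do not overlap; conflation is undefined.")
    return bin_edges, product / normalizer

# Example with three overlapping sets of samples
rng = np.random.default_rng(0)
sample_sets = [rng.normal(loc=mu, scale=0.05, size=10_000) for mu in (0.70, 0.72, 0.68)]
edges, density = conflate_histograms(sample_sets)
```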

Most metrics are bounded to the interval [0, 1] or [-1, 1]. In those cases, the Beta distribution is especially useful, since its support is naturally bounded.
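Like the Gaussian, the Beta family also has a closed-form conflation: the product of Beta densities is again proportional to a Beta density. A minimal sketch, assuming each experiment's distribution has already been summarized as a Beta(alpha_i, beta_i); how the library estimates those parameters is not shown here, and the numbers are purely illustrative:

```python
def conflate_betas(alphas, betas):
    """Conflate Beta(alpha_i, beta_i) distributions. The product of the densities
    x^(alpha_i - 1) * (1 - x)^(beta_i - 1) is proportional to another Beta density."""
    n = len(alphas)
    conflated_alpha = sum(alphas) - (n - 1)
    conflated_beta = sum(betas) - (n - 1)
    return conflated_alpha, conflated_beta

# Illustrative parameters for three experiments
alpha, beta = conflate_betas(alphas=[40.0, 38.0, 30.0], betas=[18.0, 17.0, 21.0])
print(f"Conflated distribution: Beta({alpha:.1f}, {beta:.1f})")
```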

Other metrics are only bounded on one side, or not bounded at all. For example, the diagnostic odds ratio (DOR, a ratio of ratios) is bounded to [0, $\infty$). In that case, a Gamma distribution might be a better fit:

In [30]:
study.add_metric(metric="dor", aggregation="gamma")

study.report_aggregated_metric_summaries(metric="dor", class_label=0)
Out[30]:
| Group | Median | Mode | HDI | MU | Kurtosis | Skew | Var. Within | Var. Between | I2 |
|---|---|---|---|---|---|---|---|---|---|
| foo | 5.3086 | 0.8144 | [6.3628e-09, 61.0066] | 61.0066 | 2.9846 | 12.0723 | 2702959667.0963 | 0.0000 | 0.00% |

Looking at the individual summaries and the distribution plot below, however, it is clear that this is a particularly poorly behaved metric distribution, likely because some of the confusion matrices contain zero counts in an off-diagonal cell. These produce non-finite metric values:

In [31]:
study.report_metric_summaries(metric="dor", class_label=0)
/home/ioverho/Projects/prob_conf_mat/src/prob_conf_mat/metrics/_metrics.py:976: RuntimeWarning: divide by zero encountered in divide
  return tpr / fpr
/home/ioverho/Projects/prob_conf_mat/src/prob_conf_mat/metrics/_metrics.py:1166: RuntimeWarning: divide by zero encountered in divide
  return plr / nlr
Out[31]:
| Group | Experiment | Observed | Median | Mode | 95.0% HDI | MU | Skew | Kurt |
|---|---|---|---|---|---|---|---|---|
| foo | fold_0 | inf | 300.7750 | 3286.6797 | [19.3161, 4371.1336] | 4351.8175 | 50.7696 | 3056.8731 |
| foo | fold_1 | 231.0000 | 135.7813 | 167.3978 | [15.8327, 696.0532] | 680.2205 | 56.5959 | 4457.6553 |
| foo | fold_2 | inf | 151.3990 | 25788.6734 | [7.8530, 2122.3753] | 2114.5222 | 78.6223 | 6795.9909 |
| foo | fold_3 | inf | 124.2630 | 1761.1212 | [7.5551, 1742.3460] | 1734.7908 | 56.6083 | 3710.3558 |
| foo | fold_4 | 96.3333 | 75.6193 | 48.4036 | [13.3525, 256.1462] | 242.7937 | 3.9357 | 30.4562 |
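The infinite observed values follow directly from the DOR's definition as a ratio of ratios. A quick check on the raw counts, assuming rows correspond to the true labels and class 0 is treated as the positive class (this convention reproduces the observed values above):

```python
import numpy as np

def diagnostic_odds_ratio(conf_mat):
    # DOR = (TPR / FPR) / (FNR / TNR), which simplifies to (TP * TN) / (FP * FN)
    (tp, fn), (fp, tn) = np.asarray(conf_mat, dtype=float)
    return (tp * tn) / (fp * fn)  # division by zero yields inf (with a RuntimeWarning)

print(diagnostic_odds_ratio(fold_1))  # (33 * 70) / (1 * 10) = 231.0
print(diagnostic_odds_ratio(fold_0))  # (32 * 71) / (0 * 11) = inf
```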
In [32]:
fig = study.plot_experiment_aggregation(
    metric="dor",
    class_label=0,
    experiment_group="foo",
    normalize=True,
)
[Figure: per-fold DOR distributions and their conflated (gamma) aggregate for group 'foo']

If we instead use the same metric, but log-transformed, we get a much better-behaved distribution to aggregate:

In [33]:
study.add_metric(metric="log_dor", aggregation="normal")

study.report_aggregated_metric_summaries(metric="log_dor", class_label=0)
Out[33]:
| Group | Median | Mode | HDI | MU | Kurtosis | Skew | Var. Within | Var. Between | I2 |
|---|---|---|---|---|---|---|---|---|---|
| foo | 4.8418 | 4.8560 | [3.9684, 5.6893] | 1.7209 | -0.0297 | 0.0085 | 1.2940 | 0.0000 | 0.00% |
In [34]:
study.add_metric(metric="log_dor", aggregation="normal")

fig = study.plot_experiment_aggregation(
    metric="log_dor",
    class_label=0,
    experiment_group="foo",
    normalize=True,
)
/home/ioverho/Projects/prob_conf_mat/src/prob_conf_mat/metrics/_metrics.py:1025: RuntimeWarning: divide by zero encountered in log
  lplr = np.log(tpr) - np.log(fpr)
/home/ioverho/Projects/prob_conf_mat/src/prob_conf_mat/metrics/_metrics.py:1124: RuntimeWarning: divide by zero encountered in log
  lnlr = np.log(fnr) - np.log(tnr)
[Figure: per-fold log-DOR distributions and their conflated (Gaussian) aggregate for group 'foo']

Forest Plots

While the experiment aggregation plots are useful for visually establishing the shape of the metric distributions, and for assessing how well the parametric assumptions hold, they are not the most efficient way to convey information about the experiments and their effect on the aggregated distribution.

For comparing multiple experiment groups at a quick glance, a forest plot strikes a good balance between the report_aggregated_metric_summaries and plot_experiment_aggregation methods. It is commonly used in the quantitative meta-analysis literature.

In [35]:
fig = study.plot_forest_plot(metric="ba+adjusted=True")
[Figure: forest plot of adjusted balanced accuracy for group 'foo']

The medians of the individual experiments are depicted as squares, and their HDIs as horizontal lines. The aggregated distribution is shown at the bottom, with a large diamond marking its median. The vertical line allows for easy comparison of the individual experiments against the aggregated median. As usual, there are plenty of customization options available.

Dealing with Heterogeneity

Inter-experiment heterogeneity means that the individual experiments' medians lie far apart. So far we've only come across cases of little to no heterogeneity. There are instances, however, where one might expect substantially different metric distributions belonging to the same group. For example, if we're estimating the same model's performance on different datasets.

We can model this by adding an outlier distribution. Experiment group 'bar' is identical to 'foo' for the first 5 experiments, but includes a final outlier as well. This outlier experiment is very precise, relative to the rest.

In [36]:
study.add_metric(metric="ba+adjusted=True", aggregation="normal")

for i, conf_mat in enumerate([fold_0, fold_1, fold_2, fold_3, fold_4]):
    study.add_experiment(
        f"bar/fold_{i}",
        conf_mat,
        confusion_prior=1,
        prevalence_prior=1,
    )

outlier = [
    [164, 4],
    [8, 288],
]

study.add_experiment(
    experiment_name="bar/outlier",
    confusion_matrix=outlier,
    prevalence_prior=1.0,
    confusion_prior=1.0,
)
In [39]:
fig = study.plot_forest_plot(metric="ba+adjusted=True")
[Figure: forest plot of adjusted balanced accuracy, now including group 'bar' with the outlier experiment]

All of a sudden, the aggregated distribution lies in no-man's land. Its median does not lie within the HDI of any of the individual experiments. Worse, the aggregated distribution is extremely narrow: we appear highly confident that the average metric lies in a region of values not seen in any of the individual experiments!

Note that the $I^2$ score is also provided. In the case without the outlier, this score was still very low ($\approx 3\%$). With the outlier, however, it is now $\approx 78\%$. This means that the inter-experiment variance accounts for roughly three-quarters of the total variance across the different experiments, which corresponds to a high degree of heterogeneity.

What can we do about it? The REGaussianAggregator employs a random-effects meta-analysis approach: it assumes the different experiments have some degree of inter-experiment variance, and explicitly tries to correct for it.
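For intuition, below is a minimal sketch of the classic DerSimonian-Laird random-effects estimator, which inflates each experiment's variance by an estimate of the between-experiment variance $\tau^2$ before combining. Whether the library follows this exact recipe is an assumption; the numbers are illustrative only:

```python
import numpy as np

def random_effects_combine(means, variances):
    """DerSimonian-Laird style random-effects pooling (illustrative sketch)."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / variances  # fixed-effect (inverse-variance) weights

    # Fixed-effect estimate and Cochran's Q heterogeneity statistic
    mu_fe = (w * means).sum() / w.sum()
    q = (w * (means - mu_fe) ** 2).sum()

    # Method-of-moments estimate of the between-experiment variance tau^2
    k = len(means)
    tau_sq = max(0.0, (q - (k - 1)) / (w.sum() - (w**2).sum() / w.sum()))

    # Random-effects weights: each experiment's variance is inflated by tau^2
    w_re = 1.0 / (variances + tau_sq)
    mu_re = (w_re * means).sum() / w_re.sum()
    var_re = 1.0 / w_re.sum()
    return mu_re, var_re, tau_sq

# Illustrative: five similar experiments plus one very precise outlier
mu, var, tau_sq = random_effects_combine(
    means=[0.72, 0.73, 0.58, 0.64, 0.75, 0.93],
    variances=[0.0045, 0.0044, 0.0053, 0.0035, 0.0042, 0.0004],
)
print(f"RE mean: {mu:.3f} +/- {var**0.5:.3f} (tau^2 = {tau_sq:.4f})")
```

Because the outlier's weight is bounded by $1/\tau^2$ rather than by its own tiny variance, it can no longer dominate the pooled estimate.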

In [19]:
study.add_metric(metric="ba+adjusted=True", aggregation="re_normal")

fig = study.plot_forest_plot(metric="ba+adjusted=True")
[Figure: forest plot of adjusted balanced accuracy using the random-effects (re_normal) aggregation]

When using it, we immediately see the reduced effect of the outlier. While the aggregate distribution has still shifted somewhat, it now lies comfortably within the HDI of most experiments. Furthermore, the aggregate distribution is considerably wider, and hence less precise.

So, in cases where you expect heterogeneity to be present, consider using the re_gaussian approach, as it is considerably more robust. Unfortunately, we have not (yet) implemented robust variants of the other aggregation methods.

Next Steps

In the next tutorial, we'll go into how to read confusion matrices from disk, how to save and load a Study object, and other input/output operations.

For more on experiment aggregation: