Computing Metrics¶
Within the context of a classification experiment, a metric is a function that takes a confusion matrix as argument and yields a single value summarizing the classifier's performance or quality. Typically, the larger the metric value, the better the classifier performed on the evaluation data.
Metrics come in two flavors:
- Binary: the metric computes a single value for each class in isolation. As a result, a single \(|\mathcal{Y}|\times|\mathcal{Y}|\) confusion matrix produces \(|\mathcal{Y}|\) values, one for each condition class.
- Multi-Class: the metric computes a single value for the whole confusion matrix, holistically combining the performance on individual classes into a single value.
While binary metrics are invaluable when comparing performance across different classes, multi-class metrics are substantially easier to interpret as a summary of the model's overall performance. When a single value is preferred, a multi-class metric can be constructed from a binary one by adding a metric averaging method.
This process is shown in the figure below. To compute the F1 score for each class, we first split the multi-class confusion matrix into virtual binary confusion matrices, then compute the F1 score for each class in isolation, and finally average the scores across the different classes to produce a single F1-macro score.
In practice, the decomposition of the multi-class confusion matrix only happens virtually, to leverage vectorized metric computations in NumPy.
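As a minimal sketch of this vectorized approach (the confusion matrix values here are illustrative, and the row/column orientation is an assumption), the per-class counts can be read off the multi-class matrix directly, without ever materializing the binary matrices:

```python
import numpy as np

# Example 3-class confusion matrix: rows = condition (true) class,
# columns = predicted class. Values are purely illustrative.
cm = np.array([
    [50,  3,  2],
    [ 4, 40,  6],
    [ 1,  5, 44],
])

# Per-class counts, computed in one vectorized pass:
tp = np.diag(cm)          # true positives per class
fp = cm.sum(axis=0) - tp  # predicted as class c, but belonging elsewhere
fn = cm.sum(axis=1) - tp  # belonging to class c, but predicted otherwise

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # binary F1, per class

f1_macro = f1.mean()  # macro averaging: unweighted mean over classes
```

Each entry of `f1` corresponds to one of the virtual binary confusion matrices; the final `.mean()` is the macro-averaging step.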
With prob_conf_mat, all binary metrics can be combined with a metric averaging method using metric syntax strings.
Computing Metrics in Order¶
Many of the more complex classification metrics depend on other metrics. For example, the most common definition of the F1 score is as the harmonic mean of the precision and recall:

\[
\text{F1} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
To ensure no metric is computed before its dependencies, and to limit the amount of repeated work, we can create a computation schedule by generating a topological sort of the metric dependency tree.
The figure above displays such a computation schedule for F1. It depends on the PPV and TPR, which in turn depend on the two conditional distributions, each of which is computed from the (normalized) confusion matrix.
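The schedule itself can be produced with a standard topological sort, for instance using Python's standard-library `graphlib`. The metric names and dependency edges below mirror the schedule described above, but are otherwise illustrative:

```python
from graphlib import TopologicalSorter

# Metric dependency graph: each metric maps to the metrics it depends on.
dependencies = {
    "norm_conf_mat": [],
    "p_pred_given_cond": ["norm_conf_mat"],  # row-conditional distribution
    "p_cond_given_pred": ["norm_conf_mat"],  # column-conditional distribution
    "tpr": ["p_pred_given_cond"],            # recall / sensitivity
    "ppv": ["p_cond_given_pred"],            # precision
    "f1": ["ppv", "tpr"],
}

# A topological sort yields a valid computation schedule: every metric
# appears only after all of its dependencies have been computed.
schedule = list(TopologicalSorter(dependencies).static_order())
```

Because each metric is computed exactly once and cached in schedule order, shared dependencies (here, the normalized confusion matrix) are never recomputed.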