the true class. This measure penalizes sharp probabilities more heavily when
over- or under-confidence places them close to the wrong edge or class.
\[
\ell_{\mathrm{NLL}}(f) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}[y_i = k] \cdot \log\left(f_k(x_i)\right) \tag{2.10}
\]

• Brier Score [50] is a scoring rule that measures the accuracy of a
probabilistic classifier and is related to the mean-squared error (MSE) loss
function. Brier score is more commonly used in industrial practice since it
is an ℓ2 metric (score between 0 and 1), yet it penalizes tail probabilities
less severely than NLL.
\[
\ell_{\mathrm{BS}}(f) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \left(\mathbb{I}[y_i = k] - f_k(x_i)\right)^2 \tag{2.11}
\]
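As a minimal sketch of Eqs. (2.10) and (2.11), the two scoring rules can be computed side by side. It assumes predictions arrive as plain Python lists of per-class probabilities and that labels are 0-indexed integers (the text indexes classes from 1 to K). The function names `nll` and `brier_score` and the clipping constant `eps` are illustrative assumptions and do not appear in the original formulation.

```python
import math


def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood, following Eq. (2.10).

    probs:  N rows, each a list of K class probabilities f_k(x_i)
    labels: N integer class indices y_i in {0, ..., K-1} (0-indexed assumption)
    eps:    hypothetical clipping constant (not in the text) to avoid log(0)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # The indicator I[y_i = k] keeps only the true-class probability.
        total += math.log(max(p[y], eps))
    return -total / len(labels)


def brier_score(probs, labels):
    """Multi-class Brier score, following Eq. (2.11)."""
    total = 0.0
    for p, y in zip(probs, labels):
        for k, p_k in enumerate(p):
            indicator = 1.0 if k == y else 0.0
            total += (indicator - p_k) ** 2
    return total / len(labels)


# Two reasonably confident, correct predictions over K = 3 classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
print(nll(probs, labels), brier_score(probs, labels))

# A confidently wrong prediction: the true class receives probability 0.001.
wrong = [[0.999, 0.001]]
print(nll(wrong, [1]), brier_score(wrong, [1]))
```

In the last example the NLL grows without bound as the true-class probability approaches 0, whereas the Brier score stays bounded, which illustrates why tail probabilities are penalized less severely. Note that when summed over all K classes as written in Eq. (2.11), the score of a fully confident wrong prediction approaches 2; the 0 to 1 range holds for the binary, single-probability formulation.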

All of the following metrics require a CSF g(x) to be defined, and can pertain to
specific evaluation settings [389] tested in Section 3.4.5.
Expected Calibration Error (ECE) [156, 332] is a default metric to evaluate
top-1 prediction miscalibration. A calibration estimator (Definition 7) measures
the Lp norm difference between a model’s posterior and the true likelihood of
being correct.
Definition 7 (Lp Calibration Error). [231, 463]
The Lp calibration error of f : X → ∆Y over the joint distribution (X × Y )
with the Lp norm p ∈ [1, ∞) is given by:


\[
\mathrm{CE}_p(f)^p = \mathbb{E}_{(X,Y)}\left[ \left\| \mathbb{E}[Y \mid f(X)] - f(X) \right\|_p^p \right] \tag{2.12}
\]
The popular ECE metric [332] with condition I[Y = ŷ] is a special case of the
above with p = 1, where the expectation is approximated using a histogram.
MaxCE defines the worst-case risk version with p = ∞, effectively reporting on
the bin with the highest error. As part of Chapter 5, we contributed a novel
empirical estimator of top-1 calibration for the task of VQA, where the exact
accuracy condition I[Y = ŷ] in ECE is replaced by I[ANLS(y, ŷ) > τ]. Prior
work [329] used a similar strategy of thresholding continuous quality scores to
be able to estimate ECE.
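The histogram approximation of the expectation can be sketched as follows. This is a minimal illustration, not the estimator used in the cited works: it assumes equal-width bins over [0, 1], half-open bin intervals, and the common weighting of each bin's gap by its share of samples. The function name, the `n_bins` parameter, and the example scores are hypothetical. Correctness is passed in as a precomputed 0/1 list, so the same routine covers both the exact condition I[Y = ŷ] and a thresholded quality score such as I[ANLS(y, ŷ) > τ].

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width histogram binning estimate of top-1 calibration error.

    confidences: top-1 confidence per sample, assumed to lie in [0, 1]
    correct:     0/1 (or bool) correctness indicator per sample
    n_bins:      number of bins B (equal-width binning is an assumption)
    Returns (ece, max_ce): the sample-weighted mean gap and the largest bin gap.
    """
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    acc_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # Map confidence to a bin index; c == 1.0 falls into the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        acc_sum[b] += float(ok)
        count[b] += 1

    ece, max_ce = 0.0, 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        gap = abs(acc_sum[b] / count[b] - conf_sum[b] / count[b])
        ece += (count[b] / n) * gap
        max_ce = max(max_ce, gap)
    return ece, max_ce


confidences = [0.95, 0.9, 0.85, 0.65, 0.55]

# Hypothetical continuous quality scores (e.g. ANLS-like), thresholded at tau.
scores = [1.0, 0.8, 0.3, 0.9, 0.4]
tau = 0.5
correct = [s > tau for s in scores]

ece, max_ce = expected_calibration_error(confidences, correct, n_bins=5)
print(ece, max_ce)
```

Returning the largest per-bin gap alongside the weighted average gives the binned counterpart of MaxCE without a second pass over the data.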
In practice, ECE is implemented as a histogram binning estimator that
discretizes predicted probabilities into ranges of possible values for which
conditional expectation can be estimated. Concretely, the probability space
is partitioned into B bins bi with i ∈ {1, ..., B}, where for each bin bi the gap
between observed accuracy and bin confidence P̄_b is measured, with a final