the true class. This measure more heavily penalizes sharp probabilities that, through over- or under-confidence, lie close to the wrong edge or class.

\[
\ell_{\mathrm{NLL}}(f) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}[y_i = k] \cdot \log\left(f_k(x_i)\right) \tag{2.10}
\]

• Brier Score [50] is a scoring rule that measures the accuracy of a probabilistic classifier and is related to the mean-squared error (MSE) loss function. The Brier score is more commonly used in industrial practice since it is a bounded $\ell_2$ metric (between 0 and 1 in the common binary form, and between 0 and 2 for the multi-class sum in Eq. 2.11), yet it penalizes tail probabilities less severely than NLL.

\[
\ell_{\mathrm{BS}}(f) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \left(\mathbb{I}[y_i = k] - f_k(x_i)\right)^2 \tag{2.11}
\]

All of the following metrics require a CSF $g(x)$ to be defined and can pertain to specific evaluation settings [389], which are tested in Section 3.4.5.

Expected Calibration Error (ECE) [156, 332] is a default metric to evaluate top-1 prediction miscalibration. A calibration estimator (Definition 7) measures the $L_p$ norm difference between a model's posterior and the true likelihood of being correct.

Definition 7 ($L_p$ Calibration Error) [231, 463]. The $L_p$ calibration error of $f : \mathcal{X} \to \Delta^{\mathcal{Y}}$ over the joint distribution $(\mathcal{X} \times \mathcal{Y})$ with the $L_p$ norm $p \in [1, \infty)$ is given by:

\[
\mathrm{CE}_p(f)^p = \mathbb{E}_{(X,Y)}\left[ \left\| \mathbb{E}[Y \mid f(X)] - f(X) \right\|_p^p \right] \tag{2.12}
\]

The popular ECE metric [332] with condition $\mathbb{I}[Y = \hat{y}]$ is a special case of the above with $p = 1$, where the expectation is approximated using a histogram. MaxCE defines the worst-case risk version with $p = \infty$, effectively reporting on the bin with the highest error. As part of Chapter 5, we contributed a novel empirical estimator of top-1 calibration for the task of VQA, where the exact accuracy condition $\mathbb{I}[Y = \hat{y}]$ in ECE is replaced by $\mathbb{I}[\mathrm{ANLS}(y, \hat{y}) > \tau]$. Prior work [329] used a similar strategy of thresholding continuous quality scores in order to estimate ECE.

In practice, ECE is implemented as a histogram binning estimator that discretizes predicted probabilities into ranges of possible values for which the conditional expectation can be estimated.
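As an illustration of Equations (2.10) and (2.11) and of the histogram binning idea behind ECE, the metrics can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the clipping constant `eps`, and the choice of equal-width bins with a bin-size-weighted average of accuracy/confidence gaps are all assumptions made for the example.

```python
import math


def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood, following Eq. (2.10).

    probs:  list of probability vectors f(x_i), each of length K
    labels: list of integer class indices y_i in {0, ..., K-1}
    eps:    assumed clipping constant that avoids log(0)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # The indicator I[y_i = k] keeps only the true-class term.
        total -= math.log(max(p[y], eps))
    return total / len(labels)


def brier_score(probs, labels):
    """Multi-class Brier score, following Eq. (2.11)."""
    total = 0.0
    for p, y in zip(probs, labels):
        for k, p_k in enumerate(p):
            target = 1.0 if k == y else 0.0
            total += (target - p_k) ** 2
    return total / len(labels)


def ece(probs, labels, n_bins=10):
    """Top-1 ECE with equal-width bins (assumed binning scheme)."""
    sum_conf = [0.0] * n_bins  # summed top-1 confidence per bin
    sum_acc = [0.0] * n_bins   # number of correct predictions per bin
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        # A confidence of exactly 1.0 is placed in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        sum_conf[b] += conf
        sum_acc[b] += 1.0 if pred == y else 0.0
    # |sum_acc - sum_conf| / N equals (bin size / N) * |accuracy - confidence|.
    n = len(labels)
    return sum(abs(a - c) for a, c in zip(sum_acc, sum_conf)) / n


# Two increasingly confident wrong predictions for true class 0.
for p_true in (0.1, 0.01):
    sample = [[p_true, 1.0 - p_true]]
    print(p_true, nll(sample, [0]), brier_score(sample, [0]))
```

In this toy comparison, lowering the true-class probability from 0.1 to 0.01 doubles the NLL (from about 2.30 to about 4.61), whereas the Brier score moves only from 1.62 to about 1.96, which reflects the milder penalty on tail probabilities.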
Concretely, the probability space is partitioned into $B$ bins $b_i$ with $i \in \{1, \dots, B\}$, where for each bin $b_i$ the gap between observed accuracy and bin confidence $\bar{P}_b$ is measured, with a final