The standard curve metric (AURC) can be obtained by sorting all CSF estimates from high to low and evaluating, for each threshold t, the risk (FP/(TP + FP)) and the coverage ((TP + FP)/(TP + FP + FN + TN)), where an instance counts as positive (P) if its confidence score is above the threshold and as true (T) if its prediction is correct. Correctness is normally based on exact match, yet for the generative evaluation in Section 5.3.5 we have applied ANLS thresholding instead. Formulated this way, the best possible AURC is constrained by the model’s test error (1 - ANLS) and the number of test instances. AURC might be more sensible for evaluating in a high-accuracy regime (e.g., 95% accuracy), where risk can be better controlled and error tolerance is an a priori system-level decision [115]. This metric was used in every chapter of Part II; a minimal computational sketch is given at the end of this section.

For the evaluation under distribution shift in Chapter 3, we have used binary classification metrics following [172], the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPR), which are threshold-independent measures that summarize detection statistics of positive (out-of-distribution) versus negative (in-distribution) instances. In this setting, AUROC corresponds to the probability that a randomly chosen out-of-distribution sample is assigned a higher confidence score than a randomly chosen in-distribution sample. AUPR is more informative under class imbalance.

2.2.4 Calibration

The study of calibration originated in the meteorology and statistics literature, primarily in the context of proper loss functions [330] for evaluating probabilistic forecasts. Calibration promises i) interpretability, ii) system integration, iii) active learning, and iv) improved accuracy. A calibrated model, as defined in Definition 8, can be interpreted as a probabilistic model, can be integrated into a larger system, and can guide active learning with potentially fewer samples. Research into calibration regained popularity after repeated empirical observations of overconfidence in DNNs [156, 339].

Definition 8 (Perfect calibration). [86, 88, 520] Calibration is a property of an empirical predictor f which states that, on finite-sample data, it converges to a solution where the confidence scoring function reflects the probability ρ of being correct. Perfect calibration, CE(f) = 0, is satisfied iff:

P(Y = Ŷ | f(X) = ρ) = ρ,  ∀ρ ∈ [0, 1]   (2.15)

Below, we characterize calibration research in two directions:

(A) CSF evaluation with both theoretical guarantees and practical estimation methodologies

• Estimators for calibration notions beyond top-1 [229, 231, 342, 463]
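As a concrete illustration of the risk-coverage procedure described at the start of this section, the following minimal sketch sorts confidence scores from high to low and accumulates risk and coverage at every threshold. The helper names (risk_coverage_curve, aurc) and the boolean correctness input (exact match, or ANLS above a threshold) are illustrative assumptions, not the implementation used in the experiments.

```python
import numpy as np

def risk_coverage_curve(confidences, correctness):
    """Risk and coverage at every threshold, sweeping the CSF from high to low.

    confidences : per-instance confidence scores from the CSF
    correctness : boolean per-instance correctness (e.g., exact match or ANLS-thresholded)
    """
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=bool)
    order = np.argsort(-confidences)                 # accept the most confident instances first
    errors = ~correctness[order]
    n = len(confidences)
    coverage = np.arange(1, n + 1) / n               # fraction of accepted instances: (TP+FP)/(TP+FP+FN+TN)
    risk = np.cumsum(errors) / np.arange(1, n + 1)   # error rate among accepted instances: FP/(TP+FP)
    return coverage, risk

def aurc(confidences, correctness):
    """Area under the discrete risk-coverage curve (lower is better)."""
    _, risk = risk_coverage_curve(confidences, correctness)
    return float(risk.mean())

# Toy usage: a CSF that ranks all errors below all correct predictions achieves a low AURC.
scores  = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
correct = np.array([True, True, True, False, False])
print(aurc(scores, correct))
```

As noted above, the best attainable value of this quantity is bounded by the model’s test error and the number of test instances, since even a perfectly ranked CSF must eventually accept the erroneous predictions at full coverage.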
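To accompany Definition 8 and the estimation direction listed under (A), the sketch below shows the common equal-width-binning estimator of top-1 calibration error. The function name expected_calibration_error and the default bin count are assumptions for illustration; the cited estimators for calibration notions beyond top-1 [229, 231, 342, 463] differ from this simple baseline.

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=15):
    """Binned top-1 ECE: per-bin |accuracy - mean confidence| gap, weighted by bin mass.

    A finite-sample estimate of the calibration error CE(f) in Definition 8;
    perfect calibration corresponds to an ECE of 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence to an equal-width bin over [0, 1].
    bins = np.clip(np.digitize(confidences, edges, right=True) - 1, 0, n_bins - 1)
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correctness[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

# Toy usage: an overconfident predictor (high confidence, mediocre accuracy) yields a large ECE.
conf = np.array([0.95, 0.90, 0.92, 0.88, 0.91])
corr = np.array([1, 0, 1, 0, 1])
print(expected_calibration_error(conf, corr, n_bins=10))
```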