The standard curve metric (AURC) can be obtained by sorting all CSF estimates from high to low and evaluating, for each threshold t, the risk (FP/(TP + FP)) and the coverage ((TP + FP)/(TP + FP + FN + TN)), where an instance counts as positive (P) if its confidence score is above the threshold and as true (T) if its prediction is correct. Correctness is normally based on exact match, yet for the generative evaluation in Section 5.3.5 we have applied ANLS thresholding instead. Formulated this way, the best possible AURC is constrained by the model’s test error (1 - ANLS) and the number of test instances. AURC might be more sensible for evaluating in a high-accuracy regime (e.g., 95% accuracy), where risk can be better controlled and error tolerance is an a priori system-level decision [115]. This metric was used in every chapter of Part II; a minimal computational sketch is given at the end of this section.

For the evaluation under distribution shift in Chapter 3, we have used binary classification metrics following [172], the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPR), which are threshold-independent measures that summarize detection statistics of positive (out-of-distribution) versus negative (in-distribution) instances. In this setting, AUROC corresponds to the probability that a randomly chosen out-of-distribution sample is assigned a higher confidence score than a randomly chosen in-distribution sample. AUPR is more informative under class imbalance.

2.2.4 Calibration

The study of calibration originated in the meteorology and statistics literature, primarily in the context of proper loss functions [330] for evaluating probabilistic forecasts. Calibration promises i) interpretability, ii) system integration, iii) active learning, and iv) improved accuracy. A calibrated model, as defined in Definition 8, can be interpreted as a probabilistic model, can be integrated into a larger system, and can guide active learning with potentially fewer samples. Research into calibration regained popularity after repeated empirical observations of overconfidence in DNNs [156, 339].

Definition 8 (Perfect calibration). [86, 88, 520] Calibration is a property of an empirical predictor f which states that, on finite-sample data, it converges to a solution where the confidence scoring function reflects the probability ρ of being correct. Perfect calibration, CE(f) = 0, is satisfied iff:

P(Y = Ŷ | f(X) = ρ) = ρ,  ∀ρ ∈ [0, 1]   (2.15)

Below, we characterize calibration research in two directions:

(A) CSF evaluation with both theoretical guarantees and practical estimation methodologies

• Estimators for calibration notions beyond top-1 [229, 231, 342, 463]
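As a concrete illustration of the risk-coverage procedure described at the start of this section, the following minimal sketch sorts confidence scores from high to low and accumulates risk and coverage at every threshold. The helper names (risk_coverage_curve, aurc) and the boolean correctness input (exact match, or ANLS above a threshold) are illustrative assumptions, not the implementation used in the experiments.

```python
import numpy as np

def risk_coverage_curve(confidences, correctness):
    """Risk and coverage at every threshold, sweeping the CSF from high to low.

    confidences : per-instance confidence scores from the CSF
    correctness : boolean per-instance correctness (e.g., exact match or ANLS-thresholded)
    """
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=bool)
    order = np.argsort(-confidences)                 # accept the most confident instances first
    errors = ~correctness[order]
    n = len(confidences)
    coverage = np.arange(1, n + 1) / n               # fraction of accepted instances: (TP+FP)/(TP+FP+FN+TN)
    risk = np.cumsum(errors) / np.arange(1, n + 1)   # error rate among accepted instances: FP/(TP+FP)
    return coverage, risk

def aurc(confidences, correctness):
    """Area under the discrete risk-coverage curve (lower is better)."""
    _, risk = risk_coverage_curve(confidences, correctness)
    return float(risk.mean())

# Toy usage: a CSF that ranks all errors below all correct predictions achieves a low AURC.
scores  = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
correct = np.array([True, True, True, False, False])
print(aurc(scores, correct))
```

As noted above, the best attainable value of this quantity is bounded by the model’s test error and the number of test instances, since even a perfectly ranked CSF must eventually accept the erroneous predictions at full coverage.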
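To accompany Definition 8 and the estimation direction listed under (A), the sketch below shows the common equal-width-binning estimator of top-1 calibration error. The function name expected_calibration_error and the default bin count are assumptions for illustration; the cited estimators for calibration notions beyond top-1 [229, 231, 342, 463] differ from this simple baseline.

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=15):
    """Binned top-1 ECE: per-bin |accuracy - mean confidence| gap, weighted by bin mass.

    A finite-sample estimate of the calibration error CE(f) in Definition 8;
    perfect calibration corresponds to an ECE of 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence to an equal-width bin over [0, 1].
    bins = np.clip(np.digitize(confidences, edges, right=True) - 1, 0, n_bins - 1)
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correctness[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

# Toy usage: an overconfident predictor (high confidence, mediocre accuracy) yields a large ECE.
conf = np.array([0.95, 0.90, 0.92, 0.88, 0.91])
corr = np.array([1, 0, 1, 0, 1])
print(expected_calibration_error(conf, corr, n_bins=10))
```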