For the sake of completeness, there exist different notions of calibration, differing in the subset of predictions considered over ∆Y [463]:

I. top-1 [156]
II. top-r [159]
III. canonical calibration [51]

Formally, a classifier f is said to be canonically calibrated iff

P(Y = y_k | f(X) = ρ) = ρ_k   ∀k ∈ [K] ∧ ∀ρ ∈ [0, 1]^K, where K = |Y|.   (2.17)

However, this strictest notion of calibration becomes infeasible to compute once the cardinality of the output space exceeds a certain size [157]. For discrete target spaces with a large number of classes, there is also considerable interest in knowing that a model is calibrated on its less likely predictions. Several relaxed notions of calibration have therefore been proposed, which are more feasible to compute and can be used to compare models on a more equal footing. These include top-label [157], top-r [159], within-top-r [159], and marginal calibration [229, 231, 342, 492]; a minimal estimation sketch for the binned top-1 case is given at the end of Section 2.2.5.

2.2.5 Predictive Uncertainty Quantification

Bayes' theorem [26] is a fundamental result in probability theory that provides a principled way to update beliefs about an event given new evidence. Bayesian Deep Learning (BDL) methods build on these solid mathematical foundations and promise reliable predictive uncertainty quantification (PUQ) [124, 136, 140, 238, 301, 325, 326, 464, 466, 496]. The Bayesian approach consists of casting learning and prediction as an inference task about hypotheses (uncertain quantities, with θ representing all BNN parameters: weights w, biases b, and model structure) from training data (measurable quantities, D = {(x_i, y_i)}_{i=1}^N = (X, Y)). Bayesian Neural Networks (BNNs) are in theory able to avoid the pitfalls of stochastic non-convex optimization of non-linear tunable functions with many high-dimensional parameters [300]. More specifically, BNNs capture the uncertainty in the network parameters by learning a distribution over them rather than a single point estimate. This offers advantages in terms of data efficiency, resistance to overfitting thanks to the regularizing effect of parameter priors, model complexity control, and robustness to noise due to the probabilistic nature of the approach. However, BNNs come with their own challenges, such as the increased computational cost of learning and inference, the difficulty of specifying appropriate weight or function priors, and the need for specialized training algorithms or architectural extensions.
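Concretely, the inference task referred to above can be written out in its standard form using the notation introduced in this section (here x* and y* denote a new test input and its target, symbols not used earlier). The posterior over the parameters θ follows from Bayes' theorem,

p(θ | D) = p(D | θ) p(θ) / p(D),   with   p(D) = ∫ p(D | θ) p(θ) dθ,

and predictions for a new input are obtained by marginalizing over this posterior rather than relying on a single parameter setting:

p(y* | x*, D) = ∫ p(y* | x*, θ) p(θ | D) dθ.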
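The predictive integral above is intractable for deep networks, so in practice it is approximated with a finite number of parameter samples drawn from an (approximate) posterior, obtained for instance via variational inference or Monte Carlo dropout. The following is a minimal sketch of this Monte Carlo approximation, not the implementation of any specific method discussed here; model_fn and posterior_samples are hypothetical placeholders standing in for a model evaluated under given parameters and for a collection of parameter samples.

import numpy as np

def mc_predictive(model_fn, posterior_samples, x):
    """Monte Carlo estimate of the posterior predictive p(y | x, D).

    model_fn(theta, x) is assumed to return a vector of class
    probabilities for input x under parameters theta, and
    posterior_samples is an iterable of parameter samples drawn from an
    (approximate) posterior p(theta | D); both are hypothetical
    placeholders, not part of any specific library.
    """
    probs = np.stack([model_fn(theta, x) for theta in posterior_samples])
    mean = probs.mean(axis=0)  # predictive mean, approximating the integral
    std = probs.std(axis=0)    # spread over samples, a simple uncertainty signal
    return mean, std

Averaging the sampled probability vectors approximates the predictive mean, while their spread indicates how strongly the prediction depends on the particular parameter draw; more refined uncertainty measures, such as predictive entropy, can be computed from the same samples.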
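Such (approximate) predictive distributions can, in turn, be evaluated against the calibration notions discussed earlier. The sketch below estimates a binned expected calibration error for the top-1 notion using equal-width confidence bins; the estimator form, the bin count, and the function name are illustrative assumptions rather than the procedure prescribed by the cited references.

import numpy as np

def top1_ece(probs, labels, n_bins=15):
    """Binned estimate of the top-1 expected calibration error.

    probs  : (N, K) array of predicted class probabilities.
    labels : (N,) array of integer class labels.
    Uses equal-width confidence bins; both the estimator and the bin
    count are illustrative choices.
    """
    conf = probs.max(axis=1)                  # confidence of the predicted class
    pred = probs.argmax(axis=1)               # predicted class
    correct = (pred == labels).astype(float)  # top-1 correctness indicator
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap        # weight gap by bin frequency
    return ece

A value close to zero indicates that, within each confidence bin, the average confidence of the predicted class matches its empirical accuracy.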