For the sake of completeness, there exist different notions of calibration, differing in the subset of predictions considered over ∆Y [463]:

I. top-1 [156]
II. top-r [159]
III. canonical calibration [51]

Formally, a classifier f is said to be canonically calibrated iff

P(Y = y_k | f(X) = ρ) = ρ_k   ∀k ∈ [K] ∧ ∀ρ ∈ [0, 1]^K, where K = |Y|.   (2.17)

However, this strictest notion of calibration becomes infeasible to compute once the cardinality of the output space exceeds a certain size [157]. For discrete target spaces with a large number of classes, there is also considerable interest in knowing that a model is calibrated on its less likely predictions. Several relaxed notions of calibration have therefore been proposed, which are more feasible to compute and can be used to compare models on a more equal footing. These include top-label [157], top-r [159], within-top-r [159], and marginal calibration [229, 231, 342, 492]; a minimal estimation sketch for the binned top-1 case is given at the end of Section 2.2.5.

2.2.5 Predictive Uncertainty Quantification

Bayes' theorem [26] is a fundamental result in probability theory that provides a principled way to update beliefs about an event given new evidence. Bayesian Deep Learning (BDL) methods build on these solid mathematical foundations and promise reliable predictive uncertainty quantification (PUQ) [124, 136, 140, 238, 301, 325, 326, 464, 466, 496]. The Bayesian approach consists of casting learning and prediction as an inference task about hypotheses (uncertain quantities, with θ representing all BNN parameters: weights w, biases b, and model structure) from training data (measurable quantities, D = {(x_i, y_i)}_{i=1}^N = (X, Y)). Bayesian Neural Networks (BNNs) are in theory able to avoid the pitfalls of stochastic non-convex optimization of non-linear tunable functions with many high-dimensional parameters [300]. More specifically, BNNs capture the uncertainty in the network parameters by learning a distribution over them rather than a single point estimate. This offers advantages in terms of data efficiency, resistance to overfitting thanks to the regularizing effect of parameter priors, model complexity control, and robustness to noise due to the probabilistic nature of the approach. However, BNNs come with their own challenges, such as the increased computational cost of learning and inference, the difficulty of specifying appropriate weight or function priors, and the need for specialized training algorithms or architectural extensions.
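Concretely, the inference task referred to above can be written out in its standard form using the notation introduced in this section (here x* and y* denote a new test input and its target, symbols not used earlier). The posterior over the parameters θ follows from Bayes' theorem,

p(θ | D) = p(D | θ) p(θ) / p(D),   with   p(D) = ∫ p(D | θ) p(θ) dθ,

and predictions for a new input are obtained by marginalizing over this posterior rather than relying on a single parameter setting:

p(y* | x*, D) = ∫ p(y* | x*, θ) p(θ | D) dθ.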
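The predictive integral above is intractable for deep networks, so in practice it is approximated with a finite number of parameter samples drawn from an (approximate) posterior, obtained for instance via variational inference or Monte Carlo dropout. The following is a minimal sketch of this Monte Carlo approximation, not the implementation of any specific method discussed here; model_fn and posterior_samples are hypothetical placeholders standing in for a model evaluated under given parameters and for a collection of parameter samples.

import numpy as np

def mc_predictive(model_fn, posterior_samples, x):
    """Monte Carlo estimate of the posterior predictive p(y | x, D).

    model_fn(theta, x) is assumed to return a vector of class
    probabilities for input x under parameters theta, and
    posterior_samples is an iterable of parameter samples drawn from an
    (approximate) posterior p(theta | D); both are hypothetical
    placeholders, not part of any specific library.
    """
    probs = np.stack([model_fn(theta, x) for theta in posterior_samples])
    mean = probs.mean(axis=0)  # predictive mean, approximating the integral
    std = probs.std(axis=0)    # spread over samples, a simple uncertainty signal
    return mean, std

Averaging the sampled probability vectors approximates the predictive mean, while their spread indicates how strongly the prediction depends on the particular parameter draw; more refined uncertainty measures, such as predictive entropy, can be computed from the same samples.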
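Such (approximate) predictive distributions can, in turn, be evaluated against the calibration notions discussed earlier. The sketch below estimates a binned expected calibration error for the top-1 notion using equal-width confidence bins; the estimator form, the bin count, and the function name are illustrative assumptions rather than the procedure prescribed by the cited references.

import numpy as np

def top1_ece(probs, labels, n_bins=15):
    """Binned estimate of the top-1 expected calibration error.

    probs  : (N, K) array of predicted class probabilities.
    labels : (N,) array of integer class labels.
    Uses equal-width confidence bins; both the estimator and the bin
    count are illustrative choices.
    """
    conf = probs.max(axis=1)                  # confidence of the predicted class
    pred = probs.argmax(axis=1)               # predicted class
    correct = (pred == labels).astype(float)  # top-1 correctness indicator
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap        # weight gap by bin frequency
    return ece

A value close to zero indicates that, within each confidence bin, the average confidence of the predicted class matches its empirical accuracy.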