domain or task). If the model has access to only a limited number of samples for training on the new distribution, this is referred to as few-shot learning; with no samples at all, zero-shot learning. If the model is able to adapt to new distributions over time, or to accumulate knowledge across different tasks without retraining from scratch [87], this is referred to as continual learning or incremental learning. Many of these settings are marketed in business as ‘out-of-the-box’ or ‘self-learning’, yet without any formal definitions given. Domain and task generalization are major selling points of pretrained LLMs, which are able to perform well on a wide range of tasks and domains. In the case of very different distributions, e.g., a different task/expected output or an additional domain/input modality, it is often necessary to fine-tune the model on a small amount of data from the new distribution, which is known as transfer learning. Specific to LLMs, instruction tuning is a form of transfer learning where samples from a new distribution are appended with natural language instructions [69, 532]. This approach is used in Chapter 5 to adapt pretrained LLMs to the task of DocVQA, in an effort to reduce the amount of annotated data required to generalize to unseen domains and questions.

2.2.2 Confidence Estimation

A quintessential component of reliability and robustness is a model’s ability to estimate its own uncertainty or, inversely, to translate model outputs into probabilities or ‘confidence’ (Definition 6).

Definition 6 [Confidence Scoring Function]. Any function g : X → ℝ whose continuous output aims to separate a model’s failures from its correct predictions can be interpreted as a confidence scoring function (CSF) [193].

Note that while it is preferable for the output domain of g to lie in [0, 1] for easier thresholding, this is not a strict requirement. Circling back to the question of why one needs a CSF, there are multiple reasons: i) ML models are continually improving, yet zero test error is an illusion; even a toy dataset such as MNIST is not perfectly separable; ii) once a model is deployed, performance deterioration is expected as the i.i.d. assumption breaks down; iii) generative models are prone to hallucinations [198], requiring control mechanisms and guardrails to guide them.

Below, we present some common CSFs used in practice [114, 172, 194, 539], where for convenience a subscript denotes the k-th element of the output vector g(x), i.e., g_k(x).
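As a preview, the following minimal sketch implements three widely used CSFs for a softmax classifier: the maximum softmax probability, the negative predictive entropy, and the top-2 margin. It assumes the model exposes a raw logit vector; the NumPy implementation and function names are illustrative on our part and not taken from the cited works.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability: g(x) = max_k g_k(x)."""
    return softmax(logits).max(axis=-1)

def negative_entropy(logits: np.ndarray) -> np.ndarray:
    """Negative predictive entropy, sum_k g_k(x) log g_k(x); higher means more confident."""
    p = softmax(logits)
    return np.sum(p * np.log(p + 1e-12), axis=-1)

def margin(logits: np.ndarray) -> np.ndarray:
    """Gap between the two largest softmax probabilities; small gaps flag ambiguity."""
    p = np.sort(softmax(logits), axis=-1)  # ascending sort
    return p[..., -1] - p[..., -2]
```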
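To illustrate why an output domain within [0, 1] eases thresholding, a hypothetical usage example continuing the sketch above: pairing a CSF with a single threshold yields an accept/abstain rule. The threshold value here is purely illustrative; in practice it would be tuned on held-out validation data.

```python
import numpy as np  # msp() is defined in the sketch above

logits = np.array([[2.1, 0.3, -1.0],   # confidently classified input
                   [0.2, 0.1, 0.0]])   # near-uniform, ambiguous input
confidence = msp(logits)    # values lie in [1/K, 1], so one threshold suffices
tau = 0.6                   # illustrative threshold, not a recommended value
accept = confidence >= tau  # rejected inputs can be routed to a fallback or a human
print(confidence, accept)   # approximately [0.83 0.37] [ True False]
```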