domain or task). If the model has access to only a limited number of samples for training on the new distribution, this is referred to as few-shot learning; with no samples at all, zero-shot learning. If the model is able to adapt to new distributions over time, or to accumulate knowledge across different tasks without retraining from scratch [87], this is referred to as continual learning or incremental learning. Many of these settings are marketed in business as ‘out-of-the-box’ or ‘self-learning’, yet without any formal definitions given. Domain and task generalization are major selling points of pretrained LLMs, which are able to perform well on a wide range of tasks and domains. In the case of very different distributions, e.g., a different task/expected output or an additional domain/input modality, it is often necessary to fine-tune the model on a small amount of data from the new distribution, which is known as transfer learning. Specific to LLMs, instruction tuning is a form of transfer learning where samples from a new distribution are appended with natural language instructions [69, 532]. This approach is used in Chapter 5 to adapt pretrained LLMs to the task of DocVQA, in an effort to reduce the amount of annotated data required to generalize to unseen domains and questions.

2.2.2 Confidence Estimation

A quintessential component of reliability and robustness is a model’s ability to estimate its own uncertainty or, inversely, to translate model outputs into probabilities or ‘confidence’ (Definition 6).

Definition 6 [Confidence Scoring Function]. Any function g : X → ℝ whose continuous output aims to separate a model’s failures from its correct predictions can be interpreted as a confidence scoring function (CSF) [193].

Note that while it is preferable for the output domain of g to lie in [0, 1] for easier thresholding, this is not a strict requirement. Circling back to the question of why one needs a CSF, there are multiple reasons: i) ML models are continually improving, yet zero test error is an illusion; even a toy dataset such as MNIST is not perfectly separable; ii) once a model is deployed, performance deterioration is expected as the i.i.d. assumption breaks down; iii) generative models are prone to hallucinations [198], requiring control mechanisms and guardrails to guide them.

Below, we present some common CSFs used in practice [114, 172, 194, 539], where for convenience a subscript denotes the k-th element of the output vector g(x), i.e., g_k(x).
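As a preview, the following minimal sketch implements three widely used CSFs for a softmax classifier: the maximum softmax probability, the negative predictive entropy, and the top-2 margin. It assumes the model exposes a raw logit vector; the NumPy implementation and function names are illustrative on our part and not taken from the cited works.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability: g(x) = max_k g_k(x)."""
    return softmax(logits).max(axis=-1)

def negative_entropy(logits: np.ndarray) -> np.ndarray:
    """Negative predictive entropy, sum_k g_k(x) log g_k(x); higher means more confident."""
    p = softmax(logits)
    return np.sum(p * np.log(p + 1e-12), axis=-1)

def margin(logits: np.ndarray) -> np.ndarray:
    """Gap between the two largest softmax probabilities; small gaps flag ambiguity."""
    p = np.sort(softmax(logits), axis=-1)  # ascending sort
    return p[..., -1] - p[..., -2]
```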
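To illustrate why an output domain within [0, 1] eases thresholding, a hypothetical usage example continuing the sketch above: pairing a CSF with a single threshold yields an accept/abstain rule. The threshold value here is purely illustrative; in practice it would be tuned on held-out validation data.

```python
import numpy as np  # msp() is defined in the sketch above

logits = np.array([[2.1, 0.3, -1.0],   # confidently classified input
                   [0.2, 0.1, 0.0]])   # near-uniform, ambiguous input
confidence = msp(logits)    # values lie in [1/K, 1], so one threshold suffices
tau = 0.6                   # illustrative threshold, not a recommended value
accept = confidence >= tau  # rejected inputs can be routed to a fallback or a human
print(confidence, accept)   # approximately [0.83 0.37] [ True False]
```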