1.2 Problem Statement and Questions

The general introduction sketches the context of the research and motivates the research questions. In this section, I will formulate the problem statement and research questions more formally and explain how they relate to the manuscript’s contents.

1.2.1 Reliable and Robust Deep Learning

The dissertation opens with the more fundamental challenge of targeting reliability and robustness in Deep Learning, which covers fairly abstract concepts that have been used interchangeably and inconsistently in the literature. They will be defined more extensively in Section 2.2, but for now, consider reliability as the ability to avoid failure, robustness as the ability to resist failure, and resilience as the ability to recover from failure [373, 438, 455].

In Chapter 3, we focus on the more concrete objective of predictive uncertainty quantification (PUQ), which shows promise for improving reliability and robustness in Deep Learning (DL) [123, 140, 173, 455]. Concretely, PUQ methods are expected to elucidate sources of uncertainty, such as a model’s lack of in-domain knowledge due to either training data scarcity or model misspecification, and to flag potentially noisy, shifted, or unknown input data [136].

We observed that the majority of prior PUQ research focused on regression and CV tasks, while the applicability of PUQ methods had not been thoroughly explored in the context of NLP. As mentioned earlier, most DU pipelines (in 2020) were text-centric, with a strong dependence on the quality of OCR. Since OCR is often considered a solved problem [262], we hypothesized that the main source of error and uncertainty in DU would reside in the text representations learned by deep neural networks (DNNs). This is why we focused on the more fundamental question of how well PUQ methods scale in NLP. More specifically, we restricted the scope to the prototypical, well-studied task of text classification, for which we could leverage existing multi-domain datasets varying in complexity, size, and label space (multi-class vs. multi-label). This leads to the following research question:

RQ 1. When tested on realistic language data distributions across various text classification tasks, how well do PUQ methods fare in NLP?
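To make these sources of uncertainty concrete, the PUQ literature commonly decomposes the total predictive uncertainty of an (approximate) Bayesian classifier into an epistemic (model) and an aleatoric (data) component. The following is a standard illustration of that decomposition, not necessarily the exact formalization adopted in Chapter 3:

\[
\underbrace{\mathcal{H}\big[\mathbb{E}_{p(\theta \mid \mathcal{D})}\, p(y \mid x, \theta)\big]}_{\text{total uncertainty}}
\;=\;
\underbrace{\mathcal{I}\big[y; \theta \mid x, \mathcal{D}\big]}_{\text{epistemic (model)}}
\;+\;
\underbrace{\mathbb{E}_{p(\theta \mid \mathcal{D})}\,\mathcal{H}\big[p(y \mid x, \theta)\big]}_{\text{aleatoric (data)}},
\]

where \(\mathcal{H}\) denotes Shannon entropy and \(p(\theta \mid \mathcal{D})\) the posterior over model parameters \(\theta\) given training data \(\mathcal{D}\). High epistemic uncertainty (the mutual information between label and parameters) indicates the lack of in-domain knowledge mentioned above, whereas high aleatoric uncertainty reflects noise inherent to the input.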
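In practice, approximate PUQ methods such as MC Dropout or deep ensembles estimate these quantities from T stochastic forward passes. The sketch below is a minimal illustration of that computation, assuming the class-probability samples have already been produced by some stochastic text classifier; the function and variable names are hypothetical, not part of any method described in this dissertation.

import numpy as np

def decompose_uncertainty(probs: np.ndarray) -> tuple[float, float, float]:
    """Decompose predictive uncertainty for one input from T stochastic passes.

    probs: array of shape (T, K) with class probabilities from T samples
    (e.g., MC Dropout forward passes or ensemble members) over K classes.
    Returns (total, aleatoric, epistemic) entropies in nats.
    """
    eps = 1e-12  # guard against log(0)
    mean_p = probs.mean(axis=0)  # predictive distribution E_theta[p(y|x,theta)]
    total = float(-(mean_p * np.log(mean_p + eps)).sum())  # H[E_theta p]
    aleatoric = float(-(probs * np.log(probs + eps)).sum(axis=1).mean())  # E_theta H[p]
    epistemic = total - aleatoric  # mutual information I[y; theta | x, D]
    return total, aleatoric, epistemic

# Toy usage: two sampled predictive distributions that disagree strongly,
# which yields high epistemic (model) uncertainty.
samples = np.array([[0.9, 0.1], [0.1, 0.9]])
print(decompose_uncertainty(samples))

On this toy input the total uncertainty is ln 2 (the averaged prediction is uniform), most of which is epistemic: the samples disagree, signaling the kind of in-domain knowledge gap that PUQ methods are expected to expose.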