RQ 2. In which settings are PUQ methods most useful, i.e., to which failure sources and distribution shifts are they most sensitive?

RQ 3. How can we obtain better PUQ estimates without relying on computationally prohibitive methods such as Deep Ensembles [238]?

RQ 4. How strongly do choices of prior, neural architecture, and hyperparameters influence the quality of PUQ estimates?

In a later chapter (Chapter 5), we introduce a comprehensive benchmark for generic DU that additionally tests robustness to domain, visual, and layout shifts, and explores the novel problem of hallucination and control in natural language generation (NLG) with LLMs from the perspective of calibrated and selective DocVQA. The general task formulation involves a natural language question (on content, aspect, form, or visual/layout properties), an input document, and a set of reference answers. The model is expected to produce a natural language answer, an answer confidence, and a (binary) abstention decision. Evaluation is carried out in terms of answer correctness, calibration, and selective prediction. On the one hand, a model is expected to lower its confidence when unsure about the correctness of a predicted answer; on the other hand, it is expected to abstain from answering, and to refrain from hallucinating, on unanswerable questions (which are explicitly included in the dataset).

RQ 5. How severe is the problem of hallucination and control in LLMs when evaluated in a selective, free-form DocVQA task setting?

1.2.2 Realistic and Efficient Document Understanding

The second part of the dissertation focuses on the more applied research questions of realistic and efficient DU. The overall objective is to make DU technology more generically applicable (Chapter 5), to bring evaluation more in sync with real-world requirements (Chapters 4 and 5), and to model the multimodal and compositional nature of documents more efficiently (Chapters 5 and 6).

Owing to the proximity of DU to business applications and the risk of leaking personal information, DU research benchmarks have diverged substantially from the real-world distribution of document data. For instance, DU datasets are often limited to single-page document images and drawn from outdated sources (e.g., IIT-