RESEARCH CONTEXT

This thesis started almost concurrently with the rise of the global COVID-19 pandemic, which made it hard to foster collaborations in the early stages. At the start of the PhD, DU methodology was fairly established, with pipelines built on OCR and Transformer-based models such as BERT [94] and LayoutLM [502]. This is why we first prioritized the more fundamental challenge of decision-making under uncertainty (Part I), before taking a step back toward more applied DU research (Part II).

The research community’s understanding of ‘reliability’ has also evolved over time. When the work of Chapter 3 started, the notion of reliability was mostly associated with uncertainty quantification and calibration. However, calibration is not a panacea, and only fairly recently did Jaeger et al. [193] propose a more general framework encapsulating reliability and robustness. They promote the more concrete and useful notion of failure prediction, which still involves confidence/uncertainty estimation, yet with an explicit definition of the failure source that one wants to detect or guard against, e.g., in-domain test errors, changing input feature distributions, novel class shifts, etc. Since I share a similar view of the problem, I have focused subsequent work on the more general notion of failure prediction, which is also more in line with the business context of IA.

Whereas we originally intended to work on multi-task learning of DU subtasks, the rise of general-purpose LLMs, which offer a natural language interface to documents rather than discriminative modeling (e.g., ChatGPT [52, 344]), prompted us to evaluate this promising technology in the context of DU.
More importantly, we observed a lack of sufficiently complex datasets and benchmarks in DU that would allow us to tackle larger, more fundamental questions such as ‘Do text-only LLMs suffice for most low-level DU subtasks?’ (subsequently tackled in Chapter 5). This is why we shifted our focus to the more applied research questions of benchmarking and evaluation (Part II).

Finally, the business context has also evolved over time. Originally, IDP was practiced by legacy OCR companies; by specialized vendors offering a range of solutions for specific document types (e.g., invoices, contracts, tax forms, etc.); or by cloud service providers offering IDP as part of a larger suite of services (e.g., AWS Textract, Azure Form Recognizer, etc.). However, the rise of both open-source LLM development and powerful, though closed-source, models has lowered the barrier to entry for new entrants and incumbents alike. This has led to a commoditization of IDP, with the quality of the LLMs and the ease of integration with existing business processes becoming key differentiators.