Abstract

Human communication is increasingly document-based, requiring machines to understand a wide variety of visually-rich documents to assist humans in their daily lives. Amid the digital evolution, documents continue to facilitate crucial human and organizational interactions but remain tethered to manual processing, causing inefficiency. We examine why organizations lag in adopting automated document processing solutions and outline two primary challenges: the complexity of processing long, multimodal documents algorithmically, and the need for reliability and control over the associated risks. Automated decision-making is key to improving the efficiency of document processing, but current state-of-the-art technology is not yet reliable and robust enough to be deployed in autonomous systems. The practical objective is to develop Intelligent Automation (IA) systems capable of estimating confidence in their actions, thereby increasing throughput without accruing additional costs due to errors. We analyze the key challenges and propose solutions to bridge the gap between research and practical applications, with a focus on realistic datasets and experimental methodologies. Building upon the foundations of Document Understanding (DU), this dissertation introduces advanced methodologies combining Machine Learning, Natural Language Processing, and Computer Vision. Addressing evident gaps in research, this work presents novel methods for predictive uncertainty quantification (PUQ) alongside practical frameworks for evaluating the robustness and reliability of DU technologies. The contribution culminates in the introduction of two novel multipage document classification datasets and a multifaceted benchmark, DUDE, designed to rigorously challenge and assess the state of the art in DU.
Extensive experiments across these datasets reveal that while advancements have been made, significant room for improvement remains, particularly in long-context modeling for multipage document processing and calibrated, selective document visual question answering. Efficient DU is also explored, revealing the effectiveness of