CDIP [252]), or are restricted to a single domain or a small set of document
types.
We posit that larger, fundamental questions in DU remain unanswered due to a
lack of sufficiently complex datasets and benchmarks with a rich methodology
covering evaluation beyond the independent and identically distributed (i.i.d.)
test set setting. While there exist performant models for DU subtasks such
as OCR, DC, KIE, etc., it is unclear how to move from these specific analysis
and recognition tasks to models that can reason about and understand documents. A
truly end-to-end DU solution must handle the complexity and variety of
real-world documents and subtasks, which could be expressed as natural language
questions. Moreover, it should be able to generalize to any question on any
document and reason over multiple pages and modalities.
The following research questions are addressed in Chapters 4 and 5:
RQ 6. How can we iteratively close the gap between research and practice in DU?
RQ 7. How can we design a resource that comprehensively challenges the state-of-the-art?
RQ 8. Which DU aspects are most challenging for current state-of-the-art LLMs?
How can these be incorporated in a benchmark to allow proper measurements
of future improvements?
However, moving the goalposts beyond a single-page context inevitably requires
us to reconsider the research challenge of efficiency in DU. The rise of LLMs
has enabled a new generation of DU pipelines, which are more flexible and
easier to maintain than separate and specialized subtask modules, but also
more computationally demanding. Importantly, most LLMs are not designed
to handle the multimodality and long context windows of multipage documents,
and are often unaware of the visual and layout semantics of documents.
The research questions for Chapter 6 address the efficiency challenge in DU:
RQ 9. How can we efficiently infuse LLMs with semantic layout awareness for
more focused information extraction?
RQ 10. To what degree can model compression resolve the problem of efficiency
in processing documents?