arxiv:2309.08872

PDFTriage: Question Answering over Long, Structured Documents

Published on Sep 16, 2023
· Featured in Daily Papers on Sep 19, 2023

Abstract

Large Language Models (LLMs) struggle with document question answering (QA) when the document does not fit within the LLM's limited context length. To overcome this issue, most existing works focus on retrieving the relevant context from the document and representing it as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these documents as richly structured. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents, spanning 10 different categories of question types for document QA.

Community

Interesting approach!

Dear authors,
Any chance you could share a link to a dataset that is mentioned in the abstract?

Sorry about that, we are waiting for permission to release it.

Thanks for the speedy response!
Please let us know once you get permission.
