arxiv:2401.00908

DocLLM: A layout-aware generative language model for multimodal document understanding

Published on Dec 31, 2023

· Submitted by

akhaliq on Jan 3

#1 Paper of the day


Authors:

Dongsheng Wang ,

Mathieu Sibue ,

Zhiqiang Ma ,

Armineh Nourbakhsh ,

Xiaomo Liu

Abstract

Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar records often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focusing exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers into a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
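The attention decomposition mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `disentangled_attention`, the projection keys (`q_t`, `k_t`, `v_t`, `q_s`, `k_s`), the three mixing weights in `lambdas`, the causal mask, and the choice to compute values from the text stream only are all assumptions made for illustration. The idea shown is that text tokens and their bounding-box embeddings get separate query/key projections, and the attention score is a weighted sum of text-to-text, text-to-spatial, spatial-to-text, and spatial-to-spatial terms.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(scores):
    # Numerically stable softmax applied to each row.
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def disentangled_attention(text_h, box_h, w, lambdas=(1.0, 1.0, 1.0)):
    """Hypothetical single-head sketch.

    text_h: n x d text hidden states.
    box_h:  n x d bounding-box embeddings (one per token).
    w:      dict of d x d projection matrices (key names are assumptions).
    lambdas: weights for the text-spatial, spatial-text and
             spatial-spatial score terms (assumed hyperparameters).
    """
    q_t, k_t, v_t = (matmul(text_h, w[k]) for k in ("q_t", "k_t", "v_t"))
    q_s, k_s = matmul(box_h, w["q_s"]), matmul(box_h, w["k_s"])
    lam_ts, lam_st, lam_ss = lambdas

    # Four disentangled score matrices, each n x n.
    tt = matmul(q_t, transpose(k_t))  # text query, text key
    ts = matmul(q_t, transpose(k_s))  # text query, spatial key
    st = matmul(q_s, transpose(k_t))  # spatial query, text key
    ss = matmul(q_s, transpose(k_s))  # spatial query, spatial key

    n, d = len(text_h), len(q_t[0])
    scale = math.sqrt(d)
    scores = [
        [(tt[i][j] + lam_ts * ts[i][j] + lam_st * st[i][j] + lam_ss * ss[i][j]) / scale
         for j in range(n)]
        for i in range(n)
    ]

    # Assumed causal mask: token i may only attend to tokens j <= i.
    for i in range(n):
        for j in range(i + 1, n):
            scores[i][j] = float("-inf")

    weights = softmax_rows(scores)
    # Values come from the text stream only (an assumption of this sketch).
    return matmul(weights, v_t), weights
```

Setting all three `lambdas` to zero reduces the sketch to ordinary causal self-attention over the text, which makes the contribution of the layout terms easy to isolate when experimenting.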


Community


deleted

Jan 3

Sorry I dozed off when I heard "enterprise documents" but I'm awake again now. What about, for instance, service repair manuals? These are a lot more challenging to address than invoices for your NIPS hotel.

What 16 datasets? The link to the PDF doesn't work. Why are these 16 datasets considered SoA? Why are dataset benchmarks the only indicator of "progress" in NLP?

What is a "textual modality"? What is your model of "textual semantics"? What is "textual semantics"? Are you talking about natural language semantics? How do you integrate linguistic knowledge with "spatial modalities"?

sebasgar

Jan 3

Are code and model weights available for this model?

paulofinardi

Jan 4

Any plans to release a space demo?

shubh1608

Jan 4

Waiting for the model weights and model card to use it. If there's a plan, please release it soon.

esynergy

Jan 4

Hey folks, where is the model? How can we try it?

hontou-ni-baka

Jan 4

Does anyone know what OCR engine they used, or which one is best for commercial use? I feel like GPT-4 is bottlenecked by the OCR.

Prakyathkantharaju

Jan 4

Looks like they are using Tesseract.

smanes

Jan 6

Where can I access the code for this model?

jtdavies

Jan 8

edited Jan 9

I read the paper, then came to HuggingFace for the model, disappointed not to find it here.

doc2txt

Jan 8

model and code please?

oliviermills

Jan 9

Great tease.. model and code?

quantumJLBass

Jan 10

Tease... (っ °Д °;)っ

jordyvl

Jan 11

edited Jan 12

EDIT: discussed via email. The authors will update the arXiv paper to reflect that the current results are on the validation set, and they will make an effort to add their results to the public leaderboard for the test set ^^

Hi, main author of the DUDE dataset here. How did you generate results on DUDE without submitting predictions for evaluation on the test set to the RRC platform? (https://rrc.cvc.uab.es/?ch=23&com=evaluation&task=1)

Similarly, how did you generate the results for GPT-4?

If you used the validation set for evaluation, please contact me for help in getting unbiased results on the test set ;)



Joon2023

Jan 20

The reference baseline is too weak, so the comparison does not show that this paper is the better one.

nhkhoi91

Jan 25

so still no code yet?

josieldelgadillo

Feb 9

Any updates on when this model will become available?

deleted

Feb 9

C'mon Big Banking! The people have spoken! Give the people what they want! Give them the modalities!

JinghuiLuAstronaut

Mar 13

I have reimplemented the model architecture on top of baichuan2-7b; it is available at https://huggingface.co/JinghuiLuAstronaut/DocLLM_baichuan2_7b. However, the newly added parameters are randomly initialized, so you will need to continue pre-training or fine-tune it.