Towards Visual Text Grounding of Multimodal Large Language Model
Abstract
Despite the rapid evolution of Multimodal Large Language Models (MLLMs), a non-negligible limitation remains in their struggle with visual text grounding, especially in text-rich document images. Document images, such as scanned forms and infographics, pose critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding in natural images rather than text-rich document images. To bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90k synthetic examples based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods, based on general instruction tuning and plug-and-play efficient embedding, respectively. Finetuning MLLMs on our synthetic dataset promisingly improves their spatial reasoning and grounding capabilities.
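To make the grounding evaluation concrete, below is a minimal sketch of how one might score a model on such a benchmark: each QA pair carries an annotated evidence bounding box, and a prediction counts as correct when its box overlaps the annotation above an IoU threshold. The field names (`gt_box`, `pred_box`), the (x1, y1, x2, y2) pixel convention, and the 0.5 threshold are assumptions for illustration, not the paper's exact protocol.

```python
# Illustrative sketch only: IoU-based grounding accuracy for a set of
# (ground-truth box, predicted box) pairs. Box format assumed: (x1, y1, x2, y2).

def iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(samples, threshold=0.5):
    """Fraction of samples whose predicted box matches the annotated
    evidence box with IoU at or above the threshold."""
    if not samples:
        return 0.0
    hits = sum(iou(s["pred_box"], s["gt_box"]) >= threshold for s in samples)
    return hits / len(samples)

# Example: one well-localized and one off-target prediction -> accuracy 0.5
samples = [
    {"gt_box": (40, 120, 360, 160), "pred_box": (42, 118, 355, 162)},
    {"gt_box": (40, 300, 360, 340), "pred_box": (500, 60, 700, 100)},
]
print(grounding_accuracy(samples))  # 0.5
```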
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding (2025)
- PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks (2025)
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? (2025)
- A Token-level Text Image Foundation Model for Document Understanding (2025)
- Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review (2025)
- QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding (2025)
- MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs (2025)