Abstract
Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions for regions of varying granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce the URECA dataset, a large-scale dataset tailored to multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, the URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building on this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA preserves essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on the URECA dataset and generalizes well to existing region-level captioning benchmarks.
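For readers who want a concrete picture of what "encoding a region mask while preserving position and shape" can look like in practice, below is a minimal PyTorch sketch. It is not the paper's implementation: the module name, patch size, and hidden dimensions are illustrative assumptions, and the actual URECA mask encoder and dynamic mask modeling are described in the paper and code repository linked below.

```python
import torch
import torch.nn as nn

class MaskRegionEncoder(nn.Module):
    """Illustrative sketch: turn a binary region mask into token embeddings
    that retain spatial position/shape, projected to the LLM hidden size."""
    def __init__(self, mask_res=256, patch=16, embed_dim=256, llm_dim=4096):
        super().__init__()
        # Patchify the mask so each token corresponds to a spatial location.
        self.patchify = nn.Conv2d(1, embed_dim, kernel_size=patch, stride=patch)
        num_patches = (mask_res // patch) ** 2
        # Learned positional embeddings keep the region's location explicit.
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # Project into the LLM embedding space so tokens can be prepended to text.
        self.proj = nn.Linear(embed_dim, llm_dim)

    def forward(self, mask):                    # mask: (B, 1, H, W), values in {0, 1}
        x = self.patchify(mask.float())         # (B, embed_dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)        # (B, num_patches, embed_dim)
        x = x + self.pos_emb                    # inject spatial position information
        return self.proj(x)                     # (B, num_patches, llm_dim) region tokens

# Hypothetical usage: region tokens would be concatenated with text embeddings
# before being fed to the MLLM's language backbone.
encoder = MaskRegionEncoder()
mask = torch.zeros(1, 1, 256, 256)
mask[..., 64:160, 80:200] = 1.0                 # toy rectangular region
region_tokens = encoder(mask)                   # (1, 256, 4096)
```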
Community
Project page: https://cvlab-kaist.github.io/URECA/
Code: https://github.com/cvlab-kaist/URECA
ArXiv: https://arxiv.org/abs/2504.05305
That's a great name 👍
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs (2025)
- Image Embedding Sampling Method for Diverse Captioning (2025)
- Fine-Grained Video Captioning through Scene Graph Consolidation (2025)
- GOAL: Global-local Object Alignment Learning (2025)
- Large-scale Pre-training for Grounded Video Caption Generation (2025)
- GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding (2025)
- OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning (2025)