arxiv:2402.14767

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models

Published on Feb 22, 2024

Abstract

We present DualFocus, a novel framework that integrates macro and micro perspectives within multi-modal large language models (MLLMs) to enhance performance on vision-language tasks. Current MLLMs typically focus on inputs at a single predefined resolution, which leaves them deficient on detailed questions about local regions. We introduce a DualFocus mechanism in which the model first examines the image from a macro perspective, answers the question, and identifies suitable sub-regions to zoom into for subsequent micro-perspective analysis. By integrating the answers from both the macro and micro perspectives, the model can address tasks that require global, detailed, or combined consideration. To endow MLLMs with the DualFocus mechanism, we curated a tailored dataset derived from Visual Genome (VG) and adapted it to the DualFocus training regimen. Through comparative studies across model sizes and benchmarks, we demonstrate DualFocus's superiority in balancing detailed examination with holistic insight, significantly reducing hallucinations in MLLMs and improving their performance on various vision-language tasks.
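The abstract describes a two-pass inference procedure: a macro pass over the full image, selection of a question-relevant sub-region, a micro pass over the zoomed crop, and integration of the two answers. The sketch below illustrates that control flow only; the model-facing functions (`mllm_answer`, `mllm_propose_region`, `merge_answers`) are hypothetical placeholders and not the paper's released API.

```python
# Minimal sketch of the DualFocus inference loop as described in the abstract.
# Only the macro -> zoom -> micro -> merge control flow follows the paper;
# every model-facing function here is a hypothetical stand-in.

from PIL import Image


def mllm_answer(image: Image.Image, question: str) -> str:
    """Hypothetical placeholder: a multi-modal LLM answers a question about an image."""
    raise NotImplementedError


def mllm_propose_region(image: Image.Image, question: str) -> tuple[int, int, int, int]:
    """Hypothetical placeholder: the model picks a sub-region
    (left, top, right, bottom) worth zooming into for this question."""
    raise NotImplementedError


def merge_answers(macro: str, micro: str) -> str:
    """Hypothetical placeholder: integrate the macro and micro answers
    into a final response (the paper's exact merge criterion is not
    reproduced here)."""
    raise NotImplementedError


def dualfocus_answer(image: Image.Image, question: str) -> str:
    # Macro pass: reason over the full image at the model's input resolution.
    macro_answer = mllm_answer(image, question)

    # The model identifies a question-relevant sub-region to inspect closely.
    box = mllm_propose_region(image, question)

    # Micro pass: crop the zoomed region and answer the question again.
    zoomed = image.crop(box)
    micro_answer = mllm_answer(zoomed, question)

    # Integrate both perspectives into the final answer.
    return merge_answers(macro_answer, micro_answer)
```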
