arxiv:2504.09130

VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

Published on Apr 12 · Submitted by LibraTree on Apr 15

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.

Community

Paper author · Paper submitter

šŸ“¢ VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

šŸ¤” Current LVLMs struggle with complex reasoning tasks such as multi-hop geometry problems. How can AI agents construct and use more helpful visual hints?

šŸ”‘ Key insight: when LVLMs reason, they need not only "WHAT to do" but also a mental model of "WHAT WILL HAPPEN after each action"! This predictive look-ahead gives LVLMs much stronger reasoning performance. #NextLevelAI šŸ¤–
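To make the insight concrete, here is a minimal sketch (not the authors' code) of a reasoning step that pairs each candidate action with a predicted outcome before committing to it. `lvlm`, `propose_steps`, and the prompt strings are hypothetical stand-ins: `lvlm` is any callable that sends a prompt to a vision-language model and returns text.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str             # "WHAT to do", e.g. "draw a line from A to the midpoint of BC"
    predicted_outcome: str  # "WHAT WILL HAPPEN": the model's forecast of the resulting state

def propose_steps(lvlm, state: str, k: int = 3) -> list[Step]:
    """Ask the model for k candidate actions, then ask it to predict the
    consequence of each one before anything is committed to the solution."""
    actions = lvlm(f"State: {state}\nPropose {k} distinct next actions, one per line.")
    steps = []
    for action in actions.splitlines()[:k]:
        outcome = lvlm(f"State: {state}\nAction: {action}\nPredict what will happen.")
        steps.append(Step(action=action, predicted_outcome=outcome))
    return steps
```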

Paper author · Paper submitter

Current methods focus either on visual-aided reasoning or on test-time scaling. Our VisuoThink framework combines both and introduces a look-ahead tree search mechanism; a rough sketch of the idea follows the figure below.
ęˆŖå±2025-04-12 16.59.18.png

Paper author · Paper submitter

By exploring different trajectories and predicting what will happen after each step, LVLMs construct more reliable auxiliary lines when solving geometry problems and perform better on spatial reasoning tasks.

ęˆŖå±2025-04-12 16.59.41.png


