Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Abstract
Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.
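The cognitive-map finding suggests a simple two-stage prompting recipe: first ask the model to lay the observed objects out on a coarse grid, then answer the spatial question conditioned on that map. Below is a minimal, hypothetical Python sketch of this idea; the prompt wording, the grid size, and the `query_mllm` callable are illustrative assumptions, not the paper's exact protocol, and in practice the video frames would be attached to the underlying MLLM API call that `query_mllm` abstracts away.

```python
# Hypothetical sketch of two-stage "cognitive map" prompting for spatial QA.
# Prompt text, grid size, and the query_mllm callable are assumptions.
from typing import Callable, List

GRID_SIZE = 10  # assumed coarse grid resolution for the cognitive map


def build_map_prompt(objects: List[str]) -> str:
    """Ask the model to place the observed objects on a coarse 2D grid."""
    return (
        "You have just watched a video walkthrough of a room.\n"
        f"Place the following objects on a {GRID_SIZE}x{GRID_SIZE} grid that "
        "represents the room, giving (x, y) coordinates for each: "
        + ", ".join(objects)
    )


def build_answer_prompt(cognitive_map: str, question: str) -> str:
    """Ask the spatial question, conditioning on the model's own map."""
    return (
        "Using this cognitive map of the room:\n"
        f"{cognitive_map}\n\n"
        f"Answer the question: {question}"
    )


def answer_with_cognitive_map(
    query_mllm: Callable[[str], str],  # wraps whatever MLLM API is in use
    objects: List[str],
    question: str,
) -> str:
    # Stage 1: the model externalizes its spatial memory as a map.
    cog_map = query_mllm(build_map_prompt(objects))
    # Stage 2: the model answers while attending to its own map.
    return query_mllm(build_answer_prompt(cog_map, question))


if __name__ == "__main__":
    # Echo stub stands in for a real MLLM call so the sketch runs as-is.
    fake_mllm = lambda prompt: f"[model response to: {prompt[:40]}...]"
    print(answer_with_cognitive_map(
        fake_mllm,
        ["sofa", "TV", "lamp"],
        "How far is the sofa from the TV?",
    ))
```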
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination (2024)
- An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models (2024)
- SAT: Spatial Aptitude Training for Multimodal Language Models (2024)
- EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios (2024)
- HourVideo: 1-Hour Video-Language Understanding (2024)
- Perception Tokens Enhance Visual Reasoning in Multimodal Language Models (2024)
- MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (2024)