V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Abstract
Humans process video reasoning with a sequential spatio-temporal logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence, neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as biases when generating answers. In this work, we introduce the Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying Chain-of-thought (CoT) logic. To support this evaluation, we construct a dataset that elicits the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains that mimic human cognition. Experiments with 14 Video-LLMs on V-STaR reveal significant gaps between current Video-LLMs and the need for robust and consistent spatio-temporal reasoning.
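As a way to picture the RSTR decomposition, a single benchmark sample can be thought of as one coarse "what" question plus fine-grained "when" and "where" sub-questions with grounded answers. The sketch below is illustrative only; the field names and values are assumptions, not the released V-STaR schema.

```python
# Illustrative only: a hypothetical RSTR-style sample.
# Field names and values are assumptions, not V-STaR's released format.
rstr_sample = {
    "video": "kitchen_0042.mp4",
    "chain_order": "what-when-where",  # or "what-where-when"
    "questions": [
        {"type": "what",  "q": "What does the person pick up from the counter?"},
        {"type": "when",  "q": "During which seconds does this happen?"},   # temporal grounding
        {"type": "where", "q": "Where is the object in those frames?"},     # spatial grounding
    ],
    "answers": {
        "what":  "a red mug",
        "when":  [12.0, 15.5],                # start/end time in seconds
        "where": [0.42, 0.31, 0.58, 0.57],    # normalized bounding box [x1, y1, x2, y2]
    },
}
```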
Community
📢 New Benchmark Release | V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
💡Key Innovations
V-STaR is the first benchmark explicitly designed to evaluate a Video-LLM's spatio-temporal reasoning ability when answering questions in the context of "when", "where", and "what", spanning:
- 9 video domains
- 2094 spatio-temporal reasoning samples
- 2 Reverse Spatio-Temporal Reasoning (RSTR) question chains: "what-when-where" or "what-where-when" (see the sketch after this list)
- A GitHub MLLM reasoning collection repository: Awesome-MLLM-Reasoning-Collection
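A rough sketch of how the two RSTR chain orders might be posed to a model is shown below. The `ask_video_llm` helper is a placeholder for whatever inference interface a given Video-LLM exposes; it is not part of the V-STaR repository.

```python
# Minimal sketch of querying a Video-LLM with the two RSTR chain orders.
# `ask_video_llm` is a hypothetical stub, not a V-STaR API.
def ask_video_llm(video_path: str, question: str, history: list[str]) -> str:
    raise NotImplementedError("plug in your Video-LLM inference call here")

CHAINS = {
    "what-when-where": ["what", "when", "where"],
    "what-where-when": ["what", "where", "when"],
}

def run_chain(video_path: str, questions: dict[str, str], order: str) -> dict[str, str]:
    """Ask the sub-questions in the given order, carrying earlier answers as context."""
    history, answers = [], {}
    for step in CHAINS[order]:
        answer = ask_video_llm(video_path, questions[step], history)
        history.append(f"Q: {questions[step]}\nA: {answer}")
        answers[step] = answer
    return answers
```

Comparing a model's answers across the two orders gives a simple consistency check: a model that truly grounds its reasoning should reach the same temporal and spatial answers regardless of which sub-question is asked first.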
V-STaR reveals a fundamental weakness in existing Video-LLMs regarding causal spatio-temporal reasoning and inspires research in improving trustworthy spatio-temporal understanding in future Video-LLMs.
👉Try it Now: GitHub | HuggingFace
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Towards Fine-Grained Video Question Answering (2025)
- VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity (2025)
- TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs (2025)
- Cross-modal Causal Relation Alignment for Video Question Grounding (2025)
- Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation (2025)
- SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding (2025)
- Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! (2025)