MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
Abstract
MemoBench is a diagnostic benchmark for evaluating the memory consistency of video generation models in dynamically changing environments, where objects disappear from view and reappear in updated states.
Video generation models aspire to simulate dynamic environments, and several benchmarks now evaluate memory consistency across frames. However, most assess consistency only while the target remains in view, and the few that force objects out of view evaluate static scenes where nothing changes during occlusion. To bridge this gap, we introduce MemoBench, a diagnostic benchmark built around the disappear-and-reappear paradigm in dynamically changing environments: a target object undergoes a physical process, disappears from view, and must be correctly recovered in its updated state upon reappearance. We curate 360 ground-truth clips spanning synthetic and real-world scenes, and design an evaluation suite combining automated metrics with VQA-based assessment across four diagnostic pillars. Evaluation of eight state-of-the-art models reveals key insights and open challenges regarding memory consistency under the disappear-and-reappear paradigm.
Community
MemoBench is a diagnostic benchmark for visual memory in dynamic world modeling.
It asks a simple question: when an object disappears from view while undergoing a physical process, can a video/world model recover its updated state when it reappears?
We curate 360 ground-truth clips across synthetic and real-world scenes, and evaluate state-of-the-art models with automated metrics and VQA-based assessment. Our results show that current models can often generate visually coherent videos, but still struggle to preserve and update object states during occlusion.
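To make the disappear-and-reappear question concrete, here is a minimal sketch of how one might locate the occlusion window in a clip and score a model's state estimate at the moment of reappearance. Everything in it is an assumption for illustration: the `ClipAnnotation` schema, the scalar `state` value (for example, a fill level), and the `reappearance_error` score are hypothetical and are not MemoBench's actual data format or metrics.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClipAnnotation:
    """Hypothetical per-frame annotation for one clip (not MemoBench's schema)."""
    visible: List[bool]   # whether the target is in view at each frame
    state: List[float]    # assumed ground-truth scalar state, e.g. fill level


def occlusion_window(visible: List[bool]) -> Optional[Tuple[int, int]]:
    """Return (first_hidden_frame, first_reappeared_frame), or None.

    The window is the first run of hidden frames that is both preceded
    and followed by a frame in which the target is visible.
    """
    start = None
    for i, v in enumerate(visible):
        if start is None:
            # A window opens only when a visible frame is followed by a hidden one.
            if not v and i > 0 and visible[i - 1]:
                start = i
        elif v:
            return start, i
    return None


def reappearance_error(gt: ClipAnnotation,
                       predicted_state: List[float]) -> Optional[float]:
    """Absolute state error at the first frame after the target reappears."""
    window = occlusion_window(gt.visible)
    if window is None:
        return None
    _, back = window
    return abs(predicted_state[back] - gt.state[back])


# Toy clip: the target is hidden for frames 2-3 while its state keeps changing.
gt = ClipAnnotation(
    visible=[True, True, False, False, True, True],
    state=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
)
frozen = [0.0, 0.2, 0.2, 0.2, 0.2, 0.2]   # model "forgets" to update during occlusion
updated = list(gt.state)                   # model tracks the process correctly

frozen_err = reappearance_error(gt, frozen)
updated_err = reappearance_error(gt, updated)
```

In this toy case the frozen prediction is penalized at reappearance while the updated one is not, which is the failure mode the benchmark's question targets. A static-scene benchmark would not separate the two, because the state would be the same before and after occlusion.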
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MBench: A Comprehensive Benchmark on Memory Capability for Video World Models (2026)
- Rethinking Object-Centric Representations for Video Dynamics Modeling (2026)
- CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models (2026)
- WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation (2026)
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models (2026)
- WorldOlympiad: Can Your World Model Survive a Triathlon? (2026)
- PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media (2026)