VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning Paper • 2507.13348 • Published about 24 hours ago • 49
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning Paper • 2507.12508 • Published 2 days ago • 14
EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes Paper • 2507.11407 • Published 3 days ago • 40
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models Paper • 2507.07104 • Published 9 days ago • 35
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Paper • 2507.07999 • Published 8 days ago • 44
Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective Paper • 2507.08801 • Published 7 days ago • 28
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs Paper • 2507.07990 • Published 8 days ago • 44
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding Paper • 2507.07984 • Published 8 days ago • 36
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling Paper • 2507.07982 • Published 8 days ago • 30
VideoPrism: A Foundational Visual Encoder for Video Understanding Paper • 2402.13217 • Published Feb 20, 2024 • 35
StreamDiT: Real-Time Streaming Text-to-Video Generation Paper • 2507.03745 • Published 14 days ago • 28
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks Paper • 2507.01955 • Published 16 days ago • 34
Energy-Based Transformers are Scalable Learners and Thinkers Paper • 2507.02092 • Published 16 days ago • 55
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective Paper • 2507.01925 • Published 16 days ago • 32