AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? (arXiv:2412.02611)
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation (arXiv:2412.02259)
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information (arXiv:2412.00947)
VLSBench: Unveiling Visual Leakage in Multimodal Safety (arXiv:2411.19939)
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters (arXiv:2412.00174)
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models (arXiv:2412.01822)
Open-Sora Plan: Open-Source Large Video Generation Model (arXiv:2412.00131)
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models (arXiv:2412.01824)
TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video (arXiv:2411.18671)