TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper • 2410.23266 • Published Oct 30, 2024 • 19
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss Paper • 2410.17243 • Published Oct 22, 2024 • 88
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective Paper • 2410.12490 • Published Oct 16, 2024 • 8
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio Paper • 2410.12787 • Published Oct 16, 2024 • 30
A Controlled Study on Long Context Extension and Generalization in LLMs Paper • 2409.12181 • Published Sep 18, 2024 • 43
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages Paper • 2407.19672 • Published Jul 29, 2024 • 55
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination Paper • 2406.05132 • Published Jun 7, 2024 • 27
What If We Recaption Billions of Web Images with LLaMA-3? Paper • 2406.08478 • Published Jun 12, 2024 • 39
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 11, 2024 • 32
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization Paper • 2402.03161 • Published Feb 5, 2024 • 14
VideoPoet: A Large Language Model for Zero-Shot Video Generation Paper • 2312.14125 • Published Dec 21, 2023 • 44
Reasons to Reject? Aligning Language Models with Judgments Paper • 2312.14591 • Published Dec 22, 2023 • 17