Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published 7 days ago • 60
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding Paper • 2412.10302 • Published 6 days ago • 5
Large Concept Models: Language Modeling in a Sentence Representation Space Paper • 2412.08821 • Published 8 days ago • 5
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published 13 days ago • 111
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints Paper • 2412.07760 • Published 9 days ago • 49
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions Paper • 2412.09596 • Published 7 days ago • 88
TRACE Collection TRACE: Temporal Grounding Video LLM via Causal Event Modeling • 10 items • Updated 8 days ago • 1