InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions • Paper • 2412.09596 • Published 18 days ago • 92
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction • Paper • 2412.04454 • Published 25 days ago • 53
VisionZip: Longer is Better but Not Necessary in Vision Language Models • Paper • 2412.04467 • Published 25 days ago • 104
STIV: Scalable Text and Image Conditioned Video Generation • Paper • 2412.07730 • Published 20 days ago • 70
Apollo: An Exploration of Video Understanding in Large Multimodal Models • Paper • 2412.10360 • Published 17 days ago • 132
Evaluating Language Models as Synthetic Data Generators • Paper • 2412.03679 • Published 26 days ago • 43
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection • Paper • 2412.04455 • Published 25 days ago • 35
ProcessBench: Identifying Process Errors in Mathematical Reasoning • Paper • 2412.06559 • Published 21 days ago • 69
Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation • Paper • 2412.06531 • Published 21 days ago • 71
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling • Paper • 2412.05271 • Published 24 days ago • 121
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models • Paper • 2412.01824 • Published 28 days ago • 65
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation • Paper • 2412.07589 • Published 20 days ago • 45
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective • Paper • 2410.23743 • Published Oct 31 • 59