Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities Paper • 2505.02567 • Published 18 days ago • 71
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models Paper • 2504.10449 • Published Apr 14 • 12
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search Paper • 2504.08066 • Published Apr 10 • 11
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search Paper • 2504.09130 • Published Apr 12 • 12
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning Paper • 2504.09641 • Published Apr 13 • 16
Breaking the Data Barrier -- Building GUI Agents Through Task Generalization Paper • 2504.10127 • Published Apr 14 • 17
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model Paper • 2504.10068 • Published Apr 14 • 30
Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability Paper • 2504.08003 • Published Apr 9 • 49
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning Paper • 2504.08837 • Published Apr 10 • 42
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published Apr 14 • 259
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought Paper • 2504.05599 • Published Apr 8 • 81
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement Paper • 2504.07934 • Published Apr 10 • 18
Towards Visual Text Grounding of Multimodal Large Language Model Paper • 2504.04974 • Published Apr 7 • 16
Scaling Laws for Native Multimodal Models Paper • 2504.07951 • Published Apr 10 • 27
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning Paper • 2504.07128 • Published Apr 2 • 84