R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization Paper • 2503.12937 • Published 28 days ago • 27
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published 29 days ago • 33
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing Paper • 2503.13434 • Published 28 days ago • 25
Personalize Anything for Free with Diffusion Transformer Paper • 2503.12590 • Published 29 days ago • 43
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control Paper • 2503.03751 • Published Mar 5 • 20
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models Paper • 2502.01061 • Published Feb 3 • 212
Gradio WebRTC Cookbook ⚡️ Collection • Collection of real-time voice and video demos built with the gradio-webrtc custom component • 8 items • Updated Dec 10, 2024 • 17
Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages Paper • 2412.09025 • Published Dec 12, 2024 • 4
VisionArena: 230K Real World User-VLM Conversations with Preference Labels Paper • 2412.08687 • Published Dec 11, 2024 • 13
🪐 SmolLM Collection • A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos • 12 items • Updated 14 days ago • 221
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models Paper • 2411.04905 • Published Nov 7, 2024 • 124
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases Paper • 2402.14905 • Published Feb 22, 2024 • 129
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Paper • 2410.23218 • Published Oct 30, 2024 • 51
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss Paper • 2410.17243 • Published Oct 22, 2024 • 94
VidPanos: Generative Panoramic Videos from Casual Panning Videos Paper • 2410.13832 • Published Oct 17, 2024 • 13