Qwen2.5 Omni 7B Demo 🏆 Generate text and speech from input text, audio, images, or video
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research Paper • 2503.13399 • Published 9 days ago • 20
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing Paper • 2503.13434 • Published 9 days ago • 24
Edit Transfer: Learning Image Editing via Vision In-Context Relations Paper • 2503.13327 • Published 9 days ago • 25
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization Paper • 2503.12937 • Published 10 days ago • 26
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published 10 days ago • 30
Personalize Anything for Free with Diffusion Transformer Paper • 2503.12590 • Published 10 days ago • 41
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially? Paper • 2503.12349 • Published 11 days ago • 40
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models Paper • 2503.12885 • Published 10 days ago • 41
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills Paper • 2503.12533 • Published 11 days ago • 60
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation Paper • 2503.06053 • Published 19 days ago • 95
Frac-Connections: Fractional Extension of Hyper-Connections Paper • 2503.14125 • Published 9 days ago • 19
AudioX: Diffusion Transformer for Anything-to-Audio Generation Paper • 2503.10522 • Published 13 days ago • 21
Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation Paper • 2503.13424 • Published 9 days ago • 26
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era Paper • 2503.12329 • Published 11 days ago • 24