ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 86
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning Paper • 2503.13444 • Published 10 days ago • 13
Edit Transfer: Learning Image Editing via Vision In-Context Relations Paper • 2503.13327 • Published 10 days ago • 25
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary Paper • 2503.09402 • Published 15 days ago • 6
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer Paper • 2503.07027 • Published 17 days ago • 26
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles Paper • 2503.03651 • Published 22 days ago • 16
Cosmos World Foundation Model Platform for Physical AI Paper • 2501.03575 • Published Jan 7 • 75
Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models Paper • 2503.01774 • Published 24 days ago • 41
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data Paper • 2502.14397 • Published Feb 20 • 38