Submitted by jiuhai · 47 · BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset · 13 authors · 2
Submitted by xiaomoguhzz · 35 · DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception · 6 authors · 2
Submitted by nielsr · 25 · Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures · 15 authors · 2
Submitted by toshas · 13 · Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis · 8 authors · 1
Submitted by HanjungKim · 12 · UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations · 6 authors · 1
Submitted by akhaliq · 6 · CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image · 9 authors · 2
Submitted by novateur · 5 · WavReward: Spoken Dialogue Models With Generalist Reward Evaluators · 14 authors · 2
Submitted by NadMag · 4 · LightLab: Controlling Light Sources in Images with Diffusion Models · 7 authors · 2
Submitted by pritamqu · 4 · VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models · 2 authors · 1
Submitted by kailassrt · 2 · DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition · 11 authors · 1
Submitted by scikkk · 1 · MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning · 11 authors · 1
Submitted by JadeCheng · 1 · Visually Interpretable Subtask Reasoning for Visual Question Answering · 3 authors · 1
Submitted by kkr5155 · 1 · Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA · 4 authors · 1
Submitted by peihaowang · 1 · Steepest Descent Density Control for Compact 3D Gaussian Splatting · 11 authors · 1