VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Paper • 2503.10582 • Published about 17 hours ago • 9
Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models Paper • 2503.09669 • Published 1 day ago • 25
CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing Paper • 2503.10613 • Published about 16 hours ago • 32
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary Paper • 2503.09402 • Published 2 days ago • 6
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM Paper • 2503.04724 • Published 8 days ago • 60
A Deepdive into Aya Vision: Advancing the Frontier of Multilingual Multimodality Article • Published 10 days ago • 65
C4AI Aya Vision Collection Aya Vision is a state-of-the-art family of vision models that brings multimodal capabilities to 23 languages. • 5 items • Updated 10 days ago • 63
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models Paper • 2502.16033 • Published 20 days ago • 16
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published 22 days ago • 129
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Paper • 2502.11089 • Published 26 days ago • 142
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 203