mlx-community/LocateAnything-3B-8bit Image-Text-to-Text • 1B • Updated 30 days ago • 1.43k • 8
unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit Image-Text-to-Text • 9B • Updated Oct 31, 2025 • 21.1k • 22
google/siglip2-base-patch16-512 Zero-Shot Image Classification • 0.4B • Updated Feb 21, 2025 • 115k • 47
microsoft/Phi-4-multimodal-instruct Automatic Speech Recognition • 6B • Updated Dec 10, 2025 • 509k • 1.61k
microsoft/Phi-3.5-vision-instruct Image-Text-to-Text • 4B • Updated Dec 10, 2025 • 1.47M • 736
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments Paper • 2605.30280 • Published May 28 • 146
huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated Image-Text-to-Text • 4B • Updated Dec 15, 2025 • 11.1k • 75
meta-llama/Llama-4-Scout-17B-16E-Instruct Image-Text-to-Text • 109B • Updated May 22, 2025 • 732k • 1.31k