Qwen2.5-Omni Collection End-to-End Omni (text, audio, image, video, and natural speech interaction) model based on Qwen2.5 • 3 items • Updated about 1 hour ago • 51
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse Paper • 2503.16365 • Published 7 days ago • 34
JARVIS-VLA-v1 Collection Vision-Language-Action Models in Minecraft. • 4 items • Updated 5 days ago • 9
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Paper • 2503.11576 • Published 13 days ago • 75
Article Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM • 15 days ago • 348
Quality Estimation Collection SOTA Machine Translation Quality Estimation models • 5 items • Updated Jan 10, 2024 • 5
OLMoE (January 2025) Collection Improved OLMoE for iOS app. Read more: https://allenai.org/blog/olmoe-app • 10 items • Updated 14 days ago • 11
Step-Audio Collection Step-Audio model family, including Audio-Tokenizer, Audio-Chat and TTS • 3 items • Updated Feb 17 • 30
Moshi: a speech-text foundation model for real-time dialogue Paper • 2410.00037 • Published Sep 17, 2024 • 4
Hibiki fr-en Collection Hibiki is a model for streaming speech translation, which can run on device! See https://github.com/kyutai-labs/hibiki. • 5 items • Updated Feb 6 • 52
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 214
SmolLM2 Collection State-of-the-art compact LLMs for on-device applications: 1.7B, 360M, 135M • 16 items • Updated Feb 20 • 251