Qwen2.5 Omni 7B Demo 🏆 Generate text and speech responses from text, images, or audio input • Running • 273
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Paper • 2503.11576 • Published Mar 14 • 98