---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- multimodal
- aria
---
# Aria

🔗 Try Aria! · 📖 Blog · 📌 Paper · 🖤 GitHub · 💜 Discord · 💙 Twitter
# Highlights

- Aria is the **first open multimodal native MoE** model, capable of seamlessly handling various input modalities within a MoE architecture.
- Aria performs **on par with GPT-4o mini and Gemini 1.5 Flash** across a range of multimodal tasks while maintaining strong performance on **text**-only tasks.
- Compared to similar or even larger models, Aria offers **faster speeds** and **lower costs**. This high efficiency stems from activating only 3.9B parameters during inference – the **fewest** among models with comparable performance.

# Key features

- **Robust multimodal understanding**: Aria processes various input modalities, including video, images, code, and text. It demonstrates strong performance across diverse downstream tasks such as long-context video and image understanding, as well as OCR. Moreover, it excels at instruction following.
- **Flexible image handling**: Aria supports variable image sizes and aspect ratios while maintaining high quality.
- **Extended context capacity**: Aria can manage multiple images within a long context window of 64k tokens.
- **Advanced text understanding**: Aria demonstrates competitive performance across language and coding tasks.

# Model Info

| Model | Download | Parameters | Context Length |
| :---- | :------- | :--------- | :------------- |
| Aria | < HF link - TBD> | Activation: 3.9B (3.5B MoE + 0.4B Visual Encoder) | 64K |
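Aria's internal routing details are not specified in this card. Purely as an illustration of why a MoE model activates far fewer parameters than it stores, here is a minimal top-k gating sketch in NumPy: the router scores every expert per token but only the top-k selected experts are actually executed. All names (`moe_forward`, `gate_w`, `experts`) are hypothetical and not part of Aria's API.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Illustrative sparse MoE layer: each token runs only its top_k experts,
    so the activated parameter count is a small fraction of the total."""
    logits = x @ gate_w                               # (tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]     # indices of the top_k experts per token
    sel = np.take_along_axis(logits, top, axis=-1)    # scores of only the selected experts
    # softmax over just the selected scores -> mixture weights summing to 1
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                       # the other experts are never evaluated
        for k in range(top_k):
            out[t] += w[t, k] * experts[top[t, k]](x[t])
    return out
```

With `top_k=2` out of, say, 6 experts, only a third of the expert parameters participate in any forward pass, which is the mechanism behind Aria's 3.9B activated parameters.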