---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- multimodal
- aria
---
# Aria
🔗 Try Aria! · 📖 Blog · 📌 Paper · 🖤 GitHub · 💜 Discord · 💙 Twitter
## Highlights
- Aria is the first open multimodal native MoE model, capable of seamlessly handling various input modalities within a MoE architecture.
- Aria performs on par with GPT-4o mini and Gemini 1.5 Flash across a range of multimodal tasks while maintaining strong performance on text-only tasks.
- Compared to models of similar or even larger size, Aria is faster and cheaper to run. This efficiency stems from activating only 3.9B parameters during inference, the fewest among models with comparable performance.
## Key features
- Robust multimodal understanding: Aria processes diverse input modalities, including video, images, code, and text, and demonstrates strong performance on downstream tasks such as long-context video and image understanding and OCR. It also excels at instruction following.
- Flexible image handling: Aria supports variable image sizes and aspect ratios while maintaining high quality.
- Extended context capacity: Aria can manage multiple images within a long context window of 64k tokens.
- Advanced text understanding: Aria demonstrates competitive performance across language and coding tasks.
## Model Info
| Model | Download | Parameters | Context Length |
|---|---|---|---|
| Aria | < HF link - TBD> | • Activation: 3.9B (3.5B MoE + 0.4B Visual Encoder)<br>• Total: 25.3B | 64K |
## Benchmark
## Quick Start
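A full Quick Start will follow once the download link above is finalized. In the meantime, here is a minimal, runnable sketch of the multimodal chat-message payload that Transformers processors commonly consume via `apply_chat_template`; the helper name and exact message schema are assumptions, not the official Aria example.

```python
# Hedged sketch: the chat-message structure typically expected by
# multimodal Transformers processors. The official Aria Quick Start
# may differ; treat the names and schema here as assumptions.

def build_messages(question: str, n_images: int = 1) -> list:
    """Build a single-turn user message that interleaves image
    placeholders with a text question (common Transformers chat format)."""
    content = [{"type": "image"} for _ in range(n_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

messages = build_messages("Describe this image.")
print(messages)
```

With an actual checkpoint, a list like this would be formatted with `processor.apply_chat_template(messages, add_generation_prompt=True)` and passed, together with the images, to a model loaded via the standard `AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)` pattern.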
## License
This repo is released under the Apache 2.0 License.