---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- multimodal
- aria
---
# Aria
🔗 Try Aria! · 📖 Blog · 📌 Paper · 🖤 GitHub · 💜 Discord · 💙 Twitter
## Highlights
- Aria is the first open multimodal native MoE model, capable of seamlessly handling various input modalities within a MoE architecture.
- Aria performs on par with GPT-4o mini and Gemini 1.5 Flash across a range of multimodal tasks while maintaining strong performance on text-only tasks.
- Compared to models of similar or even larger size, Aria is faster and cheaper to run. This efficiency stems from activating only 3.9B parameters during inference, the fewest among models with comparable performance.
## Key features
- Robust multimodal understanding: Aria processes diverse input modalities, including video, images, code, and text, and demonstrates strong performance on downstream tasks such as long-context video and image understanding and OCR. It also excels at instruction following.
- Flexible image handling: Aria supports variable image sizes and aspect ratios while maintaining high quality.
- Extended context capacity: Aria can manage multiple images within a long context window of 64k tokens.
- Advanced text understanding: Aria demonstrates competitive performance across language and coding tasks.
## Model Info
| Model | Download | Parameters | Context Length |
|---|---|---|---|
| Aria | < HF link - TBD> | • Activation: 3.9B (3.5B MoE + 0.4B Visual Encoder)<br>• Total: 25.3B | 64K |
## Benchmark
## Quick Start
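A full Quick Start will follow once the download link above is finalized. In the meantime, here is a minimal, runnable sketch of the multimodal chat-message payload that Transformers processors commonly consume via `apply_chat_template`; the helper name and exact message schema are assumptions, not the official Aria example.

```python
# Hedged sketch: the chat-message structure typically expected by
# multimodal Transformers processors. The official Aria Quick Start
# may differ; treat the names and schema here as assumptions.

def build_messages(question: str, n_images: int = 1) -> list:
    """Build a single-turn user message that interleaves image
    placeholders with a text question (common Transformers chat format)."""
    content = [{"type": "image"} for _ in range(n_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

messages = build_messages("Describe this image.")
print(messages)
```

With an actual checkpoint, a list like this would be formatted with `processor.apply_chat_template(messages, add_generation_prompt=True)` and passed, together with the images, to a model loaded via the standard `AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)` pattern.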
## License
This repo is released under the Apache 2.0 License.