---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- aria
base_model:
- rhymes-ai/Aria-Base-64K
---
# Aria-Chat Model Card
## Key features
- **Especially Optimized for Multimodal Chat**: Aria-Chat is optimized for open-ended and multi-round dialogue. We hope this version provides a seamless open-source multimodal chat experience.
- **Improved Reliability**: We have improved reliability when generating long outputs, reducing the likelihood of previously reported failure cases such as incomplete responses on Markdown tables or endless responses on list-style outputs.
- **Better Multilingual Abilities**: We have improved performance in non-English scenarios (Chinese, Spanish, French, Japanese, *etc.*), covering both multilingual OCR and multilingual dialogue.
<p align="center">
🔗 <a href="https://rhymes.ai/" target="_blank"> Try Aria!</a> · 📖 <a href="https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model" target="_blank">Blog</a> · 📌 <a href="https://arxiv.org/pdf/2410.05993" target="_blank">Paper</a>
· ⭐ <a href="https://github.com/rhymes-ai/Aria" target="_blank">GitHub</a> · 🟣 <a href="https://discord.com/invite/u8HxU23myj" target="_blank"> Discord </a>
</p>
## Benchmark
This checkpoint is designed not for benchmarks but for real-world, open-ended applications. To that end, we evaluated it on WildVision-Bench and observed non-trivial improvements:
| Model | Score |
|---------------------------|---------|
| gpt-4o | 89.15 |
| **Aria-Chat** | **81.3** |
| gpt-4-vision-preview | 79.78 |
| Aria | 74.1 |
| Reka-Flash | 64.65 |
| claude-3-opus-20240229 | 62.03 |
| yi-vl-plus | 55.05 |
| liuhaotian/llava-v1.6-34b | 51.89 |
| claude-3-sonnet-20240229 | 50.0 |
| claude-3-haiku-20240307 | 37.83 |
## Quick Start
### Installation
```bash
pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
pip install flash-attn --no-build-isolation
# Optional: install grouped-gemm for better inference performance (building may take 3-5 minutes)
pip install grouped_gemm==0.1.6
```
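Before loading the model, it can help to confirm the environment. The snippet below is a minimal sanity check, not part of the official setup: it verifies that CUDA is visible to PyTorch and that the optional flash-attn wheel imports cleanly.
```python
# Minimal environment sanity check (optional).
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # installed above with --no-build-isolation
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; the model will use the default attention implementation")
```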
### Inference
Aria has 25.3B total parameters and can be loaded on a single A100 (80GB) GPU with bfloat16 precision.
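As a back-of-the-envelope check (activations and the KV cache add overhead on top of the weights):
```python
# Weight memory for 25.3B parameters at 2 bytes each (bfloat16).
total_params = 25.3e9
print(f"{total_params * 2 / 1024**3:.1f} GiB")  # ~47.1 GiB of weights, leaving headroom on an 80 GB A100
```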
Here is a code snippet showing how to use Aria-Chat.
```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "rhymes-ai/Aria-Chat"

# Load the model in bfloat16 and shard it across available GPUs.
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

# Fetch an example image.
image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
image = Image.open(requests.get(image_path, stream=True).raw)

# A single user turn: one image placeholder followed by a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]

# Render the chat template and pack text + image into model inputs.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
    # Decode only the newly generated tokens (strip the prompt).
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)

print(result)
```
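Since Aria-Chat is tuned for multi-round dialogue, you can continue the conversation by appending turns to `messages` and regenerating. The following is a minimal sketch, not an official example: it assumes assistant turns use the same content-list format as user turns, and the follow-up question is purely illustrative.
```python
# Append the model's reply, then ask a follow-up question about the same image.
# Assumes assistant turns use the same content-list format as user turns.
messages.append({"role": "assistant", "content": [{"text": result, "type": "text"}]})
messages.append({
    "role": "user",
    "content": [{"text": "Describe the colors in the image in more detail.", "type": "text"}],
})

# Re-render the full conversation; the image from the first turn is passed again.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
    )
    follow_up = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(follow_up)
```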
### Advanced Inference and Fine-tuning
We provide a [codebase](https://github.com/rhymes-ai/Aria) for more advanced usage of Aria,
including vLLM inference, cookbooks, and fine-tuning on custom datasets.
## Citation
If you find our work helpful, please consider citing:
```bibtex
@article{aria,
title={Aria: An Open Multimodal Native Mixture-of-Experts Model},
author={Dongxu Li and Yudong Liu and Haoning Wu and Yue Wang and Zhiqi Shen and Bowen Qu and Xinyao Niu and Guoyin Wang and Bei Chen and Junnan Li},
year={2024},
journal={arXiv preprint arXiv:2410.05993},
}
``` |