---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- aria
---
<!-- <p align="center">
    <br>Aria</br>
</p> -->

# Aria-Base-64K Model Card

<p align="center">
        🔗 <a href="https://rhymes.ai/" target="_blank"> Try Aria!</a> · 📖 <a href="https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model" target="_blank">Blog</a> · 📌 <a href="https://arxiv.org/pdf/2410.05993" target="_blank">Paper</a>
        · ⭐ <a href="https://github.com/rhymes-ai/Aria" target="_blank">GitHub</a> · 🟣 <a href="https://discord.com/invite/u8HxU23myj" target="_blank"> Discord </a>
</p>

This checkpoint is one of the base models of [Aria](https://huggingface.co/rhymes-ai/Aria), intended for research purposes as well as continued training. Specifically, Aria-Base-64K corresponds to the model checkpoint after the long-context pre-training stage (boxed in purple below).

<img src="./aria-stages.png" alt="Aria Training Stages" style="width: 75%;">

Aria-Base-64K is fine-tuned from [Aria-Base-8K](https://huggingface.co/teowu/Aria-Base-8K).

<!--
- Aria is the **first open multimodal native MoE** model, capable of seamlessly handling various input modalities within a MoE architecture.
- Aria performs **on par with GPT-4o mini and Gemini 1.5 Flash** across a range of multimodal tasks while maintaining strong performance on **text**-only tasks.
- Compared to similar or even larger models, Aria boasts **faster speeds** and **lower costs**. This high efficiency stems from its ability to activate only 3.9B parameters during inference – the **fewest** among models with comparable performance.
-->

## Aria-Base-64K

- **Base Model After Long-Context Pre-training**: This model corresponds to the checkpoint after the long-context pre-training stage, which trains on 33B tokens (21B multimodal, 12B language, 69% long-form). The stage lasts 1,000 iterations, with all sequences packed to a length of 65,536 using Megatron-LM and a global batch size of 512. The learning rate is held constant at `3.5e-5` throughout this stage.
- **Appropriate for Video and Long-document Fine-tuning**: This model is recommended for long-form continued pre-training or fine-tuning, e.g. on video QA or long-document QA datasets. When resources are limited, it is also possible to post-train this model with short instruction-tuning datasets and then transfer it to long-form QA scenarios.
- **Understanding of Hundreds of Images**: This model can understand up to 250 high-resolution images or up to 500 mid-resolution images in a single context (a multi-image inference sketch is included in the Quick Start below).
- **Strong Base Performance on Language and Multimodal Scenarios**: This model retains the same strong base performance as [Aria-Base-8K](https://huggingface.co/teowu/Aria-Base-8K).
- ***Limited Chat Template Availability***: Only a small fraction of the training data (around 3%) in this stage is re-formatted with the chat template. Hence, the model may not perform optimally when evaluated directly on chat-style benchmarks.

<!-- # Model Info

| Model | Download | Parameter | Context Length |
| :---- | :------- | :------------ | :------ |
| Aria | < HF link - TBD> | • Activation: 3.9B (3.5B MoE + 0.4B Visual Encoder) <br> • Total: 25.3B | 64K | -->

## Benchmark

N/A.

## Quick Start

### Installation

```
pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
pip install flash-attn --no-build-isolation

# For better inference performance, you can install grouped-gemm, which may take 3-5 minutes to install
pip install grouped_gemm==0.1.6
```

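Optionally, a quick sanity check (a minimal sketch, not part of the official setup) to confirm that the pinned `transformers` version is active and a CUDA GPU is visible before loading the checkpoint:

```python
import torch
import transformers

print(transformers.__version__)   # expected: 4.45.0
print(torch.cuda.is_available())  # a CUDA GPU is needed for flash-attn inference
```
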
### Inference

You can load this checkpoint the same way as the final Aria model. However, as a base model, it may not yield optimal chat performance.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "teowu/Aria-Base-64K"

# Load the model and its processor (trust_remote_code is required for the Aria architecture)
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)

processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

# Download a sample image
image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"

image = Image.open(requests.get(image_path, stream=True).raw)

# A single-turn conversation: one image placeholder followed by the question
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate and decode only the newly produced tokens
with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)

print(result)
```

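As a rough sketch of the multi-image capability described above, the example below extends the single-image recipe to several images. It assumes the processor accepts a list of PIL images with one `{"type": "image"}` placeholder per image in the messages, mirroring the single-image call; the URLs here simply repeat the sample image, so substitute your own frames or document pages, and see the [Aria codebase](https://github.com/rhymes-ai/Aria) for the official multi-image and video recipes.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "teowu/Aria-Base-64K"
model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)

# Placeholder image list: the same sample image twice; replace with your own images
# (up to hundreds, within the 64K-token context).
image_urls = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]

# One image placeholder per image, followed by the text question
messages = [
    {
        "role": "user",
        "content": [{"text": None, "type": "image"} for _ in images]
        + [{"text": "Describe each image and how they differ.", "type": "text"}],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=images, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    print(processor.decode(output_ids, skip_special_tokens=True))
```
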
### Advanced Inference and Fine-tuning

We provide a [codebase](https://github.com/rhymes-ai/Aria) for more advanced usage of Aria,
including vLLM inference, cookbooks, and fine-tuning on custom datasets.

As this checkpoint shares the same architecture as the final model,
you can simply replace `rhymes-ai/Aria` with this model path in any advanced inference or fine-tuning recipe.
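For instance, a hedged sketch of the only change that is usually needed (the surrounding script is whichever example you start from in the codebase):

```python
# Wherever an Aria example loads the final model, point it at this base checkpoint instead.
model_id_or_path = "teowu/Aria-Base-64K"  # instead of "rhymes-ai/Aria"
```
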
## Citation

If you find our work helpful, please consider citing.

```
@article{aria,
  title={Aria: An Open Multimodal Native Mixture-of-Experts Model},
  author={Dongxu Li and Yudong Liu and Haoning Wu and Yue Wang and Zhiqi Shen and Bowen Qu and Xinyao Niu and Guoyin Wang and Bei Chen and Junnan Li},
  year={2024},
  journal={arXiv preprint arXiv:2410.05993},
}
```