RichardErkhov commited on
Commit
15ed6c4
·
verified ·
1 Parent(s): 9bea9f6

uploaded readme

Browse files
Files changed (1) hide show
  1. README.md +168 -0
README.md ADDED
@@ -0,0 +1,168 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Quantization made by Richard Erkhov.
2
+
3
+ [Github](https://github.com/RichardErkhov)
4
+
5
+ [Discord](https://discord.gg/pvy7H8DZMG)
6
+
7
+ [Request more models](https://github.com/RichardErkhov/quant_request)
8
+
9
+
10
+ Aquila-VL-2B-llava-qwen - AWQ
11
+ - Model creator: https://huggingface.co/BAAI/
12
+ - Original model: https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen/
13
+
14
+
15
+
16
+
17
+ Original model description:
18
+ ---
19
+ license: apache-2.0
20
+ language:
21
+ - en
22
+ - zh
23
+ tags:
24
+ - multimodal
25
+ library_name: transformers
26
+ datasets:
27
+ - BAAI/Infinity-MM
28
+ - BAAI/Infinity-Instruct
29
+ - BAAI/Infinity-Preference
30
+ base_model:
31
+ - Qwen/Qwen2.5-1.5B-Instruct
32
+ - google/siglip-so400m-patch14-384
33
+ pipeline_tag: visual-question-answering
34
+ ---
35
+
36
+ ![mof-class1](https://mot.isitopen.ai/model/1130/badge/1)
37
+
38
+ # Introduction
39
+
40
+ The **Aquila-VL-2B** model is a vision-language model (VLM) trained based on the [LLava-one-vision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/) framework. The [Qwen2.5-1.5B-instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model is chose as the LLM, while [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) is utilized as the vision tower.
41
+
42
+ The model was trained on our self-built Infinity-MM dataset, which contains approximately 40 million image-text pairs. This dataset is a combination of open-source data collected from the internet and synthetic instruction data generated using open-source VLM models.
43
+
44
+
45
+ We have open-sourced [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM) dataset and related resources. We hope you enjoy using them!
46
+
47
+ ## News
48
+ - `2024/11/19`: We have released [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-VL-2B-Intermediate) obtained during different stages of training. Please feel free to use these models for analysis and experimentation.
49
+ - `2024/10/25`: The [Aquila-VL-2B](https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen) model and [Infinity-MM](https://huggingface.co/datasets/BAAI/Infinity-MM) dataset are now available. We have also released the [technical report](https://arxiv.org/abs/2410.18558) simultaneously.
50
+
51
+ # Evaluation
52
+
53
+ We evaluated the model using the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) tool. Whenever possible, we prioritized using the OpenAI API for test sets that support API-based evaluation.
54
+
55
+ | Benchmark | MiniCPM-V-2 | InternVL2-2B | XinYuan-VL-2B | Qwen2-VL-2B-Instruct | Aquila-VL-2B |
56
+ | :--------------------------- | :---------: | :----------: | :-----------: | :------------------: | :----------: |
57
+ | MMBench-EN<sub>test</sub> | 69.4 | 73.4 | **78.9** | 74.9 | 78.8 |
58
+ | MMBench-CN<sub>test</sub> | 65.9 | 70.9 | 76.1 | 73.9 | **76.4** |
59
+ | MMBench_V1.1<sub>test</sub> | 65.2 | 69.7 | **75.4** | 72.7 | 75.2 |
60
+ | MMT-Bench<sub>test</sub> | 54.5 | 53.3 | 57.2 | 54.8 | **58.2** |
61
+ | RealWorldQA | 55.4 | 57.3 | 63.9 | 62.6 | **63.9** |
62
+ | HallusionBench | 36.8 | 38.1 | 36.0 | 41.5 | **43.0** |
63
+ | SEEDBench2<sub>plus</sub> | 51.8 | 60.0 | 63.0 | 62.4 | **63.0** |
64
+ | LLaVABench | 66.1 | 64.8 | 42.4 | 52.5 | **68.4** |
65
+ | MMStar | 41.6 | 50.2 | 51.9 | 47.8 | **54.9** |
66
+ | POPE | 86.6 | 85.3 | **89.4** | 88.0 | 83.6 |
67
+ | MMVet | 44.0 | 41.1 | 42.7 | **50.7** | 44.3 |
68
+ | MMMU<sub>val</sub> | 39.6 | 34.9 | 43.6 | 41.7 | **47.4** |
69
+ | ScienceQA<sub>test</sub> | 80.4 | 94.1 | 86.6 | 78.1 | **95.2** |
70
+ | AI2D<sub>test</sub> | 64.8 | 74.4 | 74.2 | 74.6 | **75.0** |
71
+ | MathVista<sub>testmini</sub> | 39.0 | 45.0 | 47.1 | 47.9 | **59.0** |
72
+ | MathVerse<sub>testmini</sub> | 19.8 | 24.7 | 22.2 | 21.0 | **26.2** |
73
+ | MathVision | 15.4 | 12.6 | 16.3 | 17.5 | **18.4** |
74
+ | DocVQA<sub>test</sub> | 71.0 | 86.9 | 87.6 | **89.9** | 85.0 |
75
+ | InfoVQA<sub>test</sub> | 40.0 | 59.5 | 59.1 | **65.4** | 58.3 |
76
+ | ChartQA<sub>test</sub> | 59.6 | 71.4 | 57.1 | 73.5 | **76.5** |
77
+ | TextVQA<sub>val</sub> | 74.3 | 73.5 | 77.6 | **79.9** | 76.4 |
78
+ | OCRVQA<sub>testcore</sub> | 54.4 | 40.2 | 67.6 | **68.7** | 64.0 |
79
+ | VCR<sub>en easy</sub> | 27.6 | 51.6 | 67.7 | 68.3 | **70.0** |
80
+ | OCRBench | 613 | 784 | 782 | **810** | 772 |
81
+ | Average | 53.5 | 58.8 | 60.9 | 62.1 | **64.1** |
82
+
83
+
84
+
85
+ For comparison models, evaluations were conducted in a local environment, so the scores may differ slightly from those reported in papers or on the official VLMEvalKit leaderboard.
86
+
87
+ # How to use
88
+
89
+ ```python
90
+ # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
91
+ from llava.model.builder import load_pretrained_model
92
+ from llava.mm_utils import process_images, tokenizer_image_token
93
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
94
+ from llava.conversation import conv_templates
95
+ from PIL import Image
96
+ import requests
97
+ import copy
98
+ import torch
99
+ import warnings
100
+
101
+ warnings.filterwarnings("ignore")
102
+
103
+ pretrained = "BAAI/Aquila-VL-2B-llava-qwen"
104
+
105
+ model_name = "llava_qwen"
106
+ device = "cuda"
107
+ device_map = "auto"
108
+ tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
109
+
110
+ model.eval()
111
+
112
+ # load image from url
113
+ url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
114
+ image = Image.open(requests.get(url, stream=True).raw)
115
+
116
+ # load image from local environment
117
+ # url = "./local_image.jpg"
118
+ # image = Image.open(url)
119
+
120
+ image_tensor = process_images([image], image_processor, model.config)
121
+ image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
122
+
123
+ conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
124
+ question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
125
+ conv = copy.deepcopy(conv_templates[conv_template])
126
+ conv.append_message(conv.roles[0], question)
127
+ conv.append_message(conv.roles[1], None)
128
+ prompt_question = conv.get_prompt()
129
+
130
+ input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
131
+ image_sizes = [image.size]
132
+
133
+ cont = model.generate(
134
+ input_ids,
135
+ images=image_tensor,
136
+ image_sizes=image_sizes,
137
+ do_sample=False,
138
+ temperature=0,
139
+ max_new_tokens=4096,
140
+ )
141
+
142
+ text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
143
+
144
+ print(text_outputs)
145
+ ```
146
+
147
+
148
+
149
+ # Future Plan
150
+
151
+ * We plan to train models of various sizes.
152
+ * Future training will incorporate multi-image and video data.
153
+
154
+
155
+ ## **Citation**
156
+ If you find this useful, please cite the following work
157
+ ```
158
+ @misc{gu2024infinitymmscalingmultimodalperformance,
159
+ title={Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data},
160
+ author={Shuhao Gu and Jialing Zhang and Siyuan Zhou and Kevin Yu and Zhaohu Xing and Liangdong Wang and Zhou Cao and Jintao Jia and Zhuoyi Zhang and Yixuan Wang and Zhenchong Hu and Bo-Wen Zhang and Jijie Li and Dong Liang and Yingli Zhao and Yulong Ao and Yaoqi Liu and Fangxiang Feng and Guang Liu},
161
+ year={2024},
162
+ eprint={2410.18558},
163
+ archivePrefix={arXiv},
164
+ primaryClass={cs.CL},
165
+ url={https://arxiv.org/abs/2410.18558},
166
+ }
167
+ ```
168
+