zyoNoob Benasd committed on
Commit
b869af1
·
verified ·
0 Parent(s):

Duplicate from Benasd/Qwen2.5-VL-3B-Instruct-AWQ


Co-authored-by: Ben Tso <Benasd@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,36 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
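As a rough illustration of what the patterns above do, the following Python sketch checks whether a path would be routed through Git LFS. It is only an approximation: `fnmatch` is not Git's own matcher (Git, for example, lets `**/` match zero directories), the pattern list is a subset copied from the file above, and the helper name `tracked_by_lfs` is made up for this example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Subset of the patterns added to .gitattributes in this commit.
LFS_PATTERNS = [
    "*.bin",
    "*.safetensors",
    "*.pt",
    "*tfevents*",
    "saved_model/**/*",
    "tokenizer.json",
]


def tracked_by_lfs(path: str) -> bool:
    """Approximate gitattributes matching with fnmatch (hypothetical helper)."""
    name = PurePosixPath(path).name
    for pattern in LFS_PATTERNS:
        # Patterns without a slash are compared with the file name at any depth;
        # patterns with a slash are compared with the whole relative path.
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False


if __name__ == "__main__":
    for candidate in ["model.safetensors", "config.json", "tokenizer.json"]:
        print(candidate, tracked_by_lfs(candidate))
```

For an authoritative answer on a real checkout, `git check-attr filter -- <path>` reports the attribute Git actually applies.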
LICENSE ADDED
@@ -0,0 +1,54 @@
1
+ Qwen RESEARCH LICENSE AGREEMENT
2
+
3
+ Qwen RESEARCH LICENSE AGREEMENT Release Date: September 19, 2024
4
+
5
+ By clicking to agree or by using or distributing any portion or element of the Qwen Materials, you will be deemed to have recognized and accepted the content of this Agreement, which is effective immediately.
6
+
7
+ 1. Definitions
8
+ a. This Qwen RESEARCH LICENSE AGREEMENT (this "Agreement") shall mean the terms and conditions for use, reproduction, distribution and modification of the Materials as defined by this Agreement.
9
+ b. "We" (or "Us") shall mean Alibaba Cloud.
10
+ c. "You" (or "Your") shall mean a natural person or legal entity exercising the rights granted by this Agreement and/or using the Materials for any purpose and in any field of use.
11
+ d. "Third Parties" shall mean individuals or legal entities that are not under common control with us or you.
12
+ e. "Qwen" shall mean the large language models, and software and algorithms, consisting of trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by us.
13
+ f. "Materials" shall mean, collectively, Alibaba Cloud's proprietary Qwen and Documentation (and any portion thereof) made available under this Agreement.
14
+ g. "Source" form shall mean the preferred form for making modifications, including but not limited to model source code, documentation source, and configuration files.
15
+ h. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
16
+ i. "Non-Commercial" shall mean for research or evaluation purposes only.
17
+
18
+ 2. Grant of Rights
19
+ a. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud's intellectual property or other rights owned by us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials FOR NON-COMMERCIAL PURPOSES ONLY.
20
+ b. If you are commercially using the Materials, you shall request a license from us.
21
+
22
+ 3. Redistribution
23
+ You may distribute copies or make the Materials, or derivative works thereof, available as part of a product or service that contains any of them, with or without modifications, and in Source or Object form, provided that you meet the following conditions:
24
+ a. You shall give any other recipients of the Materials or derivative works a copy of this Agreement;
25
+ b. You shall cause any modified files to carry prominent notices stating that you changed the files;
26
+ c. You shall retain in all copies of the Materials that you distribute the following attribution notices within a "Notice" text file distributed as a part of such copies: "Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved."; and
27
+ d. You may add your own copyright statement to your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the work otherwise complies with the terms and conditions of this Agreement.
28
+
29
+ 4. Rules of use
30
+ a. The Materials may be subject to export controls or restrictions in China, the United States or other countries or regions. You shall comply with applicable laws and regulations in your use of the Materials.
31
+ b. If you use the Materials or any outputs or results therefrom to create, train, fine-tune, or improve an AI model that is distributed or made available, you shall prominently display “Built with Qwen” or “Improved using Qwen” in the related product documentation.
32
+
33
+ 5. Intellectual Property
34
+ a. We retain ownership of all intellectual property rights in and to the Materials and derivatives made by or for us. Conditioned upon compliance with the terms and conditions of this Agreement, with respect to any derivative works and modifications of the Materials that are made by you, you are and will be the owner of such derivative works and modifications.
35
+ b. No trademark license is granted to use the trade names, trademarks, service marks, or product names of us, except as required to fulfill notice requirements under this Agreement or as required for reasonable and customary use in describing and redistributing the Materials.
36
+ c. If you commence a lawsuit or other proceedings (including a cross-claim or counterclaim in a lawsuit) against us or any entity alleging that the Materials or any output therefrom, or any part of the foregoing, infringe any intellectual property or other right owned or licensable by you, then all licenses granted to you under this Agreement shall terminate as of the date such lawsuit or other proceeding is commenced or brought.
37
+
38
+ 6. Disclaimer of Warranty and Limitation of Liability
39
+ a. We are not obligated to support, update, provide training for, or develop any further version of the Qwen Materials or to grant any license thereto.
40
+ b. THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT, OR FITNESS FOR A PARTICULAR PURPOSE. WE MAKE NO WARRANTY AND ASSUME NO RESPONSIBILITY FOR THE SAFETY OR STABILITY OF THE MATERIALS AND ANY OUTPUT THEREFROM.
41
+ c. IN NO EVENT SHALL WE BE LIABLE TO YOU FOR ANY DAMAGES, INCLUDING, BUT NOT LIMITED TO ANY DIRECT, OR INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING FROM YOUR USE OR INABILITY TO USE THE MATERIALS OR ANY OUTPUT OF IT, NO MATTER HOW IT’S CAUSED.
42
+ d. You will defend, indemnify and hold harmless us from and against any claim by any third party arising out of or related to your use or distribution of the Materials.
43
+
44
+ 7. Survival and Termination.
45
+ a. The term of this Agreement shall commence upon your acceptance of this Agreement or access to the Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein.
46
+ b. We may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you must delete and cease use of the Materials. Sections 6 and 8 shall survive the termination of this Agreement.
47
+
48
+ 8. Governing Law and Jurisdiction.
49
+ a. This Agreement and any dispute arising out of or relating to it will be governed by the laws of China, without regard to conflict of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement.
50
+ b. The People's Courts in Hangzhou City shall have exclusive jurisdiction over any dispute arising out of this Agreement.
51
+
52
+ 9. Other Terms and Conditions.
53
+ a. Any arrangements, understandings, or agreements regarding the Material not stated herein are separate from and independent of the terms and conditions of this Agreement. You shall request a separate license from us, if you use the Materials in ways not expressly agreed to in this Agreement.
54
+ b. We shall not be bound by any additional or different terms or conditions communicated by you unless expressly agreed.
README.md ADDED
@@ -0,0 +1,527 @@
1
+
2
+ ---
3
+ license_name: qwen-research
4
+ license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
5
+ language:
6
+ - en
7
+ pipeline_tag: image-text-to-text
8
+ tags:
9
+ - multimodal
10
+ library_name: transformers
11
+ base_model:
12
+ - Qwen/Qwen2.5-VL-3B-Instruct
13
+ ---
14
+
15
+ # Qwen2.5-VL-3B-Instruct
16
+ <a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
17
+ <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
18
+ </a>
19
+
20
+ ## Introduction
21
+
22
+ In the past five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
23
+
24
+ #### Key Enhancements:
25
+ * **Understand things visually**: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
26
+
27
+ * **Being agentic**: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.
28
+
29
+ * **Understanding long videos and capturing events**: Qwen2.5-VL can comprehend videos longer than 1 hour, and it gains the new ability to capture events by pinpointing the relevant video segments.
30
+
31
+ * **Capable of visual localization in different formats**: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
32
+
33
+ * **Generating structured outputs**: For data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, which benefits use cases in finance, commerce, and similar fields.
34
+
35
+
36
+ #### Model Architecture Updates:
37
+
38
+ * **Dynamic Resolution and Frame Rate Training for Video Understanding**:
39
+
40
+ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.
41
+
42
+ <p align="center">
43
+ <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
44
+ </p>
45
+
46
+
47
+ * **Streamlined and Efficient Vision Encoder**
48
+
49
+ We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
50
+
51
+
52
+ We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
53
+
54
+
55
+
56
+ ## Evaluation
57
+
58
+ ### Image benchmark
59
+
60
+ | Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
61
+ | :--- | :---: | :---: | :---: |
62
+ | MMMU<sub>val</sub> | 52.3 | 54.1 | 53.1|
63
+ | MMMU-Pro<sub>val</sub> | **32.7** | 30.5 | 31.6|
64
+ | AI2D<sub>test</sub> | 81.4 | **83.0** | 81.5 |
65
+ | DocVQA<sub>test</sub> | 91.6 | **94.5** | 93.9 |
66
+ | InfoVQA<sub>test</sub> | 72.1 | 76.5 | **77.1** |
67
+ | TextVQA<sub>val</sub> | 76.8 | **84.3** | 79.3|
68
+ | MMBench-V1.1<sub>test</sub> | 79.3 | **80.7** | 77.6 |
69
+ | MMStar | 58.3 | **60.7** | 55.9 |
70
+ | MathVista<sub>testmini</sub> | 60.5 | 58.2 | **62.3** |
71
+ | MathVision<sub>full</sub> | 20.9 | 16.3 | **21.2** |
72
+
73
+
74
+ ### Video benchmark
75
+ | Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
76
+ | :--- | :---: | :---: | :---: |
77
+ | MVBench | 71.6 | 67.0 | 67.0 |
78
+ | VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
79
+ | MLVU | 48.3 | - | 68.2 |
80
+ | LVBench | - | - | 43.3 |
81
+ | MMBench-Video | 1.73 | 1.44 | 1.63 |
82
+ | EgoSchema | - | - | 64.8 |
83
+ | PerceptionTest | - | - | 66.9 |
84
+ | TempCompass | - | - | 64.4 |
85
+ | LongVideoBench | 55.2 | 55.6 | 54.2 |
86
+ | CharadesSTA/mIoU | - | - | 38.8 |
87
+
88
+
89
+ ### Agent benchmark
90
+ | Benchmarks | Qwen2.5-VL-3B |
91
+ |-------------------------|---------------|
92
+ | ScreenSpot | 55.5 |
93
+ | ScreenSpot Pro | 23.9 |
94
+ | AITZ_EM | 76.9 |
95
+ | Android Control High_EM | 63.7 |
96
+ | Android Control Low_EM | 22.2 |
97
+ | AndroidWorld_SR | 90.8 |
98
+ | MobileMiniWob++_SR | 67.9 |
99
+
100
+ ## Requirements
101
+ The code for Qwen2.5-VL is included in the latest Hugging Face Transformers, and we advise you to build from source with the following command:
102
+ ```
103
+ pip install git+https://github.com/huggingface/transformers accelerate
104
+ ```
105
+ or you might encounter the following error:
106
+ ```
107
+ KeyError: 'qwen2_5_vl'
108
+ ```
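Before loading the model, it can help to check the installed version up front. The sketch below is illustrative only: the `4.49.0` threshold is an assumption taken from the `"transformers_version": "4.49.0.dev0"` field in this repository's `config.json`, and the helper names `parse_release` and `knows_qwen2_5_vl` are invented for this example. An early development build may report a matching version and still lack the model.

```python
import re

# Assumption: threshold derived from "transformers_version" in config.json.
MIN_VERSION = (4, 49, 0)


def parse_release(version: str) -> tuple:
    """Return the leading numeric (major, minor, patch) of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(group or 0) for group in match.groups())


def knows_qwen2_5_vl(version: str) -> bool:
    """True when the version is at least the assumed minimum."""
    return parse_release(version) >= MIN_VERSION


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "ok" if knows_qwen2_5_vl(installed) else "too old, expect KeyError: 'qwen2_5_vl'"
        print(installed, "->", status)
```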
109
+
110
+
111
+ ## Quickstart
112
+
113
+ Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
114
+
124
+
125
+ We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
126
+
127
+ ```bash
128
+ # It's highly recommended to use the `[decord]` feature for faster video loading.
129
+ pip install qwen-vl-utils[decord]==0.0.8
130
+ ```
131
+
132
+ If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which falls back to torchvision for video processing. You can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) so that decord is used when loading video.
133
+
134
+ ### Using 🤗 Transformers to Chat
135
+
136
+ The following code snippet shows how to use the chat model with `transformers` and `qwen_vl_utils`:
137
+
138
+ ```python
139
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
140
+ from qwen_vl_utils import process_vision_info
141
+
142
+ # default: Load the model on the available device(s)
143
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
144
+ "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
145
+ )
146
+
147
+ # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
148
+ # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
149
+ # "Qwen/Qwen2.5-VL-3B-Instruct",
150
+ # torch_dtype=torch.bfloat16,
151
+ # attn_implementation="flash_attention_2",
152
+ # device_map="auto",
153
+ # )
154
+
155
+ # default processor
156
+ processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
157
+
158
+ # The default range for the number of visual tokens per image in the model is 4-16384.
159
+ # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
160
+ # min_pixels = 256*28*28
161
+ # max_pixels = 1280*28*28
162
+ # processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
163
+
164
+ messages = [
165
+ {
166
+ "role": "user",
167
+ "content": [
168
+ {
169
+ "type": "image",
170
+ "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
171
+ },
172
+ {"type": "text", "text": "Describe this image."},
173
+ ],
174
+ }
175
+ ]
176
+
177
+ # Preparation for inference
178
+ text = processor.apply_chat_template(
179
+ messages, tokenize=False, add_generation_prompt=True
180
+ )
181
+ image_inputs, video_inputs = process_vision_info(messages)
182
+ inputs = processor(
183
+ text=[text],
184
+ images=image_inputs,
185
+ videos=video_inputs,
186
+ padding=True,
187
+ return_tensors="pt",
188
+ )
189
+ inputs = inputs.to("cuda")
190
+
191
+ # Inference: Generation of the output
192
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
193
+ generated_ids_trimmed = [
194
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
195
+ ]
196
+ output_text = processor.batch_decode(
197
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
198
+ )
199
+ print(output_text)
200
+ ```
201
+ <details>
202
+ <summary>Multi image inference</summary>
203
+
204
+ ```python
205
+ # Messages containing multiple images and a text query
206
+ messages = [
207
+ {
208
+ "role": "user",
209
+ "content": [
210
+ {"type": "image", "image": "file:///path/to/image1.jpg"},
211
+ {"type": "image", "image": "file:///path/to/image2.jpg"},
212
+ {"type": "text", "text": "Identify the similarities between these images."},
213
+ ],
214
+ }
215
+ ]
216
+
217
+ # Preparation for inference
218
+ text = processor.apply_chat_template(
219
+ messages, tokenize=False, add_generation_prompt=True
220
+ )
221
+ image_inputs, video_inputs = process_vision_info(messages)
222
+ inputs = processor(
223
+ text=[text],
224
+ images=image_inputs,
225
+ videos=video_inputs,
226
+ padding=True,
227
+ return_tensors="pt",
228
+ )
229
+ inputs = inputs.to("cuda")
230
+
231
+ # Inference
232
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
233
+ generated_ids_trimmed = [
234
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
235
+ ]
236
+ output_text = processor.batch_decode(
237
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
238
+ )
239
+ print(output_text)
240
+ ```
241
+ </details>
242
+
243
+ <details>
244
+ <summary>Video inference</summary>
245
+
246
+ ```python
247
+ # Messages containing a list of images as a video and a text query
248
+ messages = [
249
+ {
250
+ "role": "user",
251
+ "content": [
252
+ {
253
+ "type": "video",
254
+ "video": [
255
+ "file:///path/to/frame1.jpg",
256
+ "file:///path/to/frame2.jpg",
257
+ "file:///path/to/frame3.jpg",
258
+ "file:///path/to/frame4.jpg",
259
+ ],
260
+ },
261
+ {"type": "text", "text": "Describe this video."},
262
+ ],
263
+ }
264
+ ]
265
+
266
+ # Messages containing a local video path and a text query
267
+ messages = [
268
+ {
269
+ "role": "user",
270
+ "content": [
271
+ {
272
+ "type": "video",
273
+ "video": "file:///path/to/video1.mp4",
274
+ "max_pixels": 360 * 420,
275
+ "fps": 1.0,
276
+ },
277
+ {"type": "text", "text": "Describe this video."},
278
+ ],
279
+ }
280
+ ]
281
+
282
+ # Messages containing a video url and a text query
283
+ messages = [
284
+ {
285
+ "role": "user",
286
+ "content": [
287
+ {
288
+ "type": "video",
289
+ "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
290
+ },
291
+ {"type": "text", "text": "Describe this video."},
292
+ ],
293
+ }
294
+ ]
295
+
296
+ # In Qwen2.5-VL, frame rate information is also passed to the model to align with absolute time.
297
+ # Preparation for inference
298
+ text = processor.apply_chat_template(
299
+ messages, tokenize=False, add_generation_prompt=True
300
+ )
301
+ image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
302
+ inputs = processor(
303
+ text=[text],
304
+ images=image_inputs,
305
+ videos=video_inputs,
307
+ padding=True,
308
+ return_tensors="pt",
309
+ **video_kwargs,
310
+ )
311
+ inputs = inputs.to("cuda")
312
+
313
+ # Inference
314
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
315
+ generated_ids_trimmed = [
316
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
317
+ ]
318
+ output_text = processor.batch_decode(
319
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
320
+ )
321
+ print(output_text)
322
+ ```
323
+
324
+ Video URL compatibility largely depends on the version of the third-party library, as detailed in the table below. If you prefer not to use the default backend, change it by setting `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord`.
325
+
326
+ | Backend | HTTP | HTTPS |
327
+ |-------------|------|-------|
328
+ | torchvision >= 0.19.0 | ✅ | ✅ |
329
+ | torchvision < 0.19.0 | ❌ | ❌ |
330
+ | decord | ✅ | ❌ |
331
+ </details>
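The backend table above can be turned into a small selection helper. This is a minimal sketch under stated assumptions: `pick_video_reader` is a hypothetical function, the torchvision version is passed in as a string instead of being detected, and setting the variable before `qwen_vl_utils` is imported is a cautious guess about when the library reads it.

```python
import os
from urllib.parse import urlparse


def pick_video_reader(video: str, torchvision_version: str):
    """Return "torchvision", "decord", or None (keep the default), following the table."""
    scheme = urlparse(video).scheme.lower()
    if scheme not in ("http", "https"):
        return None  # local path or file:// URI: the URL table does not apply
    major, minor = (int(part) for part in torchvision_version.split("+")[0].split(".")[:2])
    if (major, minor) >= (0, 19):
        return "torchvision"  # handles both HTTP and HTTPS
    if scheme == "http":
        return "decord"  # decord handles HTTP but not HTTPS
    raise ValueError("HTTPS video needs torchvision >= 0.19.0 according to the table")


reader = pick_video_reader(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4", "0.19.0"
)
if reader is not None:
    # Assumption: set this before importing or calling qwen_vl_utils.
    os.environ["FORCE_QWENVL_VIDEO_READER"] = reader
```

Raising on the unsupported combination, instead of silently picking a backend, surfaces the problem before any download is attempted.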
332
+
333
+ <details>
334
+ <summary>Batch inference</summary>
335
+
336
+ ```python
337
+ # Sample messages for batch inference
338
+ messages1 = [
339
+ {
340
+ "role": "user",
341
+ "content": [
342
+ {"type": "image", "image": "file:///path/to/image1.jpg"},
343
+ {"type": "image", "image": "file:///path/to/image2.jpg"},
344
+ {"type": "text", "text": "What are the common elements in these pictures?"},
345
+ ],
346
+ }
347
+ ]
348
+ messages2 = [
349
+ {"role": "system", "content": "You are a helpful assistant."},
350
+ {"role": "user", "content": "Who are you?"},
351
+ ]
352
+ # Combine messages for batch processing
353
+ messages = [messages1, messages2]
354
+
355
+ # Preparation for batch inference
356
+ texts = [
357
+ processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
358
+ for msg in messages
359
+ ]
360
+ image_inputs, video_inputs = process_vision_info(messages)
361
+ inputs = processor(
362
+ text=texts,
363
+ images=image_inputs,
364
+ videos=video_inputs,
365
+ padding=True,
366
+ return_tensors="pt",
367
+ )
368
+ inputs = inputs.to("cuda")
369
+
370
+ # Batch Inference
371
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
372
+ generated_ids_trimmed = [
373
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
374
+ ]
375
+ output_texts = processor.batch_decode(
376
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
377
+ )
378
+ print(output_texts)
379
+ ```
380
+ </details>
381
+
382
+ ### 🤖 ModelScope
383
+ We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues with downloading checkpoints.
384
+
385
+
386
+ ### More Usage Tips
387
+
388
+ For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
389
+
390
+ ```python
391
+ # You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
392
+ ## Local file path
393
+ messages = [
394
+ {
395
+ "role": "user",
396
+ "content": [
397
+ {"type": "image", "image": "file:///path/to/your/image.jpg"},
398
+ {"type": "text", "text": "Describe this image."},
399
+ ],
400
+ }
401
+ ]
402
+ ## Image URL
403
+ messages = [
404
+ {
405
+ "role": "user",
406
+ "content": [
407
+ {"type": "image", "image": "http://path/to/your/image.jpg"},
408
+ {"type": "text", "text": "Describe this image."},
409
+ ],
410
+ }
411
+ ]
412
+ ## Base64 encoded image
413
+ messages = [
414
+ {
415
+ "role": "user",
416
+ "content": [
417
+ {"type": "image", "image": "data:image;base64,/9j/..."},
418
+ {"type": "text", "text": "Describe this image."},
419
+ ],
420
+ }
421
+ ]
422
+ ```
423
+ #### Image Resolution for performance boost
424
+
425
+ The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
426
+
427
+ ```python
428
+ min_pixels = 256 * 28 * 28
429
+ max_pixels = 1280 * 28 * 28
430
+ processor = AutoProcessor.from_pretrained(
431
+ "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
432
+ )
433
+ ```
434
+
435
+ In addition, we provide two methods for fine-grained control over the size of the image input to the model:
436
+
437
+ 1. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
438
+
439
+ 2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
440
+
441
+ ```python
442
+ # resized_height and resized_width
443
+ messages = [
444
+ {
445
+ "role": "user",
446
+ "content": [
447
+ {
448
+ "type": "image",
449
+ "image": "file:///path/to/your/image.jpg",
450
+ "resized_height": 280,
451
+ "resized_width": 420,
452
+ },
453
+ {"type": "text", "text": "Describe this image."},
454
+ ],
455
+ }
456
+ ]
457
+ # min_pixels and max_pixels
458
+ messages = [
459
+ {
460
+ "role": "user",
461
+ "content": [
462
+ {
463
+ "type": "image",
464
+ "image": "file:///path/to/your/image.jpg",
465
+ "min_pixels": 50176,
466
+ "max_pixels": 50176,
467
+ },
468
+ {"type": "text", "text": "Describe this image."},
469
+ ],
470
+ }
471
+ ]
472
+ ```
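The relationship between pixel budgets, multiples of 28, and visual token counts described above can be sketched in a few lines. This is not the `qwen_vl_utils` implementation: the function names are hypothetical, and the rule that one visual token covers a 28x28 pixel area is inferred from the `256*28*28` and `1280*28*28` budgets in the README.

```python
import math

PATCH = 28  # assumption: one visual token per 28x28 pixel area


def round_to_multiple(value: int, factor: int = PATCH) -> int:
    """Round to the nearest multiple of `factor`, never below one patch."""
    return max(factor, round(value / factor) * factor)


def fit_to_pixel_budget(height, width, min_pixels, max_pixels, factor=PATCH):
    """Resize (height, width) into [min_pixels, max_pixels], roughly keeping the aspect ratio."""
    h = round_to_multiple(height, factor)
    w = round_to_multiple(width, factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same scale and round down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same scale and round up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


def visual_tokens(height: int, width: int, factor: int = PATCH) -> int:
    """Number of 28x28 patches in an image whose sides are multiples of 28."""
    return (height // factor) * (width // factor)


if __name__ == "__main__":
    h, w = fit_to_pixel_budget(1000, 1000, 256 * 28 * 28, 1280 * 28 * 28)
    print(h, w, visual_tokens(h, w))
```

Under the same assumption, the `min_pixels = max_pixels = 50176` example pins the image to 64 visual tokens, since 50176 = 64 * 28 * 28.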
473
+
474
+ ### Processing Long Texts
475
+
476
+ The current `config.json` is set for context length up to 32,768 tokens.
477
+ To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
478
+
479
+ For supported frameworks, you could add the following to `config.json` to enable YaRN:
480
+
481
+ ```
482
+ {
483
+ ...,
484
+ "type": "yarn",
485
+ "mrope_section": [
486
+ 16,
487
+ 24,
488
+ 24
489
+ ],
490
+ "factor": 4,
491
+ "original_max_position_embeddings": 32768
492
+ }
493
+ ```
494
+
495
+ However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
496
+
497
+ For long video inputs, because MRoPE consumes position IDs more sparingly, `max_position_embeddings` can instead be raised directly to a larger value, such as 64k.
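For readers who still want to experiment with the YaRN settings shown above, here is a hedged sketch that applies them to a config dictionary. Two points are assumptions and not statements from the README: that the keys belong under `"rope_scaling"` (where `config.json` already keeps `"type"` and `"mrope_section"`), and that `"rope_type"` should mirror `"type"` so the two fields do not disagree. The function name `enable_yarn` is hypothetical.

```python
import copy
import json


def enable_yarn(config: dict, factor: float = 4.0, original_max: int = 32768) -> dict:
    """Return a copy of `config` with the README's YaRN values applied."""
    updated = copy.deepcopy(config)
    rope = updated.setdefault("rope_scaling", {})
    rope.update(
        {
            "type": "yarn",
            "rope_type": "yarn",  # sketch's choice: keep both fields consistent
            "factor": factor,
            "original_max_position_embeddings": original_max,
        }
    )
    return updated


# rope_scaling block as it appears in this repository's config.json.
base = {"rope_scaling": {"mrope_section": [16, 24, 24], "rope_type": "default", "type": "default"}}
patched = enable_yarn(base)

if __name__ == "__main__":
    print(json.dumps(patched["rope_scaling"], indent=2))
    # Nominal extended window: factor times the original length.
    print("nominal context:", int(4.0 * 32768), "tokens")
```

Working on a copy keeps the original dictionary intact, which makes it easy to revert given the README's warning about localization quality.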
498
+
499
+
500
+
501
+ ## Citation
502
+
503
+ If you find our work helpful, please consider citing it.
504
+
505
+ ```
506
+ @misc{qwen2.5-VL,
507
+ title = {Qwen2.5-VL},
508
+ url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
509
+ author = {Qwen Team},
510
+ month = {January},
511
+ year = {2025}
512
+ }
513
+
514
+ @article{Qwen2VL,
515
+ title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
516
+ author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
517
+ journal={arXiv preprint arXiv:2409.12191},
518
+ year={2024}
519
+ }
520
+
521
+ @article{Qwen-VL,
522
+ title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
523
+ author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
524
+ journal={arXiv preprint arXiv:2308.12966},
525
+ year={2023}
526
+ }
527
+ ```
added_tokens.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "</tool_call>": 151658,
3
+ "<tool_call>": 151657,
4
+ "<|box_end|>": 151649,
5
+ "<|box_start|>": 151648,
6
+ "<|endoftext|>": 151643,
7
+ "<|file_sep|>": 151664,
8
+ "<|fim_middle|>": 151660,
9
+ "<|fim_pad|>": 151662,
10
+ "<|fim_prefix|>": 151659,
11
+ "<|fim_suffix|>": 151661,
12
+ "<|im_end|>": 151645,
13
+ "<|im_start|>": 151644,
14
+ "<|image_pad|>": 151655,
15
+ "<|object_ref_end|>": 151647,
16
+ "<|object_ref_start|>": 151646,
17
+ "<|quad_end|>": 151651,
18
+ "<|quad_start|>": 151650,
19
+ "<|repo_name|>": 151663,
20
+ "<|video_pad|>": 151656,
21
+ "<|vision_end|>": 151653,
22
+ "<|vision_pad|>": 151654,
23
+ "<|vision_start|>": 151652
24
+ }
chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
+ }
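
To make the template in chat_template.json easier to follow, here is a hand-written Python re-implementation of its branching. It is a sketch for reading the logic, not the Jinja renderer that the library actually uses; the function name `render_chat` and the sample message are assumptions made for illustration.

```python
def render_chat(messages, add_generation_prompt=False, add_vision_id=False):
    """Mirror the control flow of the chat_template.json Jinja template."""
    out = []
    image_count = 0
    video_count = 0
    for index, message in enumerate(messages):
        # A default system turn is injected when the first message is not one.
        if index == 0 and message["role"] != "system":
            out.append("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n")
        out.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            out.append(content + "<|im_end|>\n")
            continue
        for part in content:
            kind = part.get("type")
            if kind == "image" or "image" in part or "image_url" in part:
                image_count += 1
                if add_vision_id:
                    out.append(f"Picture {image_count}: ")
                out.append("<|vision_start|><|image_pad|><|vision_end|>")
            elif kind == "video" or "video" in part:
                video_count += 1
                if add_vision_id:
                    out.append(f"Video {video_count}: ")
                out.append("<|vision_start|><|video_pad|><|vision_end|>")
            elif "text" in part:
                out.append(part["text"])
        out.append("<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)


# Hypothetical single-turn request with one image and one text part.
prompt = render_chat(
    [{"role": "user", "content": [
        {"type": "image", "image": "demo.jpg"},
        {"type": "text", "text": "Describe this image."},
    ]}],
    add_generation_prompt=True,
)
print(prompt)
```

Each image or video part contributes a single `<|image_pad|>` or `<|video_pad|>` placeholder at this stage; expanding it to the real number of visual tokens happens outside the template.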
config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "_name_or_path": "/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/b318d0972a0bfe62f64d49e1fe95e4fd39292261",
+   "architectures": [
+     "Qwen2_5_VLForConditionalGeneration"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 151643,
+   "eos_token_id": 151645,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "image_token_id": 151655,
+   "initializer_range": 0.02,
+   "intermediate_size": 11008,
+   "max_position_embeddings": 128000,
+   "max_window_layers": 70,
+   "model_type": "qwen2_5_vl",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 36,
+   "num_key_value_heads": 2,
+   "quantization_config": {
+     "bits": 4,
+     "group_size": 128,
+     "modules_to_not_convert": [
+       "visual"
+     ],
+     "quant_method": "awq",
+     "version": "gemm",
+     "zero_point": true
+   },
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": {
+     "mrope_section": [
+       16,
+       24,
+       24
+     ],
+     "rope_type": "default",
+     "type": "default"
+   },
+   "rope_theta": 1000000.0,
+   "sliding_window": 32768,
+   "tie_word_embeddings": true,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.49.0.dev0",
+   "use_cache": true,
+   "use_sliding_window": false,
+   "video_token_id": 151656,
+   "vision_config": {
+     "hidden_size": 1280,
+     "in_chans": 3,
+     "model_type": "qwen2_5_vl",
+     "out_hidden_size": 2048,
+     "spatial_patch_size": 14,
+     "tokens_per_second": 2
+   },
+   "vision_end_token_id": 151653,
+   "vision_start_token_id": 151652,
+   "vision_token_id": 151654,
+   "vocab_size": 151936
+ }
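
A few derived quantities follow from the numbers in config.json. The sketch below computes them under two stated assumptions that the config itself does not spell out: that the per-head dimension is `hidden_size // num_attention_heads`, and that the KV cache is held in a 2-byte dtype such as the `bfloat16` named in `torch_dtype`. Treat the results as back-of-the-envelope estimates.

```python
# Values copied from config.json.
hidden_size = 2048
num_attention_heads = 16
num_key_value_heads = 2
num_hidden_layers = 36
mrope_section = [16, 24, 24]

# Assumption: head_dim is not stored explicitly and is derived like this.
head_dim = hidden_size // num_attention_heads

# Grouped-query attention: how many query heads share one key/value head.
queries_per_kv_head = num_attention_heads // num_key_value_heads

# The three mrope sections together cover half of the head dimension,
# i.e. one entry per rotary (cos, sin) pair.
rotary_pairs = sum(mrope_section)

# Assumption: the KV cache uses 2 bytes per element (e.g. bfloat16).
bytes_per_element = 2
kv_cache_bytes_per_token = (
    2  # one key tensor and one value tensor
    * num_hidden_layers
    * num_key_value_heads
    * head_dim
    * bytes_per_element
)

print(head_dim, queries_per_kv_head, rotary_pairs, kv_cache_bytes_per_token)
```

Because only 2 of the 16 heads carry keys and values, the cache cost per token is one eighth of what full multi-head attention would need at the same width.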
generation_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "bos_token_id": 151643,
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "pad_token_id": 151643,
+   "repetition_penalty": 1.05,
+   "temperature": 0.1,
+   "top_k": 1,
+   "top_p": 0.001,
+   "transformers_version": "4.49.0.dev0"
+ }
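
Although generation_config.json sets `do_sample` to true, `top_k` is 1, so only the highest-scoring token can survive filtering and sampling is effectively greedy. The sketch below is a simplified pure-Python filter written for illustration (temperature, then top-k, then top-p); it is not the library's implementation, it ignores `repetition_penalty`, and the logits are made up.

```python
import math


def filter_candidates(logits, temperature, top_k, top_p):
    """Return {token_index: probability} for tokens that survive filtering."""
    scaled = [value / temperature for value in logits]
    # Keep the top_k highest-scoring indices.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]
    # Softmax over the kept indices (shifted by the max for stability).
    peak = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - peak) for i in kept]
    total = sum(weights)
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    survivors = {}
    cumulative = 0.0
    for index, weight in zip(kept, weights):
        probability = weight / total
        survivors[index] = probability
        cumulative += probability
        if cumulative >= top_p:
            break
    norm = sum(survivors.values())
    return {index: p / norm for index, p in survivors.items()}


# Hypothetical logits for a four-token vocabulary.
survivors = filter_candidates([1.0, 3.5, 3.4, -2.0], temperature=0.1, top_k=1, top_p=0.001)
print(survivors)
```

With these settings the low temperature and tiny `top_p` are redundant: `top_k=1` already leaves a single candidate with probability 1.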
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02a25c2c48fc49f6811a7ebf1b2ab58a6e16fc6399ff64425ca182191f7f6ac5
+ size 3401785760
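
The three lines above are a Git LFS pointer, not the weights themselves: the repository stores only the hash and byte size of the real file. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for this example.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the model.safetensors diff.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:02a25c2c48fc49f6811a7ebf1b2ab58a6e16fc6399ff64425ca182191f7f6ac5
size 3401785760
"""

pointer = parse_lfs_pointer(pointer_text)
size_gib = pointer["size"] / 2**30
print(pointer["hash_algo"], pointer["size"], f"{size_gib:.2f} GiB")
```
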
preprocessor_config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "do_convert_rgb": true,
+   "do_normalize": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_processor_type": "Qwen2_5_VLImageProcessor",
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "max_pixels": 12845056,
+   "merge_size": 2,
+   "min_pixels": 3136,
+   "patch_size": 14,
+   "processor_class": "Qwen2_5_VLProcessor",
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "longest_edge": 12845056,
+     "shortest_edge": 3136
+   },
+   "temporal_patch_size": 2
+ }
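
The pixel limits in preprocessor_config.json translate into a visual-token budget. The sketch below assumes that each visual token covers a `merge_size` x `merge_size` group of `patch_size`-pixel patches (so 28 x 28 pixels per token); that reading is an assumption about how the fields combine, and the real processor also resizes images, which this arithmetic does not model.

```python
# Values copied from preprocessor_config.json.
PATCH_SIZE = 14
MERGE_SIZE = 2
MIN_PIXELS = 3136
MAX_PIXELS = 12845056

# Assumption: one visual token spans merge_size x merge_size patches.
TOKEN_SIDE = PATCH_SIZE * MERGE_SIZE      # pixels along one side of a token
PIXELS_PER_TOKEN = TOKEN_SIDE * TOKEN_SIDE


def visual_tokens(height, width):
    """Token count for an image whose sides are already multiples of TOKEN_SIDE."""
    if height % TOKEN_SIDE or width % TOKEN_SIDE:
        raise ValueError(f"sides must be multiples of {TOKEN_SIDE}")
    patches = (height // PATCH_SIZE) * (width // PATCH_SIZE)
    return patches // (MERGE_SIZE * MERGE_SIZE)


min_tokens = MIN_PIXELS // PIXELS_PER_TOKEN
max_tokens = MAX_PIXELS // PIXELS_PER_TOKEN

print(TOKEN_SIDE, min_tokens, max_tokens, visual_tokens(448, 448))
```

Under this assumption both limits divide evenly by 784, which suggests they were chosen as whole token counts rather than arbitrary pixel areas.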
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
+ size 11421896
tokenizer_config.json ADDED
@@ -0,0 +1,208 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "bos_token": null,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content 
}}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
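
The `chat_template` in tokenizer_config.json instructs the model to emit each function call as a JSON object wrapped in `<tool_call></tool_call>` tags. A minimal sketch of pulling those calls back out of generated text follows; the helper name `extract_tool_calls`, the `get_weather` tool, and the sample output are all hypothetical, and real model output may need more defensive parsing.

```python
import json
import re

# Matches the payload between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(generated_text):
    """Return the decoded JSON object from every <tool_call> block."""
    calls = []
    for payload in TOOL_CALL_RE.findall(generated_text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            # Skip malformed blocks instead of failing the whole parse.
            continue
    return calls


# Hypothetical assistant output in the format the template describes.
sample_output = (
    "Let me check that.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)

calls = extract_tool_calls(sample_output)
print(calls)
```
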
vocab.json ADDED
The diff for this file is too large to render. See raw diff