creativeideaspal committed
Commit 1c02aa7 · 0 Parent(s)

Initial import of Qari-OCR files
.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,197 @@
+ ---
+ base_model:
+ - unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit
+ language:
+ - ar
+ library_name: peft
+ license: apache-2.0
+ metrics:
+ - bleu
+ - wer
+ - cer
+ pipeline_tag: image-text-to-text
+ tags:
+ - transformers
+ - unsloth
+ - qwen2_vl
+ - trl
+ - ocr
+ ---
+
+ # Qari-OCR-Arabic-0.2.2.1-VL-2B-Instruct Model
+
+ This is the model described in the paper [QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation](https://huggingface.co/papers/2506.02295).
+
+ ## Model Overview
+
+ This model is a fine-tuned version of [unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit](https://huggingface.co/unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit) on an Arabic OCR dataset. It is optimized to perform Arabic Optical Character Recognition (OCR) for full-page text.
+
+ The accompanying code can be found at this URL.
+
+ ## Key Features
+
+ - **Superior Accuracy**: Achieves state-of-the-art performance metrics for Arabic OCR
+ - **Diacritics Support**: Full recognition of Arabic diacritical marks (tashkeel), including fatḥah, kasrah, ḍammah, sukūn, shadda, and tanwin forms, a strength confirmed by evaluation on a primarily diacritical text dataset
+ - **Multiple Font Support**: Works across a variety of Arabic font styles
+ - **Layout Flexibility**: Handles different document layouts and formats
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/HuUcfziXcDT_2kwDoz5qH.png)
+
+ ## Model Details
+ - **Base Model**: Qwen2 VL
+ - **Fine-tuning Dataset**: Arabic OCR dataset
+ - **Objective**: Extract full-page Arabic text with high accuracy
+ - **Languages**: Arabic
+ - **Tasks**: OCR (Optical Character Recognition)
+ - **Dataset size**: 50,000 records
+ - **Epochs**: 1
+
+ ## Evaluation Metrics
+ Performance is evaluated using three standard metrics (a short computation sketch follows the list):
+ - **Word Error Rate (WER)**: Measures word-level accuracy (lower is better)
+ - **Character Error Rate (CER)**: Measures character-level accuracy (lower is better)
+ - **BLEU Score**: Measures n-gram overlap between the recognized text and the reference transcription (higher is better)
+
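+ A minimal sketch of how these metrics could be computed for a batch of OCR predictions against ground-truth transcriptions. The tooling here (`jiwer` for WER/CER, `sacrebleu` for BLEU) is an assumption chosen for illustration; the paper's exact evaluation scripts may differ.
+
+ ```python
+ # Hedged sketch: WER / CER / BLEU for OCR outputs (jiwer and sacrebleu are assumed tools).
+ import jiwer
+ import sacrebleu
+
+ references  = ["النص المرجعي للصفحة الأولى", "النص المرجعي للصفحة الثانية"]   # ground-truth transcriptions
+ predictions = ["النص المرجعي للصفحه الأولى", "النص المرجعي للصفحة الثانية"]   # model outputs
+
+ wer = jiwer.wer(references, predictions)   # word error rate, lower is better
+ cer = jiwer.cer(references, predictions)   # character error rate, lower is better
+ # sacrebleu reports BLEU on a 0-100 scale; the Results table below uses a 0-1 scale.
+ bleu = sacrebleu.corpus_bleu(predictions, [references]).score / 100
+
+ print(f"WER={wer:.3f}  CER={cer:.3f}  BLEU={bleu:.3f}")
+ ```
+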
+ ### Results
+
+ | Model | WER ↓ | CER ↓ | BLEU ↑ |
+ |-------|-------|-------|--------|
+ | **Qari-OCR-0.2.2.1-VL-2B-Instruct** | **0.221** | **0.059** | **0.597** |
+ | AIN 8B | 0.757 | 0.309 | 0.103 |
+ | Qari-OCR-0.1-VL-2B-Instruct | 1.294 | 0.770 | 0.022 |
+ | easyOCR | 1.004 | 0.648 | 0.005 |
+ | pytesseract | 0.990 | 0.911 | <0.001 |
+
+ ### WER Comparison
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/Artnw-bVJuSaO_vnLeupE.png" height="600px"/>
+
+ ### CER Comparison
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/GihjVBk32SCyFCpJ81AEX.png" height="600px"/>
+
+ ### BLEU Score Comparison
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/HOOlFw5l_Os3dyyKmTXUs.png" height="600px"/>
+
+ ## Training Details
+
+ ### Training Data
+ The model was trained using the following specifications (an illustrative rendering sketch follows the font list):
+
+ - **Font Sizes**: 14, 16, 18, 20, 24, 32, 40 pt
+ - **Page Layouts**:
+   - A4 (210mm × 297mm)
+   - Letter (216mm × 279mm)
+   - Small (105mm × 148mm)
+   - Square (1080px × 1080px)
+   - OneLine (210mm × 10mm)
+ - **Arabic Fonts Used**:
+   - IBM Plex Sans Arabic
+   - KFGQPC Uthman Taha Naskh
+   - Scheherazade New
+   - Amiri
+   - Madina
+   - Diwani Letter
+   - Tajawal
+   - Cairo
+   - Lateef
+   - Almarai
+   - AlQalam Quran
+   - Noto Naskh Arabic
+
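+ The snippet below is an illustrative sketch of how one synthetic page in this style could be rendered; it is not the authors' actual data-generation pipeline. It assumes `pillow`, `arabic-reshaper`, and `python-bidi` are installed and that a local copy of one of the listed fonts (here Amiri) is available.
+
+ ```python
+ # Illustrative sketch only: render diacritised Arabic text onto a white A4-sized page.
+ # Assumptions: Pillow + arabic_reshaper + python-bidi; the font file path is hypothetical.
+ import arabic_reshaper
+ from bidi.algorithm import get_display
+ from PIL import Image, ImageDraw, ImageFont
+
+ DPI = 150
+ page_w, page_h = int(210 / 25.4 * DPI), int(297 / 25.4 * DPI)    # A4 (210mm × 297mm) in pixels
+
+ text = "مَثَالٌ قَصِيرٌ لِنَصٍّ عَرَبِيٍّ مُشَكَّلٍ"              # a short diacritised sample line
+ shaped = get_display(arabic_reshaper.reshape(text))              # shape letters and apply RTL ordering
+
+ font = ImageFont.truetype("Amiri-Regular.ttf", size=40)          # one of the listed fonts (path is an assumption)
+ page = Image.new("RGB", (page_w, page_h), "white")
+ draw = ImageDraw.Draw(page)
+ draw.text((page_w - 80, 80), shaped, font=font, fill="black", anchor="ra")   # start near the right margin
+ page.save("synthetic_page.png")                                  # the image plus `text` form one OCR training pair
+ ```
+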
+ ### Limitations
+ Based on the training specifications, the model has the following limitations:
+
+ 1. **Font Size Constraints**: May have reduced accuracy with very small (< 14 pt) or very large (> 40 pt) text
+ 2. **Font Coverage**: Performance may degrade on uncommon Arabic fonts not represented in the training data
+ 3. **Diacritics Complexity**: While the model supports diacritics (tashkeel), extremely dense or unconventional diacritical mark combinations may reduce accuracy
+ 4. **Layout Sensitivity**: May have difficulty with complex multi-column layouts or unconventional page formats
+ 5. **Handwriting Recognition**: Limited capability with handwritten text, as training focused on digital fonts
+ 6. **Decorative Text**: May struggle with highly stylized or decorative Arabic calligraphy
+ 7. **Background Complexity**: Optimized for clear backgrounds; performance may degrade with complex or textured backgrounds
+ 8. **Text Degradation**: May have challenges with severely degraded, blurry, or low-resolution text
+ 9. **Non-standard Orientations**: Primarily designed for horizontally oriented text; may struggle with vertical or diagonal text
+
+ ### Evaluation Method
+ Evaluation was performed on a diverse dataset of Arabic text images, **primarily featuring diacritical marks (tashkeel)**, measuring:
+ - **Word Error Rate (WER)**: The percentage of incorrectly recognized words
+ - **Character Error Rate (CER)**: The percentage of incorrectly recognized characters
+ - **BLEU Score**: A measure of n-gram overlap with the reference transcription; higher scores indicate better overall text recognition
+
+ ## How to Use
+
+ [Try Qari v0.2.2.1 - Google Colab](https://colab.research.google.com/github/NAMAA-ORG/public-notebooks/blob/main/Qari_V0_2_2_1_Free_Colab_updated.ipynb)
+
+ You can load this model using the `transformers` and `qwen_vl_utils` libraries:
+ ```bash
+ !pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
+ !pip install -U bitsandbytes
+ ```
+
+ ```python
+ from PIL import Image
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+ import torch
+ import os
+ from qwen_vl_utils import process_vision_info
+
+ model_name = "NAMAA-Space/Qari-OCR-0.2.2.1-Arabic-2B-Instruct"
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_name)
+ max_tokens = 2000
+
+ prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."
+
+ # `image` should be a PIL.Image of the document page to transcribe. It is saved to a
+ # temporary file so it can be passed to the processor by path.
+ image = Image.open("page.png")  # placeholder path; replace with your own page image
+ src = "image.png"
+ image.save(src)
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": f"file://{src}"},
+             {"type": "text", "text": prompt},
+         ],
+     }
+ ]
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to("cuda")
+ generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
+ # Strip the prompt tokens from the generated sequence before decoding.
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )[0]
+ os.remove(src)  # clean up the temporary image file
+ print(output_text)
+ ```
+
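+ This repository ships a LoRA (PEFT) adapter rather than a fully merged checkpoint. The call above relies on the automatic adapter loading in recent `transformers`; if you prefer to attach the adapter to the base checkpoint explicitly, a minimal sketch (the base model ID below is the one listed in the front matter; adjust to your setup) looks like this:
+
+ ```python
+ # Hedged sketch: explicit LoRA attachment with PEFT (not the only way to load this model).
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+ from peft import PeftModel
+
+ base_id = "unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit"        # base checkpoint (from the front matter)
+ adapter_id = "NAMAA-Space/Qari-OCR-0.2.2.1-Arabic-2B-Instruct"   # this repository
+
+ base = Qwen2VLForConditionalGeneration.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
+ model = PeftModel.from_pretrained(base, adapter_id)              # wraps the base model with the LoRA weights
+ processor = AutoProcessor.from_pretrained(adapter_id)
+ ```
+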
+ ## License
+ This model follows the licensing terms of the original Qwen2 VL model. Please review the terms before using it commercially.
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```
+ @article{wasfy2025qari,
+   title={QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation},
+   author={Wasfy, Ahmed and Nacar, Omer and Elkhateb, Abdelakreem and Reda, Mahmoud and Elshehy, Omar and Ammar, Adel and Boulila, Wadii},
+   journal={arXiv preprint arXiv:2506.02295},
+   year={2025}
+ }
+ ```
adapter_config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "unsloth/qwen2-vl-2b-instruct-unsloth-bnb-4bit",
+   "bias": "none",
+   "corda_config": null,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 16,
+   "lora_bias": false,
+   "lora_dropout": 0,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": "(?:.*?(?:vision|image|visual|patch|language|text).*?(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense).*?(?:qkv|proj|fc1|fc2|q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj).*?)|(?:\\bmodel\\.layers\\.[\\d]{1,}\\.(?:self_attn|attention|attn|mlp|feed_forward|ffn|dense)\\.(?:(?:qkv|proj|fc1|fc2|q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)))",
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:48c56052788cf3568708553918584d4a6e0c264dd860ddb8949b27d0953987e2
+ size 115886968
added_tokens.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652
+ }
chat_template.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
preprocessor_config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "do_convert_rgb": true,
+   "do_normalize": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_processor_type": "Qwen2VLImageProcessor",
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "max_pixels": 12845056,
+   "merge_size": 2,
+   "min_pixels": 3136,
+   "patch_size": 14,
+   "processor_class": "Qwen2VLProcessor",
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "longest_edge": 12845056,
+     "shortest_edge": 3136
+   },
+   "temporal_patch_size": 2
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|vision_pad|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:948c45c29a91dd2e6ae77d6f5a324a3d408bcca6ad443365b2e79986f1422771
+ size 11420540
tokenizer_config.json ADDED
@@ -0,0 +1,145 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "bos_token": null,
+   "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 32768,
+   "pad_token": "<|vision_pad|>",
+   "padding_side": "right",
+   "processor_class": "Qwen2VLProcessor",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:17fbfb7a9acefafd4f5ff7e2dd0a83c6914db946c80d6d26cf93b12237853b17
+ size 5624
vocab.json ADDED
The diff for this file is too large to render. See raw diff