Visual Question Answering
Transformers
Safetensors
internvl_chat
feature-extraction
custom_code
8-bit precision
bitsandbytes
failspy commited on
Commit
5014aa2
1 Parent(s): 0a30a85

Upload 17 files

Browse files
README.md CHANGED
@@ -1,199 +1,270 @@
1
  ---
2
- library_name: transformers
3
- tags: []
 
 
 
 
 
 
 
4
  ---
5
 
6
- # Model Card for Model ID
 
 
 
7
 
8
- <!-- Provide a quick summary of what the model is/does. -->
9
 
 
10
 
 
 
 
 
 
11
 
12
- ## Model Details
13
-
14
- ### Model Description
15
-
16
- <!-- Provide a longer summary of what this model is. -->
17
-
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
-
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
-
28
- ### Model Sources [optional]
29
-
30
- <!-- Provide the basic links for the model. -->
31
-
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
-
36
- ## Uses
37
-
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
-
40
- ### Direct Use
41
-
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
-
44
- [More Information Needed]
45
-
46
- ### Downstream Use [optional]
47
-
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
-
50
- [More Information Needed]
51
-
52
- ### Out-of-Scope Use
53
-
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
-
56
- [More Information Needed]
57
-
58
- ## Bias, Risks, and Limitations
59
-
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
-
62
- [More Information Needed]
63
-
64
- ### Recommendations
65
-
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
-
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
-
70
- ## How to Get Started with the Model
71
-
72
- Use the code below to get started with the model.
73
-
74
- [More Information Needed]
75
 
76
- ## Training Details
77
-
78
- ### Training Data
79
-
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
-
82
- [More Information Needed]
83
-
84
- ### Training Procedure
85
-
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
-
88
- #### Preprocessing [optional]
89
-
90
- [More Information Needed]
91
-
92
-
93
- #### Training Hyperparameters
94
-
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
-
97
- #### Speeds, Sizes, Times [optional]
98
-
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
-
101
- [More Information Needed]
102
-
103
- ## Evaluation
104
-
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
-
107
- ### Testing Data, Factors & Metrics
108
-
109
- #### Testing Data
110
-
111
- <!-- This should link to a Dataset Card if possible. -->
112
-
113
- [More Information Needed]
114
-
115
- #### Factors
116
-
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
-
119
- [More Information Needed]
120
-
121
- #### Metrics
122
-
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
-
125
- [More Information Needed]
126
-
127
- ### Results
128
-
129
- [More Information Needed]
130
 
131
- #### Summary
 
 
 
 
 
 
132
 
 
133
 
 
 
 
 
 
 
134
 
135
- ## Model Examination [optional]
136
 
137
- <!-- Relevant interpretability work for the model goes here -->
138
 
139
- [More Information Needed]
140
 
141
- ## Environmental Impact
142
 
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
 
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
 
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
 
153
- ## Technical Specifications [optional]
154
 
155
- ### Model Architecture and Objective
156
 
157
- [More Information Needed]
158
 
159
- ### Compute Infrastructure
160
 
161
- [More Information Needed]
162
 
163
- #### Hardware
164
 
165
- [More Information Needed]
166
 
167
- #### Software
168
 
169
- [More Information Needed]
170
 
171
- ## Citation [optional]
172
 
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
 
175
- **BibTeX:**
176
 
177
- [More Information Needed]
178
 
179
- **APA:**
180
 
181
- [More Information Needed]
 
 
 
 
 
 
 
182
 
183
- ## Glossary [optional]
184
 
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
 
187
- [More Information Needed]
 
188
 
189
- ## More Information [optional]
190
 
191
- [More Information Needed]
 
 
 
 
 
 
 
 
192
 
193
- ## Model Card Authors [optional]
194
 
195
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
 
 
196
 
197
- ## Model Card Contact
198
 
199
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ license: mit
3
+ datasets:
4
+ - laion/laion2B-en
5
+ - laion/laion-coco
6
+ - laion/laion2B-multi
7
+ - kakaobrain/coyo-700m
8
+ - conceptual_captions
9
+ - wanng/wukong100m
10
+ pipeline_tag: visual-question-answering
11
  ---
12
 
13
+ # Model Card for InternVL-Chat-V1.5
14
+ <p align="center">
15
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
16
+ </p>
17
 
18
+ > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
19
 
20
+ \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
21
 
22
+ We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
23
+ We introduce three simple designs:
24
+ 1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs.
25
+ 2. Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448 &times; 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input.
26
+ 3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
27
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
28
 
29
+ ## Model Details
30
+ - **Model Type:** multimodal large language model (MLLM)
31
+ - **Model Stats:**
32
+ - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
33
+ - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
34
+ - Params: 25.5B
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
35
 
36
+ - **Training Strategy:**
37
+ - Pretraining Stage
38
+ - Learnable Component: ViT + MLP
39
+ - Data: Please see our technical report.
40
+ - SFT Stage
41
+ - Learnable Component: ViT + MLP + LLM
42
+ - Data: Please see our technical report.
43
 
44
+ ## Released Models
45
 
46
+ | Model | Vision Foundation Model | Release Date |Note |
47
+ | :---------------------------------------------------------:|:--------------------------------------------------------------------------: |:----------------------:| :---------------------------------- |
48
+ | InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) |2024.04.18 | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new)|
49
+ | InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |2024.02.21 | more SFT data and stronger |
50
+ | InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |2024.02.11 | scaling up LLM to 34B |
51
+ | InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) |InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) |2024.01.24 | support Chinese and stronger OCR |
52
 
53
+ ## Performance
54
 
55
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/ZyQklQ3C7C60I-xOv7X8L.png)
56
 
 
57
 
58
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/u98oqlnpZtWdq2dnarVlD.png)
59
 
60
+ ## Examples
61
 
62
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/R34jISP4K1U17m9yNP38O.png)
63
 
64
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/ChkU9XtlsjH0l2EqlO_is.png)
 
 
 
 
65
 
66
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/1TFxIcf96ANRPLoy4-rbh.png)
67
 
68
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/Wpjo1Sdwf7XcEDevqwcr-.png)
69
 
70
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/kO4-J38sN8TFtmQ5mIBMS.png)
71
 
72
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/qPnTe3Q9UBy8wbclOsmWk.png)
73
 
74
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/l_BILRi13CbZNzbZYn6o6.png)
75
 
76
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/2782y7RnvGBogYEIG__7S.png)
77
 
78
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/RyO35PTH14OFiwyxtAZM2.png)
79
 
80
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/xiLZXWL-JiCTVPnV_VxS2.png)
81
 
82
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/gqX46Tt5jvrcVqb0vcf06.png)
83
 
 
84
 
 
85
 
86
+ ## Model Usage
87
 
88
+ We provide an example code to run InternVL-Chat-V1.5 using `transformers`.
89
 
90
+ You also can use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
91
 
92
+ ```python
93
+ import json
94
+ import os
95
+ from transformers import AutoTokenizer, AutoModel
96
+ from tqdm import tqdm
97
+ import torch
98
+ import torchvision.transforms as T
99
+ from PIL import Image
100
 
101
+ from torchvision.transforms.functional import InterpolationMode
102
 
 
103
 
104
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
105
+ IMAGENET_STD = (0.229, 0.224, 0.225)
106
 
 
107
 
108
+ def build_transform(input_size):
109
+ MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
110
+ transform = T.Compose([
111
+ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
112
+ T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
113
+ T.ToTensor(),
114
+ T.Normalize(mean=MEAN, std=STD)
115
+ ])
116
+ return transform
117
 
 
118
 
119
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
120
+ best_ratio_diff = float('inf')
121
+ best_ratio = (1, 1)
122
+ area = width * height
123
+ for ratio in target_ratios:
124
+ target_aspect_ratio = ratio[0] / ratio[1]
125
+ ratio_diff = abs(aspect_ratio - target_aspect_ratio)
126
+ if ratio_diff < best_ratio_diff:
127
+ best_ratio_diff = ratio_diff
128
+ best_ratio = ratio
129
+ elif ratio_diff == best_ratio_diff:
130
+ if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
131
+ best_ratio = ratio
132
+ return best_ratio
133
 
 
134
 
135
+ def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
136
+ orig_width, orig_height = image.size
137
+ aspect_ratio = orig_width / orig_height
138
+
139
+ # calculate the existing image aspect ratio
140
+ target_ratios = set(
141
+ (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
142
+ i * j <= max_num and i * j >= min_num)
143
+ target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
144
+
145
+ # find the closest aspect ratio to the target
146
+ target_aspect_ratio = find_closest_aspect_ratio(
147
+ aspect_ratio, target_ratios, orig_width, orig_height, image_size)
148
+
149
+ # calculate the target width and height
150
+ target_width = image_size * target_aspect_ratio[0]
151
+ target_height = image_size * target_aspect_ratio[1]
152
+ blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
153
+
154
+ # resize the image
155
+ resized_img = image.resize((target_width, target_height))
156
+ processed_images = []
157
+ for i in range(blocks):
158
+ box = (
159
+ (i % (target_width // image_size)) * image_size,
160
+ (i // (target_width // image_size)) * image_size,
161
+ ((i % (target_width // image_size)) + 1) * image_size,
162
+ ((i // (target_width // image_size)) + 1) * image_size
163
+ )
164
+ # split the image
165
+ split_img = resized_img.crop(box)
166
+ processed_images.append(split_img)
167
+ assert len(processed_images) == blocks
168
+ if use_thumbnail and len(processed_images) != 1:
169
+ thumbnail_img = image.resize((image_size, image_size))
170
+ processed_images.append(thumbnail_img)
171
+ return processed_images
172
+
173
+
174
+ def load_image(image_file, input_size=448, max_num=6):
175
+ image = Image.open(image_file).convert('RGB')
176
+ transform = build_transform(input_size=input_size)
177
+ images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
178
+ pixel_values = [transform(image) for image in images]
179
+ pixel_values = torch.stack(pixel_values)
180
+ return pixel_values
181
+
182
+
183
+ path = "OpenGVLab/InternVL-Chat-V1-5"
184
+ # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
185
+ model = AutoModel.from_pretrained(
186
+ path,
187
+ torch_dtype=torch.bfloat16,
188
+ low_cpu_mem_usage=True,
189
+ trust_remote_code=True).eval().cuda()
190
+ # Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.
191
+ # model = AutoModel.from_pretrained(
192
+ # path,
193
+ # torch_dtype=torch.bfloat16,
194
+ # low_cpu_mem_usage=True,
195
+ # trust_remote_code=True,
196
+ # device_map='auto').eval()
197
+
198
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
199
+ # set the max number of tiles in `max_num`
200
+ pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
201
+
202
+ generation_config = dict(
203
+ num_beams=1,
204
+ max_new_tokens=512,
205
+ do_sample=False,
206
+ )
207
+
208
+ # single-round single-image conversation
209
+ question = "请详细描述图片"
210
+ response = model.chat(tokenizer, pixel_values, question, generation_config)
211
+ print(question, response)
212
+
213
+ # multi-round single-image conversation
214
+ question = "请详细描述图片"
215
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
216
+ print(question, response)
217
+
218
+ question = "请根据图片写一首诗"
219
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
220
+ print(question, response)
221
+
222
+ # multi-round multi-image conversation
223
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
224
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
225
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
226
+
227
+ question = "详细描述这两张图片"
228
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
229
+ print(question, response)
230
+ # 第一张图片是一只红熊猫,它有着独特的橙红色皮毛,脸部、耳朵和四肢的末端有白色斑块。红熊猫的眼睛周围有深色的环,它的耳朵是圆形的,上面有白色的毛。它正坐在一个木制的结构上,看起来像是一个平台或休息的地方。背景中有树木和竹子,这表明红熊猫可能在一个模拟自然环境的动物园或保护区内。
231
+ #
232
+ # 第二张图片是一只大熊猫,它是中国的国宝,以其黑白相间的皮毛而闻名。大熊猫的眼睛、耳朵和四肢的末端是黑色的,而它的脸部、耳朵内侧和身体其他部分是白色的。大熊猫正坐在地上,周围有竹子,这是它们的主要食物来源。背景中也有树木,这表明大熊猫可能在一个为它们提供自然栖息地模拟的动物园或保护区内。
233
+
234
+ question = "这两张图片的相同点和区别分别是什么"
235
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
236
+ print(question, response)
237
+ # 这两张图片的相同点:
238
+ #
239
+ # 1. 都展示了熊猫,这是两种不同的熊猫物种。
240
+ # 2. 熊猫都处于一个看起来像是模拟自然环境的场所,可能是动物园或保护区。
241
+ # 3. 熊猫周围都有竹子,这是它们的主要食物来源。
242
+ #
243
+ # 这两张图片的区别:
244
+ #
245
+ # 1. 熊猫的种类不同:第一张图片是一只红熊猫,第二张图片是一只大熊猫。
246
+ # 2. 熊猫的皮毛颜色和图案不同:红熊猫的皮毛是橙红色,脸部、耳朵和四肢的末端有白色斑块;而大熊猫的皮毛是黑白相间的,眼睛、耳朵和四肢的末端是黑色的,脸部、耳朵内侧和身体其他部分是白色的。
247
+ # 3. 熊猫的姿态和位置不同:红熊猫坐在一个木制的结构上,而大熊猫坐在地上。
248
+ # 4. 背景中的植被和环境细节略有不同,但都包含树木和竹子。
249
+ ```
250
+
251
+ ## Citation
252
+
253
+ If you find this project useful in your research, please consider citing:
254
+
255
+ ```BibTeX
256
+ @article{chen2023internvl,
257
+ title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
258
+ author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
259
+ journal={arXiv preprint arXiv:2312.14238},
260
+ year={2023}
261
+ }
262
+ ```
263
+
264
+ ## License
265
+
266
+ This project is released under the MIT license.
267
+
268
+ ## Acknowledgement
269
+
270
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
added_tokens.json ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "</box>": 92552,
3
+ "</img>": 92545,
4
+ "</quad>": 92548,
5
+ "</ref>": 92550,
6
+ "<IMG_CONTEXT>": 92546,
7
+ "<box>": 92551,
8
+ "<img>": 92544,
9
+ "<quad>": 92547,
10
+ "<ref>": 92549
11
+ }
all_results.json ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "epoch": 1.0,
3
+ "train_loss": 0.8170236018231988,
4
+ "train_runtime": 190400.5325,
5
+ "train_samples": 5155291,
6
+ "train_samples_per_second": 27.076,
7
+ "train_steps_per_second": 0.026
8
+ }
example_inference.py ADDED
@@ -0,0 +1,104 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from transformers import AutoTokenizer, AutoModel
2
+ import torch
3
+ import json
4
+ import os
5
+ from tqdm import tqdm
6
+ import torchvision.transforms as T
7
+ from PIL import Image
8
+ from torchvision.transforms.functional import InterpolationMode
9
+ from time import sleep
10
+
11
+
12
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
13
+ IMAGENET_STD = (0.229, 0.224, 0.225)
14
+
15
+ def build_transform(input_size):
16
+ MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
17
+ transform = T.Compose([
18
+ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
19
+ T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
20
+ T.ToTensor(),
21
+ T.Normalize(mean=MEAN, std=STD)
22
+ ])
23
+ return transform
24
+
25
+
26
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
27
+ best_ratio_diff = float('inf')
28
+ best_ratio = (1, 1)
29
+ area = width * height
30
+ for ratio in target_ratios:
31
+ target_aspect_ratio = ratio[0] / ratio[1]
32
+ ratio_diff = abs(aspect_ratio - target_aspect_ratio)
33
+ if ratio_diff < best_ratio_diff:
34
+ best_ratio_diff = ratio_diff
35
+ best_ratio = ratio
36
+ elif ratio_diff == best_ratio_diff:
37
+ if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
38
+ best_ratio = ratio
39
+ return best_ratio
40
+
41
+
42
+ def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
43
+ orig_width, orig_height = image.size
44
+ aspect_ratio = orig_width / orig_height
45
+
46
+ # calculate the existing image aspect ratio
47
+ target_ratios = set(
48
+ (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
49
+ i * j <= max_num and i * j >= min_num)
50
+ target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
51
+
52
+ # find the closest aspect ratio to the target
53
+ target_aspect_ratio = find_closest_aspect_ratio(
54
+ aspect_ratio, target_ratios, orig_width, orig_height, image_size)
55
+
56
+ # calculate the target width and height
57
+ target_width = image_size * target_aspect_ratio[0]
58
+ target_height = image_size * target_aspect_ratio[1]
59
+ blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
60
+
61
+ # resize the image
62
+ resized_img = image.resize((target_width, target_height))
63
+ processed_images = []
64
+ for i in range(blocks):
65
+ box = (
66
+ (i % (target_width // image_size)) * image_size,
67
+ (i // (target_width // image_size)) * image_size,
68
+ ((i % (target_width // image_size)) + 1) * image_size,
69
+ ((i // (target_width // image_size)) + 1) * image_size
70
+ )
71
+ # split the image
72
+ split_img = resized_img.crop(box)
73
+ processed_images.append(split_img)
74
+ assert len(processed_images) == blocks
75
+ if use_thumbnail and len(processed_images) != 1:
76
+ thumbnail_img = image.resize((image_size, image_size))
77
+ processed_images.append(thumbnail_img)
78
+ return processed_images
79
+
80
+
81
+ def load_image(image_file, input_size=448, max_num=6):
82
+ return transform_img(Image.open(image_file))
83
+
84
+ def transform_img(image, input_size=448, max_num=6):
85
+ image = image.convert('RGB')
86
+ transform = build_transform(input_size=input_size)
87
+ images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
88
+ pixel_values = [transform(image) for image in images]
89
+ pixel_values = torch.stack(pixel_values)
90
+ return pixel_values
91
+
92
+ model = AutoModel.from_pretrained(
93
+ "failspy/InternVL-Chat-V1-5-8bit",
94
+ torch_dtype=torch.bfloat16,
95
+ trust_remote_code=True,
96
+ device_map='auto').eval()
97
+
98
+ tokenizer = AutoTokenizer.from_pretrained("failspy/InternVL-Chat-V1-5-8bit", trust_remote_code=True)
99
+
100
+ pixel_values = load_image("./TestImage.jpg").to(torch.bfloat16).cuda()
101
+
102
+ generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
103
+ question = "What is shown in this image?"
104
+ response = model.chat(tokenizer,pixel_values,question,generation_config)
generation_config.json CHANGED
@@ -1,4 +1,4 @@
1
  {
2
  "_from_model_config": true,
3
- "transformers_version": "4.39.2"
4
  }
 
1
  {
2
  "_from_model_config": true,
3
+ "transformers_version": "4.36.2"
4
  }
special_tokens_map.json ADDED
@@ -0,0 +1,41 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "additional_special_tokens": [
3
+ "<img>",
4
+ "</img>",
5
+ "<IMG_CONTEXT>",
6
+ "<quad>",
7
+ "</quad>",
8
+ "<ref>",
9
+ "</ref>",
10
+ "<box>",
11
+ "</box>"
12
+ ],
13
+ "bos_token": {
14
+ "content": "<s>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ },
20
+ "eos_token": {
21
+ "content": "</s>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false
26
+ },
27
+ "pad_token": {
28
+ "content": "</s>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false
33
+ },
34
+ "unk_token": {
35
+ "content": "<unk>",
36
+ "lstrip": false,
37
+ "normalized": false,
38
+ "rstrip": false,
39
+ "single_word": false
40
+ }
41
+ }
tokenization_internlm2.py ADDED
@@ -0,0 +1,235 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
2
+ #
3
+ # This code is based on transformers/src/transformers/models/llama/tokenization_llama.py
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ """Tokenization classes for InternLM."""
18
+ import os
19
+ from shutil import copyfile
20
+ from typing import Any, Dict, List, Optional, Tuple
21
+
22
+ import sentencepiece as spm
23
+ from transformers.tokenization_utils import PreTrainedTokenizer
24
+ from transformers.utils import logging
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+ VOCAB_FILES_NAMES = {'vocab_file': './tokenizer.model'}
29
+
30
+ PRETRAINED_VOCAB_FILES_MAP = {}
31
+
32
+
33
+ # Modified from transformers.model.llama.tokenization_llama.LlamaTokenizer
34
+ class InternLM2Tokenizer(PreTrainedTokenizer):
35
+ """
36
+ Construct a InternLM2 tokenizer. Based on byte-level Byte-Pair-Encoding.
37
+
38
+ Args:
39
+ vocab_file (`str`):
40
+ Path to the vocabulary file.
41
+ """
42
+
43
+ vocab_files_names = VOCAB_FILES_NAMES
44
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
45
+ model_input_names = ['input_ids', 'attention_mask']
46
+ _auto_class = 'AutoTokenizer'
47
+
48
+ def __init__(
49
+ self,
50
+ vocab_file,
51
+ unk_token='<unk>',
52
+ bos_token='<s>',
53
+ eos_token='</s>',
54
+ pad_token='</s>',
55
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
56
+ add_bos_token=True,
57
+ add_eos_token=False,
58
+ decode_with_prefix_space=False,
59
+ clean_up_tokenization_spaces=False,
60
+ **kwargs,
61
+ ):
62
+ self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
63
+ self.vocab_file = vocab_file
64
+ self.add_bos_token = add_bos_token
65
+ self.add_eos_token = add_eos_token
66
+ self.decode_with_prefix_space = decode_with_prefix_space
67
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
68
+ self.sp_model.Load(vocab_file)
69
+ self._no_prefix_space_tokens = None
70
+ super().__init__(
71
+ bos_token=bos_token,
72
+ eos_token=eos_token,
73
+ unk_token=unk_token,
74
+ pad_token=pad_token,
75
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
76
+ **kwargs,
77
+ )
78
+
79
+ @property
80
+ def no_prefix_space_tokens(self):
81
+ if self._no_prefix_space_tokens is None:
82
+ vocab = self.convert_ids_to_tokens(list(range(self.vocab_size)))
83
+ self._no_prefix_space_tokens = {i for i, tok in enumerate(vocab) if not tok.startswith('▁')}
84
+ return self._no_prefix_space_tokens
85
+
86
+ @property
87
+ def vocab_size(self):
88
+ """Returns vocab size"""
89
+ return self.sp_model.get_piece_size()
90
+
91
+ @property
92
+ def bos_token_id(self) -> Optional[int]:
93
+ return self.sp_model.bos_id()
94
+
95
+ @property
96
+ def eos_token_id(self) -> Optional[int]:
97
+ return self.sp_model.eos_id()
98
+
99
+ def get_vocab(self):
100
+ """Returns vocab as a dict"""
101
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
102
+ vocab.update(self.added_tokens_encoder)
103
+ return vocab
104
+
105
+ def _tokenize(self, text):
106
+ """Returns a tokenized string."""
107
+ return self.sp_model.encode(text, out_type=str)
108
+
109
+ def _convert_token_to_id(self, token):
110
+ """Converts a token (str) in an id using the vocab."""
111
+ return self.sp_model.piece_to_id(token)
112
+
113
+ def _convert_id_to_token(self, index):
114
+ """Converts an index (integer) in a token (str) using the vocab."""
115
+ token = self.sp_model.IdToPiece(index)
116
+ return token
117
+
118
+ def _maybe_add_prefix_space(self, tokens, decoded):
119
+ if tokens and tokens[0] not in self.no_prefix_space_tokens:
120
+ return ' ' + decoded
121
+ else:
122
+ return decoded
123
+
124
+ def convert_tokens_to_string(self, tokens):
125
+ """Converts a sequence of tokens (string) in a single string."""
126
+ current_sub_tokens = []
127
+ out_string = ''
128
+ prev_is_special = False
129
+ for token in tokens:
130
+ # make sure that special tokens are not decoded using sentencepiece model
131
+ if token in self.all_special_tokens:
132
+ if not prev_is_special:
133
+ out_string += ' '
134
+ out_string += self.sp_model.decode(current_sub_tokens) + token
135
+ prev_is_special = True
136
+ current_sub_tokens = []
137
+ else:
138
+ current_sub_tokens.append(token)
139
+ prev_is_special = False
140
+ out_string += self.sp_model.decode(current_sub_tokens)
141
+ out_string = self.clean_up_tokenization(out_string)
142
+ out_string = self._maybe_add_prefix_space(tokens=tokens, decoded=out_string)
143
+ return out_string[1:]
144
+
145
+ def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
146
+ """
147
+ Save the vocabulary and special tokens file to a directory.
148
+
149
+ Args:
150
+ save_directory (`str`):
151
+ The directory in which to save the vocabulary.
152
+
153
+ Returns:
154
+ `Tuple(str)`: Paths to the files saved.
155
+ """
156
+ if not os.path.isdir(save_directory):
157
+ logger.error(f'Vocabulary path ({save_directory}) should be a directory')
158
+ return
159
+ out_vocab_file = os.path.join(
160
+ save_directory, (filename_prefix + '-' if filename_prefix else '') + VOCAB_FILES_NAMES['vocab_file']
161
+ )
162
+
163
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
164
+ copyfile(self.vocab_file, out_vocab_file)
165
+ elif not os.path.isfile(self.vocab_file):
166
+ with open(out_vocab_file, 'wb') as fi:
167
+ content_spiece_model = self.sp_model.serialized_model_proto()
168
+ fi.write(content_spiece_model)
169
+
170
+ return (out_vocab_file,)
171
+
172
+ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
173
+ if self.add_bos_token:
174
+ bos_token_ids = [self.bos_token_id]
175
+ else:
176
+ bos_token_ids = []
177
+
178
+ output = bos_token_ids + token_ids_0
179
+
180
+ if token_ids_1 is not None:
181
+ output = output + token_ids_1
182
+
183
+ if self.add_eos_token:
184
+ output = output + [self.eos_token_id]
185
+
186
+ return output
187
+
188
+ def get_special_tokens_mask(
189
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
190
+ ) -> List[int]:
191
+ """
192
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
193
+ special tokens using the tokenizer `prepare_for_model` method.
194
+
195
+ Args:
196
+ token_ids_0 (`List[int]`):
197
+ List of IDs.
198
+ token_ids_1 (`List[int]`, *optional*):
199
+ Optional second list of IDs for sequence pairs.
200
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
201
+ Whether or not the token list is already formatted with special tokens for the model.
202
+
203
+ Returns:
204
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
205
+ """
206
+ if already_has_special_tokens:
207
+ return super().get_special_tokens_mask(
208
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
209
+ )
210
+
211
+ if token_ids_1 is None:
212
+ return [1] + ([0] * len(token_ids_0)) + [1]
213
+ return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
214
+
215
+ def create_token_type_ids_from_sequences(
216
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
217
+ ) -> List[int]:
218
+ """
219
+ Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make
220
+ use of token type ids, therefore a list of zeros is returned.
221
+
222
+ Args:
223
+ token_ids_0 (`List[int]`):
224
+ List of IDs.
225
+ token_ids_1 (`List[int]`, *optional*):
226
+ Optional second list of IDs for sequence pairs.
227
+
228
+ Returns:
229
+ `List[int]`: List of zeros.
230
+ """
231
+ eos = [self.eos_token_id]
232
+
233
+ if token_ids_1 is None:
234
+ return len(token_ids_0 + eos) * [0]
235
+ return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
tokenization_internlm2_fast.py ADDED
@@ -0,0 +1,211 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
2
+ #
3
+ # This code is based on transformers/src/transformers/models/llama/tokenization_llama_fast.py
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ """Tokenization Fast class for InternLM."""
18
+ import os
19
+ from shutil import copyfile
20
+ from typing import Any, Dict, Optional, Tuple
21
+
22
+ from tokenizers import Tokenizer, decoders, normalizers, processors
23
+ from tokenizers.models import BPE
24
+ from transformers.convert_slow_tokenizer import (SLOW_TO_FAST_CONVERTERS,
25
+ SentencePieceExtractor,
26
+ SpmConverter)
27
+ from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
28
+ from transformers.utils import logging
29
+
30
+ from .tokenization_internlm2 import InternLM2Tokenizer
31
+
32
+ logger = logging.get_logger(__name__)
33
+
34
+ VOCAB_FILES_NAMES = {'vocab_file': './tokenizer.model'}
35
+
36
+
37
+ # Modified from transformers.convert_slow_tokenizer.LlamaConverter
38
+ class InternLM2Converter(SpmConverter):
39
+ handle_byte_fallback = True
40
+
41
+ def vocab(self, proto):
42
+ vocab = [
43
+ ('<unk>', 0.0),
44
+ ('<s>', 0.0),
45
+ ('</s>', 0.0),
46
+ ]
47
+ vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
48
+ return vocab
49
+
50
+ def unk_id(self, proto):
51
+ unk_id = 0
52
+ return unk_id
53
+
54
+ def decoder(self, replacement, add_prefix_space):
55
+ return decoders.Sequence(
56
+ [
57
+ decoders.Replace('▁', ' '),
58
+ decoders.ByteFallback(),
59
+ decoders.Fuse(),
60
+ decoders.Strip(content=' ', left=1),
61
+ ]
62
+ )
63
+
64
+ def tokenizer(self, proto):
65
+ model_type = proto.trainer_spec.model_type
66
+ vocab_scores = self.vocab(proto)
67
+ # special tokens
68
+ added_tokens = self.original_tokenizer.added_tokens_decoder
69
+ for i in range(len(vocab_scores)):
70
+ piece, score = vocab_scores[i]
71
+ if i in added_tokens:
72
+ vocab_scores[i] = (added_tokens[i].content, score)
73
+ if model_type == 1:
74
+ raise RuntimeError('InternLM2 is supposed to be a BPE model!')
75
+
76
+ elif model_type == 2:
77
+ _, merges = SentencePieceExtractor(self.original_tokenizer.vocab_file).extract(vocab_scores)
78
+ bpe_vocab = {word: i for i, (word, _score) in enumerate(vocab_scores)}
79
+ tokenizer = Tokenizer(
80
+ BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
81
+ )
82
+ tokenizer.add_special_tokens(
83
+ [ added_token for index, added_token in added_tokens.items()]
84
+ )
85
+ else:
86
+ raise Exception(
87
+ "You're trying to run a `Unigram` model but you're file was trained with a different algorithm"
88
+ )
89
+
90
+ return tokenizer
91
+
92
+ def normalizer(self, proto):
93
+ normalizers_list = []
94
+ if proto.normalizer_spec.add_dummy_prefix:
95
+ normalizers_list.append(normalizers.Prepend(prepend='▁'))
96
+ normalizers_list.append(normalizers.Replace(pattern=' ', content='▁'))
97
+ return normalizers.Sequence(normalizers_list)
98
+
99
+ def pre_tokenizer(self, replacement, add_prefix_space):
100
+ return None
101
+
102
+
103
+ SLOW_TO_FAST_CONVERTERS['InternLM2Tokenizer'] = InternLM2Converter
104
+
105
+
106
+ # Modified from transformers.model.llama.tokenization_llama_fast.LlamaTokenizerFast -> InternLM2TokenizerFast
107
+ class InternLM2TokenizerFast(PreTrainedTokenizerFast):
108
+ vocab_files_names = VOCAB_FILES_NAMES
109
+ slow_tokenizer_class = InternLM2Tokenizer
110
+ padding_side = 'left'
111
+ model_input_names = ['input_ids', 'attention_mask']
112
+ _auto_class = 'AutoTokenizer'
113
+
114
+ def __init__(
115
+ self,
116
+ vocab_file,
117
+ unk_token='<unk>',
118
+ bos_token='<s>',
119
+ eos_token='</s>',
120
+ pad_token='</s>',
121
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
122
+ add_bos_token=True,
123
+ add_eos_token=False,
124
+ decode_with_prefix_space=False,
125
+ clean_up_tokenization_spaces=False,
126
+ **kwargs,
127
+ ):
128
+ super().__init__(
129
+ vocab_file=vocab_file,
130
+ unk_token=unk_token,
131
+ bos_token=bos_token,
132
+ eos_token=eos_token,
133
+ pad_token=pad_token,
134
+ sp_model_kwargs=sp_model_kwargs,
135
+ add_bos_token=add_bos_token,
136
+ add_eos_token=add_eos_token,
137
+ decode_with_prefix_space=decode_with_prefix_space,
138
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
139
+ **kwargs,
140
+ )
141
+ self._add_bos_token = add_bos_token
142
+ self._add_eos_token = add_eos_token
143
+ self.update_post_processor()
144
+ self.vocab_file = vocab_file
145
+
146
+ @property
147
+ def can_save_slow_tokenizer(self) -> bool:
148
+ return os.path.isfile(self.vocab_file) if self.vocab_file else False
149
+
150
+ def update_post_processor(self):
151
+ """
152
+ Updates the underlying post processor with the current `bos_token` and `eos_token`.
153
+ """
154
+ bos = self.bos_token
155
+ bos_token_id = self.bos_token_id
156
+ if bos is None and self.add_bos_token:
157
+ raise ValueError('add_bos_token = True but bos_token = None')
158
+
159
+ eos = self.eos_token
160
+ eos_token_id = self.eos_token_id
161
+ if eos is None and self.add_eos_token:
162
+ raise ValueError('add_eos_token = True but eos_token = None')
163
+
164
+ single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
165
+ pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"
166
+
167
+ special_tokens = []
168
+ if self.add_bos_token:
169
+ special_tokens.append((bos, bos_token_id))
170
+ if self.add_eos_token:
171
+ special_tokens.append((eos, eos_token_id))
172
+ self._tokenizer.post_processor = processors.TemplateProcessing(
173
+ single=single, pair=pair, special_tokens=special_tokens
174
+ )
175
+
176
+ @property
177
+ def add_eos_token(self):
178
+ return self._add_eos_token
179
+
180
+ @property
181
+ def add_bos_token(self):
182
+ return self._add_bos_token
183
+
184
+ @add_eos_token.setter
185
+ def add_eos_token(self, value):
186
+ self._add_eos_token = value
187
+ self.update_post_processor()
188
+
189
+ @add_bos_token.setter
190
+ def add_bos_token(self, value):
191
+ self._add_bos_token = value
192
+ self.update_post_processor()
193
+
194
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
195
+ if not self.can_save_slow_tokenizer:
196
+ raise ValueError(
197
+ 'Your fast tokenizer does not have the necessary information to save the vocabulary for a slow '
198
+ 'tokenizer.'
199
+ )
200
+
201
+ if not os.path.isdir(save_directory):
202
+ logger.error(f'Vocabulary path ({save_directory}) should be a directory')
203
+ return
204
+ out_vocab_file = os.path.join(
205
+ save_directory, (filename_prefix + '-' if filename_prefix else '') + VOCAB_FILES_NAMES['vocab_file']
206
+ )
207
+
208
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
209
+ copyfile(self.vocab_file, out_vocab_file)
210
+
211
+ return (out_vocab_file,)
tokenizer.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f868398fc4e05ee1e8aeba95ddf18ddcc45b8bce55d5093bead5bbf80429b48b
3
+ size 1477754
tokenizer_config.json ADDED
@@ -0,0 +1,173 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<unk>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<s>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "</s>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "92538": {
28
+ "content": "<|plugin|>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "92539": {
36
+ "content": "<|interpreter|>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "92540": {
44
+ "content": "<|action_end|>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "92541": {
52
+ "content": "<|action_start|>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "92542": {
60
+ "content": "<|im_end|>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "92543": {
68
+ "content": "<|im_start|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "92544": {
76
+ "content": "<img>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "92545": {
84
+ "content": "</img>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "92546": {
92
+ "content": "<IMG_CONTEXT>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "92547": {
100
+ "content": "<quad>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "92548": {
108
+ "content": "</quad>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": true
114
+ },
115
+ "92549": {
116
+ "content": "<ref>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": true
122
+ },
123
+ "92550": {
124
+ "content": "</ref>",
125
+ "lstrip": false,
126
+ "normalized": false,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": true
130
+ },
131
+ "92551": {
132
+ "content": "<box>",
133
+ "lstrip": false,
134
+ "normalized": false,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": true
138
+ },
139
+ "92552": {
140
+ "content": "</box>",
141
+ "lstrip": false,
142
+ "normalized": false,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": true
146
+ }
147
+ },
148
+ "additional_special_tokens": [
149
+ "<img>",
150
+ "</img>",
151
+ "<IMG_CONTEXT>",
152
+ "<quad>",
153
+ "</quad>",
154
+ "<ref>",
155
+ "</ref>",
156
+ "<box>",
157
+ "</box>"
158
+ ],
159
+ "auto_map": {
160
+ "AutoTokenizer": [
161
+ "tokenization_internlm2.InternLM2Tokenizer",
162
+ null
163
+ ]
164
+ },
165
+ "bos_token": "<s>",
166
+ "chat_template": "{{ bos_token }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
167
+ "clean_up_tokenization_spaces": false,
168
+ "eos_token": "</s>",
169
+ "model_max_length": 4096,
170
+ "pad_token": "</s>",
171
+ "tokenizer_class": "InternLM2Tokenizer",
172
+ "unk_token": "<unk>"
173
+ }