.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
- examples/red-panda.mp4 filter=lfs diff=lfs merge=lfs -text
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
README.md CHANGED
@@ -1,186 +1,110 @@
1
  ---
2
  license: mit
3
- pipeline_tag: image-text-to-text
4
- library_name: transformers
5
- base_model:
6
- - OpenGVLab/InternViT-6B-448px-V1-5
7
- - internlm/internlm2-chat-20b
8
- base_model_relation: merge
9
- language:
10
- - multilingual
11
- tags:
12
- - internvl
13
- - vision
14
- - ocr
15
- - multi-image
16
- - video
17
- - custom_code
18
  ---
19
 
20
- # InternVL-Chat-V1-5
21
-
22
- [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821)
23
-
24
- [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
25
-
26
- ## Introduction
27
-
28
  <p align="center">
29
- <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300">
30
  </p>
31
 
32
  > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
33
 
34
- We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) designed to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
35
 
36
- We introduce three simple designs:
 
 
 
 
37
 
38
- 1. **Strong Vision Encoder:** we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs.
39
- 2. **Dynamic High-Resolution:** we divide images into 1 to 40 tiles of 448 × 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input during inference.
40
- 3. **High-Quality Bilingual Dataset:** we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
41
 
42
  ## Model Details
43
-
44
  - **Model Type:** multimodal large language model (MLLM)
45
-
46
  - **Model Stats:**
47
-
48
  - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
49
  - Image size: dynamic resolution, up to 40 tiles of 448 x 448 (4K resolution).
50
  - Params: 25.5B
51
 
52
  - **Training Strategy:**
53
-
54
- - Learnable component in the pre-training stage: ViT + MLP
55
- - Learnable component in the fine-tuning stage: ViT + MLP + LLM
56
- - For more details on training hyperparameters, please see our [blog](https://internvl.github.io/blog/2024-04-30-InternVL-1.5/).
57
-
58
- ## Architecture
59
-
60
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/YLvX3V-L0kwsyRn3Lhciw.png)
 
 
 
 
 
 
 
61
 
62
  ## Performance
63
 
64
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/4b85G7txoJ_LpT19SZJ4A.png)
65
 
66
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/i2vp6zSHPS3UIr-1Q9cSe.png)
67
 
68
- - We simultaneously use [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using the VLMEvalKit.
69
-
70
- - Please note that evaluating the same model using different testing toolkits like [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
71
-
72
- Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
73
 
74
  ## Examples
75
 
76
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/YVr-93mvVMR6UFpGezns7.png)
77
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/ivhj4QqcO2NHUa28DTDkK.png)
78
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/18GeOW10QVcSt5g--TgDY.png)
79
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/tGM_TwdV297H1fCxQ0PZU.png)
80
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/FwlSRBpKgURAVkXNOLoSp.png)
81
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/to3nOaAnyv-fGLEoNPLzz.png)
82
-
83
- ## Quick Start
84
 
85
- We provide example code to run InternVL-Chat-V1-5 using `transformers`.
86
 
87
- We also welcome you to experience the InternVL2 series models in our [online demo](https://internvl.opengvlab.com/).
88
 
89
- > Please use transformers==4.37.2 to ensure the model works normally.
90
 
91
- ### Model Loading
92
 
93
- #### 16-bit (bf16 / fp16)
94
 
95
- ```python
96
- import torch
97
- from transformers import AutoTokenizer, AutoModel
98
- path = "OpenGVLab/InternVL-Chat-V1-5"
99
- model = AutoModel.from_pretrained(
100
- path,
101
- torch_dtype=torch.bfloat16,
102
- low_cpu_mem_usage=True,
103
- use_flash_attn=True,
104
- trust_remote_code=True).eval().cuda()
105
- ```
106
 
107
- #### BNB 8-bit Quantization
108
-
109
- ```python
110
- import torch
111
- from transformers import AutoTokenizer, AutoModel
112
- path = "OpenGVLab/InternVL-Chat-V1-5"
113
- model = AutoModel.from_pretrained(
114
- path,
115
- torch_dtype=torch.bfloat16,
116
- load_in_8bit=True,
117
- low_cpu_mem_usage=True,
118
- use_flash_attn=True,
119
- trust_remote_code=True).eval()
120
- ```
121
 
122
- #### BNB 4-bit Quantization
123
 
124
- > **⚠️ Warning:** Due to significant quantization errors with BNB 4-bit quantization on InternViT-6B, the model may produce nonsensical outputs and fail to understand images. Therefore, please avoid using BNB 4-bit quantization.
125
 
126
- #### Multiple GPUs
127
 
128
- The code below is written this way to avoid errors during multi-GPU inference caused by tensors ending up on different devices. Keeping the first and last layers of the large language model (LLM) on the same device prevents such errors.
129
 
130
- ```python
131
- import math
132
- import torch
133
- from transformers import AutoTokenizer, AutoModel
134
 
135
- def split_model(model_name):
136
- device_map = {}
137
- world_size = torch.cuda.device_count()
138
- num_layers = {'Mini-InternVL-2B-V1-5': 24, 'Mini-InternVL-4B-V1-5': 32, 'InternVL-Chat-V1-5': 48}[model_name]
139
- # Since the first GPU will be used for ViT, treat it as half a GPU.
140
- num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
141
- num_layers_per_gpu = [num_layers_per_gpu] * world_size
142
- num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
143
- layer_cnt = 0
144
- for i, num_layer in enumerate(num_layers_per_gpu):
145
- for j in range(num_layer):
146
- device_map[f'language_model.model.layers.{layer_cnt}'] = i
147
- layer_cnt += 1
148
- device_map['vision_model'] = 0
149
- device_map['mlp1'] = 0
150
- device_map['language_model.model.tok_embeddings'] = 0
151
- device_map['language_model.model.embed_tokens'] = 0
152
- device_map['language_model.output'] = 0
153
- device_map['language_model.model.norm'] = 0
154
- device_map['language_model.lm_head'] = 0
155
- device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
156
-
157
- return device_map
158
 
159
- path = "OpenGVLab/InternVL-Chat-V1-5"
160
- device_map = split_model('InternVL-Chat-V1-5')
161
- model = AutoModel.from_pretrained(
162
- path,
163
- torch_dtype=torch.bfloat16,
164
- low_cpu_mem_usage=True,
165
- use_flash_attn=True,
166
- trust_remote_code=True,
167
- device_map=device_map).eval()
168
- ```
169
 
170
- ### Inference with Transformers
171
 
172
  ```python
173
- import numpy as np
 
 
 
174
  import torch
175
  import torchvision.transforms as T
176
- from decord import VideoReader, cpu
177
  from PIL import Image
 
178
  from torchvision.transforms.functional import InterpolationMode
179
- from transformers import AutoModel, AutoTokenizer
180
 
181
  IMAGENET_MEAN = (0.485, 0.456, 0.406)
182
  IMAGENET_STD = (0.229, 0.224, 0.225)
183
 
 
184
  def build_transform(input_size):
185
  MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
186
  transform = T.Compose([
@@ -191,6 +115,7 @@ def build_transform(input_size):
191
  ])
192
  return transform
193
 
 
194
  def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
195
  best_ratio_diff = float('inf')
196
  best_ratio = (1, 1)
@@ -206,7 +131,8 @@ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_
206
  best_ratio = ratio
207
  return best_ratio
208
 
209
- def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
 
210
  orig_width, orig_height = image.size
211
  aspect_ratio = orig_width / orig_height
212
 
@@ -244,7 +170,8 @@ def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbna
244
  processed_images.append(thumbnail_img)
245
  return processed_images
246
 
247
- def load_image(image_file, input_size=448, max_num=12):
 
248
  image = Image.open(image_file).convert('RGB')
249
  transform = build_transform(input_size=input_size)
250
  images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
@@ -252,310 +179,75 @@ def load_image(image_file, input_size=448, max_num=12):
252
  pixel_values = torch.stack(pixel_values)
253
  return pixel_values
254
 
 
 
255
  # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
256
- # Otherwise, you need to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
257
- path = 'OpenGVLab/InternVL-Chat-V1-5'
258
  model = AutoModel.from_pretrained(
259
  path,
260
  torch_dtype=torch.bfloat16,
261
  low_cpu_mem_usage=True,
262
- use_flash_attn=True,
263
  trust_remote_code=True).eval().cuda()
264
- tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
265
-
 
 
 
 
 
 
 
266
  # set the max number of tiles in `max_num`
267
- pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
268
- generation_config = dict(max_new_tokens=1024, do_sample=True)
269
-
270
- # pure-text conversation (纯文本对话)
271
- question = 'Hello, who are you?'
272
- response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
273
- print(f'User: {question}\nAssistant: {response}')
274
 
275
- question = 'Can you tell me a story?'
276
- response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
277
- print(f'User: {question}\nAssistant: {response}')
 
 
278
 
279
- # single-image single-round conversation (单图单轮对话)
280
- question = '<image>\nPlease describe the image shortly.'
281
  response = model.chat(tokenizer, pixel_values, question, generation_config)
282
- print(f'User: {question}\nAssistant: {response}')
283
 
284
- # single-image multi-round conversation (单图多轮对话)
285
- question = '<image>\nPlease describe the image in detail.'
286
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
287
- print(f'User: {question}\nAssistant: {response}')
288
 
289
- question = 'Please write a poem according to the image.'
290
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
291
- print(f'User: {question}\nAssistant: {response}')
292
-
293
- # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
294
- pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
295
- pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
296
- pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
297
 
298
- question = '<image>\nDescribe the two images in detail.'
299
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
300
- history=None, return_history=True)
301
- print(f'User: {question}\nAssistant: {response}')
302
-
303
- question = 'What are the similarities and differences between these two images.'
304
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
305
- history=history, return_history=True)
306
- print(f'User: {question}\nAssistant: {response}')
307
-
308
- # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
309
- pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
310
- pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
311
- pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
312
- num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
313
-
314
- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
315
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
316
- num_patches_list=num_patches_list,
317
- history=None, return_history=True)
318
- print(f'User: {question}\nAssistant: {response}')
319
-
320
- question = 'What are the similarities and differences between these two images.'
321
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
322
- num_patches_list=num_patches_list,
323
- history=history, return_history=True)
324
- print(f'User: {question}\nAssistant: {response}')
325
-
326
- # batch inference, single image per sample (单图批处理)
327
- pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
328
- pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
329
- num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
330
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
331
 
332
- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
333
- responses = model.batch_chat(tokenizer, pixel_values,
334
- num_patches_list=num_patches_list,
335
- questions=questions,
336
- generation_config=generation_config)
337
- for question, response in zip(questions, responses):
338
- print(f'User: {question}\nAssistant: {response}')
339
-
340
- # video multi-round conversation (视频多轮对话)
341
- def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
342
- if bound:
343
- start, end = bound[0], bound[1]
344
- else:
345
- start, end = -100000, 100000
346
- start_idx = max(first_idx, round(start * fps))
347
- end_idx = min(round(end * fps), max_frame)
348
- seg_size = float(end_idx - start_idx) / num_segments
349
- frame_indices = np.array([
350
- int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
351
- for idx in range(num_segments)
352
- ])
353
- return frame_indices
354
-
355
- def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
356
- vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
357
- max_frame = len(vr) - 1
358
- fps = float(vr.get_avg_fps())
359
-
360
- pixel_values_list, num_patches_list = [], []
361
- transform = build_transform(input_size=input_size)
362
- frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
363
- for frame_index in frame_indices:
364
- img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
365
- img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
366
- pixel_values = [transform(tile) for tile in img]
367
- pixel_values = torch.stack(pixel_values)
368
- num_patches_list.append(pixel_values.shape[0])
369
- pixel_values_list.append(pixel_values)
370
- pixel_values = torch.cat(pixel_values_list)
371
- return pixel_values, num_patches_list
372
-
373
- video_path = './examples/red-panda.mp4'
374
- pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
375
- pixel_values = pixel_values.to(torch.bfloat16).cuda()
376
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
377
- question = video_prefix + 'What is the red panda doing?'
378
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
379
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
380
- num_patches_list=num_patches_list, history=None, return_history=True)
381
- print(f'User: {question}\nAssistant: {response}')
382
-
383
- question = 'Describe this video in detail. Don\'t repeat.'
384
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
385
- num_patches_list=num_patches_list, history=history, return_history=True)
386
- print(f'User: {question}\nAssistant: {response}')
387
- ```
388
-
389
- #### Streaming output
390
-
391
- Besides this method, you can also use the following code to stream the output.
392
-
393
- ```python
394
- from transformers import TextIteratorStreamer
395
- from threading import Thread
396
-
397
- # Initialize the streamer
398
- streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
399
- # Define the generation configuration
400
- generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
401
- # Start the model chat in a separate thread
402
- thread = Thread(target=model.chat, kwargs=dict(
403
- tokenizer=tokenizer, pixel_values=pixel_values, question=question,
404
- history=None, return_history=False, generation_config=generation_config,
405
- ))
406
- thread.start()
407
-
408
- # Initialize an empty string to store the generated text
409
- generated_text = ''
410
- # Loop through the streamer to get the new text as it is generated
411
- for new_text in streamer:
412
- if new_text == model.conv_template.sep:
413
- break
414
- generated_text += new_text
415
- print(new_text, end='', flush=True) # Print each new chunk of generated text on the same line
416
- ```
417
-
418
- ## Finetune
419
-
420
- Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.
421
-
422
- ## Deployment
423
-
424
- ### LMDeploy
425
-
426
- LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
427
-
428
- ```sh
429
- pip install lmdeploy==0.5.3
430
- ```
431
-
432
- LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
433
-
434
- #### A 'Hello, world' example
435
-
436
- ```python
437
- from lmdeploy import pipeline, TurbomindEngineConfig
438
- from lmdeploy.vl import load_image
439
-
440
- model = 'OpenGVLab/InternVL-Chat-V1-5'
441
- image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
442
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
443
- response = pipe(('describe this image', image))
444
- print(response.text)
445
- ```
446
-
447
- If an `ImportError` occurs while running this example, please install the required dependencies as prompted.
448
-
449
- #### Multi-images inference
450
-
451
- When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
452
-
453
- > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
454
-
455
- ```python
456
- from lmdeploy import pipeline, TurbomindEngineConfig
457
- from lmdeploy.vl import load_image
458
- from lmdeploy.vl.constants import IMAGE_TOKEN
459
-
460
- model = 'OpenGVLab/InternVL-Chat-V1-5'
461
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
462
-
463
- image_urls=[
464
- 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
465
- 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
466
- ]
467
-
468
- images = [load_image(img_url) for img_url in image_urls]
469
- # Numbering images improves multi-image conversations
470
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
471
- print(response.text)
472
- ```
473
-
474
- #### Batch prompts inference
475
-
476
- Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
477
-
478
- ```python
479
- from lmdeploy import pipeline, TurbomindEngineConfig
480
- from lmdeploy.vl import load_image
481
-
482
- model = 'OpenGVLab/InternVL-Chat-V1-5'
483
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
484
-
485
- image_urls=[
486
- "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
487
- "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
488
- ]
489
- prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
490
- response = pipe(prompts)
491
- print(response)
492
- ```
493
-
494
- #### Multi-turn conversation
495
-
496
- There are two ways to run multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface. A sketch of the message-based approach is given after the example below.
497
-
498
- ```python
499
- from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
500
- from lmdeploy.vl import load_image
501
-
502
- model = 'OpenGVLab/InternVL-Chat-V1-5'
503
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
504
-
505
- image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
506
- gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
507
- sess = pipe.chat(('describe this image', image), gen_config=gen_config)
508
- print(sess.response.text)
509
- sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
510
- print(sess.response.text)
511
- ```
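-
- For the first approach, below is a minimal sketch of a multi-turn conversation using OpenAI-format messages passed directly to the pipeline. The message schema follows LMDeploy's documented VLM usage, but details may differ across versions, so treat it as an outline rather than a verified recipe.
-
- ```python
- from lmdeploy import pipeline, TurbomindEngineConfig
-
- pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5',
-                 backend_config=TurbomindEngineConfig(session_len=8192))
-
- # First turn: an OpenAI-style user message mixing text and an image URL.
- messages = [dict(role='user', content=[
-     dict(type='text', text='describe this image'),
-     dict(type='image_url', image_url=dict(
-         url='https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')),
- ])]
- out = pipe(messages)
- print(out.text)
-
- # Next turn: append the assistant reply and the follow-up question, then call the pipeline again.
- messages.append(dict(role='assistant', content=out.text))
- messages.append(dict(role='user', content='What is the woman doing?'))
- out = pipe(messages)
- print(out.text)
- ```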
512
-
513
- #### Service
514
-
515
- LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:
516
-
517
- ```shell
518
- lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5 --backend turbomind --server-port 23333
519
- ```
520
-
521
- To use the OpenAI-style interface, you need to install OpenAI:
522
-
523
- ```shell
524
- pip install openai
525
- ```
526
-
527
- Then, use the code below to make the API call:
528
 
529
- ```python
530
- from openai import OpenAI
531
-
532
- client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
533
- model_name = client.models.list().data[0].id
534
- response = client.chat.completions.create(
535
- model=model_name,
536
- messages=[{
537
- 'role':
538
- 'user',
539
- 'content': [{
540
- 'type': 'text',
541
- 'text': 'describe this image',
542
- }, {
543
- 'type': 'image_url',
544
- 'image_url': {
545
- 'url':
546
- 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
547
- },
548
- }],
549
- }],
550
- temperature=0.8,
551
- top_p=0.8)
552
- print(response)
553
  ```
554
 
555
- ## License
556
-
557
- This project is released under the MIT license, while InternLM2 is licensed under the Apache-2.0 license.
558
-
559
  ## Citation
560
 
561
  If you find this project useful in your research, please consider citing:
@@ -567,10 +259,12 @@ If you find this project useful in your research, please consider citing:
567
  journal={arXiv preprint arXiv:2312.14238},
568
  year={2023}
569
  }
570
- @article{chen2024far,
571
- title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
572
- author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
573
- journal={arXiv preprint arXiv:2404.16821},
574
- year={2024}
575
- }
576
  ```
 
 
 
 
 
 
 
 
 
1
  ---
2
  license: mit
3
+ datasets:
4
+ - laion/laion2B-en
5
+ - laion/laion-coco
6
+ - laion/laion2B-multi
7
+ - kakaobrain/coyo-700m
8
+ - conceptual_captions
9
+ - wanng/wukong100m
10
+ pipeline_tag: visual-question-answering
 
 
 
 
 
 
 
11
  ---
12
 
13
+ # Model Card for InternVL-Chat-V1.5
 
 
 
 
 
 
 
14
  <p align="center">
15
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
16
  </p>
17
 
18
  > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
19
 
20
+ \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]
21
 
22
+ We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) designed to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
23
+ We introduce three simple designs:
24
+ 1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs.
25
+ 2. Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448 &times; 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input (a small sketch of the tiling rule follows this list).
26
+ 3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
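+
+ To make the dynamic tiling concrete, here is a minimal sketch (assuming a 448-pixel tile and a 40-tile budget) of how a tile grid can be chosen: enumerate the grids whose tile count fits the budget and keep the one whose aspect ratio is closest to the input image. The `choose_tile_grid` helper is hypothetical and only illustrates the idea; the actual preprocessing code (`dynamic_preprocess`) appears in the usage example further down.
+
+ ```python
+ # Illustrative sketch only: pick the tile grid (cols x rows, at most 40 tiles of
+ # 448 x 448) whose aspect ratio is closest to that of the input image.
+ def choose_tile_grid(width, height, tile_size=448, max_tiles=40):
+     aspect = width / height
+     candidates = [(c, r) for c in range(1, max_tiles + 1)
+                   for r in range(1, max_tiles + 1) if c * r <= max_tiles]
+     cols, rows = min(candidates, key=lambda cr: abs(cr[0] / cr[1] - aspect))
+     # The image is resized to (cols * 448) x (rows * 448) and cut into cols * rows tiles.
+     return cols, rows, cols * tile_size, rows * tile_size
+
+ print(choose_tile_grid(3840, 2160))  # a 4K frame -> (7, 4, 3136, 1792), i.e. 28 tiles
+ ```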
27
 
 
 
 
28
 
29
  ## Model Details
 
30
  - **Model Type:** multimodal large language model (MLLM)
 
31
  - **Model Stats:**
 
32
  - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
33
  - Image size: dynamic resolution, up to 40 tiles of 448 x 448 (4K resolution).
34
  - Params: 25.5B
35
 
36
  - **Training Strategy:**
37
+ - Pretraining Stage
38
+ - Learnable Component: ViT + MLP
39
+ - Data: Please see our technical report.
40
+ - SFT Stage
41
+ - Learnable Component: ViT + MLP + LLM
42
+ - Data: Please see our technical report.
43
+
44
+ ## Released Models
45
+
46
+ | Model | Vision Foundation Model | Release Date | Note |
47
+ | :---------------------------------------------------------:|:--------------------------------------------------------------------------: |:----------------------:| :---------------------------------- |
48
+ | InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) |2024.04.18 | supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥new)|
49
+ | InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |2024.02.21 | more SFT data and stronger performance |
50
+ | InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |2024.02.11 | scales the LLM up to 34B |
51
+ | InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) |InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) |2024.01.24 | supports Chinese and stronger OCR |
52
 
53
  ## Performance
54
 
55
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/ZyQklQ3C7C60I-xOv7X8L.png)
56
 
 
57
 
58
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/u98oqlnpZtWdq2dnarVlD.png)
 
 
 
 
59
 
60
  ## Examples
61
 
62
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/R34jISP4K1U17m9yNP38O.png)
 
 
 
 
 
 
 
63
 
64
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/ChkU9XtlsjH0l2EqlO_is.png)
65
 
66
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/1TFxIcf96ANRPLoy4-rbh.png)
67
 
68
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/Wpjo1Sdwf7XcEDevqwcr-.png)
69
 
70
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/kO4-J38sN8TFtmQ5mIBMS.png)
71
 
72
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/qPnTe3Q9UBy8wbclOsmWk.png)
73
 
74
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/l_BILRi13CbZNzbZYn6o6.png)
75
 
76
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/2782y7RnvGBogYEIG__7S.png)
77
 
78
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/RyO35PTH14OFiwyxtAZM2.png)
79
 
80
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/xiLZXWL-JiCTVPnV_VxS2.png)
81
 
82
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/gqX46Tt5jvrcVqb0vcf06.png)
83
 
 
84
 
 
 
 
 
85
 
86
+ ## Model Usage
87
 
88
+ We provide example code to run InternVL-Chat-V1.5 using `transformers`.
89
 
90
+ You can also try this model quickly in our [online demo](https://internvl.opengvlab.com/).
91
 
92
  ```python
93
+ import json
94
+ import os
95
+ from transformers import AutoTokenizer, AutoModel
96
+ from tqdm import tqdm
97
  import torch
98
  import torchvision.transforms as T
 
99
  from PIL import Image
100
+
101
  from torchvision.transforms.functional import InterpolationMode
102
+
103
 
104
  IMAGENET_MEAN = (0.485, 0.456, 0.406)
105
  IMAGENET_STD = (0.229, 0.224, 0.225)
106
 
107
+
108
  def build_transform(input_size):
109
  MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
110
  transform = T.Compose([
 
115
  ])
116
  return transform
117
 
118
+
119
  def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
120
  best_ratio_diff = float('inf')
121
  best_ratio = (1, 1)
 
131
  best_ratio = ratio
132
  return best_ratio
133
 
134
+
135
+ def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
136
  orig_width, orig_height = image.size
137
  aspect_ratio = orig_width / orig_height
138
 
 
170
  processed_images.append(thumbnail_img)
171
  return processed_images
172
 
173
+
174
+ def load_image(image_file, input_size=448, max_num=6):
175
  image = Image.open(image_file).convert('RGB')
176
  transform = build_transform(input_size=input_size)
177
  images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
 
179
  pixel_values = torch.stack(pixel_values)
180
  return pixel_values
181
 
182
+
183
+ path = "OpenGVLab/InternVL-Chat-V1-5"
184
  # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
 
 
185
  model = AutoModel.from_pretrained(
186
  path,
187
  torch_dtype=torch.bfloat16,
188
  low_cpu_mem_usage=True,
 
189
  trust_remote_code=True).eval().cuda()
190
+ # Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.
191
+ # model = AutoModel.from_pretrained(
192
+ # path,
193
+ # torch_dtype=torch.bfloat16,
194
+ # low_cpu_mem_usage=True,
195
+ # trust_remote_code=True,
196
+ # device_map='auto').eval()
197
+
198
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
199
  # set the max number of tiles in `max_num`
200
+ pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
 
 
 
 
 
 
201
 
202
+ generation_config = dict(
203
+ num_beams=1,
204
+ max_new_tokens=512,
205
+ do_sample=False,
206
+ )
207
 
208
+ # single-round single-image conversation
209
+ question = "请详细描述图片"  # "Please describe the image in detail."
210
  response = model.chat(tokenizer, pixel_values, question, generation_config)
211
+ print(question, response)
212
 
213
+ # multi-round single-image conversation
214
+ question = "请详细描述图片"  # "Please describe the image in detail."
215
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
216
+ print(question, response)
217
 
218
+ question = "请根据图片写一首诗"  # "Please write a poem based on the image."
219
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
220
+ print(question, response)
 
 
 
 
 
221
 
222
+ # multi-round multi-image conversation
223
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
224
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
225
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
226
 
227
+ question = "详细描述这两张图片"  # "Describe these two images in detail."
228
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
229
+ print(question, response)
230
+ # 第一张图片是一只红熊猫,它有着独特的橙红色皮毛,脸部、耳朵和四肢的末端有白色斑块。红熊猫的眼睛周围有深色的环,它的耳朵是圆形的,上面有白色的毛。它正坐在一个木制的结构上,看起来像是一个平台或休息的地方。背景中有树木和竹子,这表明红熊猫可能在一个模拟自然环境的动物园或保护区内。
231
+ #
232
+ # 第二张图片是一只大熊猫,它是中国的国宝,以其黑白相间的皮毛而闻名。大熊猫的眼睛、耳朵和四肢的末端是黑色的,而它的脸部、耳朵内侧和身体其他部分是白色的。大熊猫正坐在地上,周围有竹子,这是它们的主要食物来源。背景中也有树木,这表明大熊猫可能在一个为它们提供自然栖息地模拟的动物园或保护区内。
233
 
234
+ question = "这两张图片的相同点和区别分别是什么"  # "What are the similarities and differences between these two images?"
235
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
236
+ print(question, response)
237
+ # 这两张图片的相同点:
238
+ #
239
+ # 1. 都展示了熊猫,这是两种不同的熊猫物种。
240
+ # 2. 熊猫都处于一个看起来像是模拟自然环境的场所,可能是动物园或保护区。
241
+ # 3. 熊猫周围都有竹子,这是它们的主要食物来源。
242
+ #
243
+ # 这两张图片的区别:
244
+ #
245
+ # 1. 熊猫的种类不同:第一张图片是一只红熊猫,第二张图片是一只大熊猫。
246
+ # 2. 熊猫的皮毛颜色和图案不同:红熊猫的皮毛是橙红色,脸部、耳朵和四肢的末端有白色斑块;而大熊猫的皮毛是黑白相间的,眼睛、耳朵和四肢的末端是黑色的,脸部、耳朵内侧和身体其他部分是白色的。
247
+ # 3. 熊猫的姿态和位置不同:红熊猫坐在一个木制的结构上,而大熊猫坐在地上。
248
+ # 4. 背景中的植被和环境细节略有不同,但都包含树木和竹子。
249
  ```
250
 
 
 
 
 
251
  ## Citation
252
 
253
  If you find this project useful in your research, please consider citing:
 
259
  journal={arXiv preprint arXiv:2312.14238},
260
  year={2023}
261
  }
 
 
 
 
 
 
262
  ```
263
+
264
+ ## License
265
+
266
+ This project is released under the MIT license.
267
+
268
+ ## Acknowledgement
269
+
270
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
all_results.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "epoch": 1.0,
3
+ "train_loss": 0.8170236018231988,
4
+ "train_runtime": 190400.5325,
5
+ "train_samples": 5155291,
6
+ "train_samples_per_second": 27.076,
7
+ "train_steps_per_second": 0.026
8
+ }
config.json CHANGED
@@ -1,19 +1,19 @@
1
  {
2
  "_commit_hash": null,
 
3
  "architectures": [
4
  "InternVLChatModel"
5
  ],
6
  "auto_map": {
7
  "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
8
- "AutoModel": "modeling_internvl_chat.InternVLChatModel",
9
- "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
10
  },
11
- "system_message": "You are an AI assistant whose name is InternLM (书生·浦语).",
12
  "downsample_ratio": 0.5,
13
  "dynamic_image_size": true,
14
  "force_image_size": 448,
 
15
  "llm_config": {
16
- "_name_or_path": "internlm/internlm2-chat-20b",
17
  "add_cross_attention": false,
18
  "architectures": [
19
  "InternLM2ForCausalLM"
@@ -95,49 +95,108 @@
95
  "top_p": 1.0,
96
  "torch_dtype": "bfloat16",
97
  "torchscript": false,
98
- "transformers_version": "4.37.2",
99
  "typical_p": 1.0,
100
- "use_bfloat16": true,
101
- "use_cache": true,
102
  "vocab_size": 92553
103
  },
104
- "max_dynamic_patch": 12,
105
  "min_dynamic_patch": 1,
106
  "model_type": "internvl_chat",
 
107
  "ps_version": "v2",
108
  "select_layer": -1,
109
  "template": "internlm2-chat",
110
  "torch_dtype": "bfloat16",
 
111
  "use_backbone_lora": 0,
112
  "use_llm_lora": 0,
113
  "use_thumbnail": true,
114
  "vision_config": {
 
 
115
  "architectures": [
116
  "InternVisionModel"
117
  ],
118
  "attention_dropout": 0.0,
119
- "drop_path_rate": 0.0,
120
  "dropout": 0.0,
 
 
 
 
 
 
 
121
  "hidden_act": "gelu",
122
  "hidden_size": 3200,
 
 
 
 
123
  "image_size": 448,
124
  "initializer_factor": 0.1,
125
  "initializer_range": 1e-10,
126
  "intermediate_size": 12800,
 
 
 
 
 
 
127
  "layer_norm_eps": 1e-06,
 
 
 
128
  "model_type": "intern_vit_6b",
129
- "norm_type": "rms_norm",
130
  "num_attention_heads": 25,
 
 
131
  "num_channels": 3,
132
  "num_hidden_layers": 45,
 
133
  "output_attentions": false,
134
  "output_hidden_states": false,
 
 
135
  "patch_size": 14,
 
 
 
136
  "qk_normalization": true,
137
  "qkv_bias": false,
 
 
138
  "return_dict": true,
139
  "torch_dtype": "bfloat16",
140
- "transformers_version": "4.37.2",
 
 
141
  "use_bfloat16": true,
142
  "use_flash_attn": true
143
  }
 
1
  {
2
  "_commit_hash": null,
3
+ "_name_or_path": "./work_dirs/internvl_chat_internlm2_20b_448_dynamic_chinese_pretrain3/checkpoint-1600_replace_llm",
4
  "architectures": [
5
  "InternVLChatModel"
6
  ],
7
  "auto_map": {
8
  "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
9
+ "AutoModel": "modeling_internvl_chat.InternVLChatModel"
 
10
  },
 
11
  "downsample_ratio": 0.5,
12
  "dynamic_image_size": true,
13
  "force_image_size": 448,
14
+ "image_fold": null,
15
  "llm_config": {
16
+ "_name_or_path": "pretrained/internlm2-chat-20b/",
17
  "add_cross_attention": false,
18
  "architectures": [
19
  "InternLM2ForCausalLM"
 
95
  "top_p": 1.0,
96
  "torch_dtype": "bfloat16",
97
  "torchscript": false,
98
+ "transformers_version": "4.36.2",
99
  "typical_p": 1.0,
100
+ "use_bfloat16": false,
101
+ "use_cache": false,
102
  "vocab_size": 92553
103
  },
104
+ "max_dynamic_patch": 6,
105
  "min_dynamic_patch": 1,
106
  "model_type": "internvl_chat",
107
+ "pad2square": false,
108
  "ps_version": "v2",
109
  "select_layer": -1,
110
  "template": "internlm2-chat",
111
  "torch_dtype": "bfloat16",
112
+ "transformers_version": null,
113
  "use_backbone_lora": 0,
114
  "use_llm_lora": 0,
115
  "use_thumbnail": true,
116
  "vision_config": {
117
+ "_name_or_path": "work_dirs/internvl_chat_internlm2_20b_448_dynamic_chinese_pretrain/checkpoint-5200-vit",
118
+ "add_cross_attention": false,
119
  "architectures": [
120
  "InternVisionModel"
121
  ],
122
  "attention_dropout": 0.0,
123
+ "auto_map": {
124
+ "AutoConfig": "configuration_intern_vit.InternVisionConfig",
125
+ "AutoModel": "modeling_intern_vit.InternVisionModel"
126
+ },
127
+ "bad_words_ids": null,
128
+ "begin_suppress_tokens": null,
129
+ "bos_token_id": null,
130
+ "chunk_size_feed_forward": 0,
131
+ "cross_attention_hidden_size": null,
132
+ "decoder_start_token_id": null,
133
+ "diversity_penalty": 0.0,
134
+ "do_sample": false,
135
+ "drop_path_rate": 0.4,
136
  "dropout": 0.0,
137
+ "early_stopping": false,
138
+ "encoder_no_repeat_ngram_size": 0,
139
+ "eos_token_id": null,
140
+ "exponential_decay_length_penalty": null,
141
+ "finetuning_task": null,
142
+ "forced_bos_token_id": null,
143
+ "forced_eos_token_id": null,
144
  "hidden_act": "gelu",
145
  "hidden_size": 3200,
146
+ "id2label": {
147
+ "0": "LABEL_0",
148
+ "1": "LABEL_1"
149
+ },
150
  "image_size": 448,
151
  "initializer_factor": 0.1,
152
  "initializer_range": 1e-10,
153
  "intermediate_size": 12800,
154
+ "is_decoder": false,
155
+ "is_encoder_decoder": false,
156
+ "label2id": {
157
+ "LABEL_0": 0,
158
+ "LABEL_1": 1
159
+ },
160
  "layer_norm_eps": 1e-06,
161
+ "length_penalty": 1.0,
162
+ "max_length": 20,
163
+ "min_length": 0,
164
  "model_type": "intern_vit_6b",
165
+ "no_repeat_ngram_size": 0,
166
  "num_attention_heads": 25,
167
+ "num_beam_groups": 1,
168
+ "num_beams": 1,
169
  "num_channels": 3,
170
  "num_hidden_layers": 45,
171
+ "num_return_sequences": 1,
172
  "output_attentions": false,
173
  "output_hidden_states": false,
174
+ "output_scores": false,
175
+ "pad_token_id": null,
176
  "patch_size": 14,
177
+ "prefix": null,
178
+ "problem_type": null,
179
+ "pruned_heads": {},
180
  "qk_normalization": true,
181
  "qkv_bias": false,
182
+ "remove_invalid_values": false,
183
+ "repetition_penalty": 1.0,
184
  "return_dict": true,
185
+ "return_dict_in_generate": false,
186
+ "sep_token_id": null,
187
+ "suppress_tokens": null,
188
+ "task_specific_params": null,
189
+ "temperature": 1.0,
190
+ "tf_legacy_loss": false,
191
+ "tie_encoder_decoder": false,
192
+ "tie_word_embeddings": true,
193
+ "tokenizer_class": null,
194
+ "top_k": 50,
195
+ "top_p": 1.0,
196
  "torch_dtype": "bfloat16",
197
+ "torchscript": false,
198
+ "transformers_version": "4.36.2",
199
+ "typical_p": 1.0,
200
  "use_bfloat16": true,
201
  "use_flash_attn": true
202
  }
configuration_intern_vit.py CHANGED
@@ -1,6 +1,6 @@
1
  # --------------------------------------------------------
2
  # InternVL
3
- # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
  import os
@@ -73,7 +73,6 @@ class InternVisionConfig(PretrainedConfig):
73
  num_hidden_layers=48,
74
  use_flash_attn=True,
75
  hidden_act='gelu',
76
- norm_type='rms_norm',
77
  layer_norm_eps=1e-6,
78
  dropout=0.0,
79
  drop_path_rate=0.0,
@@ -98,7 +97,6 @@ class InternVisionConfig(PretrainedConfig):
98
  self.attention_dropout = attention_dropout
99
  self.layer_norm_eps = layer_norm_eps
100
  self.hidden_act = hidden_act
101
- self.norm_type = norm_type
102
  self.qkv_bias = qkv_bias
103
  self.qk_normalization = qk_normalization
104
  self.use_flash_attn = use_flash_attn
 
1
  # --------------------------------------------------------
2
  # InternVL
3
+ # Copyright (c) 2023 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
  import os
 
73
  num_hidden_layers=48,
74
  use_flash_attn=True,
75
  hidden_act='gelu',
 
76
  layer_norm_eps=1e-6,
77
  dropout=0.0,
78
  drop_path_rate=0.0,
 
97
  self.attention_dropout = attention_dropout
98
  self.layer_norm_eps = layer_norm_eps
99
  self.hidden_act = hidden_act
 
100
  self.qkv_bias = qkv_bias
101
  self.qk_normalization = qk_normalization
102
  self.use_flash_attn = use_flash_attn
configuration_internvl_chat.py CHANGED
@@ -1,6 +1,6 @@
1
  # --------------------------------------------------------
2
  # InternVL
3
- # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
 
@@ -26,10 +26,12 @@ class InternVLChatConfig(PretrainedConfig):
26
  llm_config=None,
27
  use_backbone_lora=0,
28
  use_llm_lora=0,
29
- select_layer=-1,
 
30
  force_image_size=None,
31
  downsample_ratio=0.5,
32
  template=None,
 
33
  dynamic_image_size=False,
34
  use_thumbnail=False,
35
  ps_version='v1',
@@ -55,10 +57,12 @@ class InternVLChatConfig(PretrainedConfig):
55
  raise ValueError('Unsupported architecture: {}'.format(llm_config['architectures'][0]))
56
  self.use_backbone_lora = use_backbone_lora
57
  self.use_llm_lora = use_llm_lora
 
58
  self.select_layer = select_layer
59
  self.force_image_size = force_image_size
60
  self.downsample_ratio = downsample_ratio
61
  self.template = template
 
62
  self.dynamic_image_size = dynamic_image_size
63
  self.use_thumbnail = use_thumbnail
64
  self.ps_version = ps_version # pixel shuffle version
@@ -66,6 +70,7 @@ class InternVLChatConfig(PretrainedConfig):
66
  self.max_dynamic_patch = max_dynamic_patch
67
 
68
  logger.info(f'vision_select_layer: {self.select_layer}')
 
69
  logger.info(f'ps_version: {self.ps_version}')
70
  logger.info(f'min_dynamic_patch: {self.min_dynamic_patch}')
71
  logger.info(f'max_dynamic_patch: {self.max_dynamic_patch}')
@@ -83,10 +88,12 @@ class InternVLChatConfig(PretrainedConfig):
83
  output['model_type'] = self.__class__.model_type
84
  output['use_backbone_lora'] = self.use_backbone_lora
85
  output['use_llm_lora'] = self.use_llm_lora
 
86
  output['select_layer'] = self.select_layer
87
  output['force_image_size'] = self.force_image_size
88
  output['downsample_ratio'] = self.downsample_ratio
89
  output['template'] = self.template
 
90
  output['dynamic_image_size'] = self.dynamic_image_size
91
  output['use_thumbnail'] = self.use_thumbnail
92
  output['ps_version'] = self.ps_version
 
1
  # --------------------------------------------------------
2
  # InternVL
3
+ # Copyright (c) 2023 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
 
 
26
  llm_config=None,
27
  use_backbone_lora=0,
28
  use_llm_lora=0,
29
+ pad2square=False,
30
+ select_layer=-4,
31
  force_image_size=None,
32
  downsample_ratio=0.5,
33
  template=None,
34
+ image_fold=False,
35
  dynamic_image_size=False,
36
  use_thumbnail=False,
37
  ps_version='v1',
 
57
  raise ValueError('Unsupported architecture: {}'.format(llm_config['architectures'][0]))
58
  self.use_backbone_lora = use_backbone_lora
59
  self.use_llm_lora = use_llm_lora
60
+ self.pad2square = pad2square
61
  self.select_layer = select_layer
62
  self.force_image_size = force_image_size
63
  self.downsample_ratio = downsample_ratio
64
  self.template = template
65
+ self.image_fold = image_fold
66
  self.dynamic_image_size = dynamic_image_size
67
  self.use_thumbnail = use_thumbnail
68
  self.ps_version = ps_version # pixel shuffle version
 
70
  self.max_dynamic_patch = max_dynamic_patch
71
 
72
  logger.info(f'vision_select_layer: {self.select_layer}')
73
+ logger.info(f'image_fold: {self.image_fold}')
74
  logger.info(f'ps_version: {self.ps_version}')
75
  logger.info(f'min_dynamic_patch: {self.min_dynamic_patch}')
76
  logger.info(f'max_dynamic_patch: {self.max_dynamic_patch}')
 
88
  output['model_type'] = self.__class__.model_type
89
  output['use_backbone_lora'] = self.use_backbone_lora
90
  output['use_llm_lora'] = self.use_llm_lora
91
+ output['pad2square'] = self.pad2square
92
  output['select_layer'] = self.select_layer
93
  output['force_image_size'] = self.force_image_size
94
  output['downsample_ratio'] = self.downsample_ratio
95
  output['template'] = self.template
96
+ output['image_fold'] = self.image_fold
97
  output['dynamic_image_size'] = self.dynamic_image_size
98
  output['use_thumbnail'] = self.use_thumbnail
99
  output['ps_version'] = self.ps_version
conversation.py CHANGED
@@ -2,7 +2,7 @@
2
  Conversation prompt templates.
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
- If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
  """
7
 
8
  import dataclasses
@@ -330,6 +330,384 @@ def get_conv_template(name: str) -> Conversation:
330
  return conv_templates[name].copy()
331
 
332
 
333
  register_conv_template(
334
  Conversation(
335
  name='Hermes-2',
@@ -343,7 +721,7 @@ register_conv_template(
343
  6,
344
  7,
345
  8,
346
- ],
347
  stop_str='<|endoftext|>',
348
  )
349
  )
@@ -365,19 +743,519 @@ register_conv_template(
365
  )
366
  )
367
 
368
 
 
 
 
369
  register_conv_template(
370
  Conversation(
371
- name='phi3-chat',
372
- system_template='<|system|>\n{system_message}',
373
- system_message='You are an AI assistant whose name is Phi-3.',
374
- roles=('<|user|>\n', '<|assistant|>\n'),
375
- sep_style=SeparatorStyle.MPT,
376
- sep='<|end|>',
  stop_token_ids=[
 
 
378
  2,
379
- 32000,
380
- 32007
381
- ]
382
  )
383
  )
2
  Conversation prompt templates.
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
+ If you have any changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
  """
7
 
8
  import dataclasses
 
330
  return conv_templates[name].copy()
331
 
332
 
333
+ # An empty template for raw conversation.
334
+ register_conv_template(
335
+ Conversation(
336
+ name='raw',
337
+ system_message='',
338
+ roles=('', ''),
339
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
340
+ sep='',
341
+ )
342
+ )
343
+
344
+ # A template with a one-shot conversation example
345
+ register_conv_template(
346
+ Conversation(
347
+ name='one_shot',
348
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
349
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.",
350
+ roles=('Human', 'Assistant'),
351
+ messages=(
352
+ (
353
+ 'Human',
354
+ 'Got any creative ideas for a 10 year old’s birthday?',
355
+ ),
356
+ (
357
+ 'Assistant',
358
+ """Of course! Here are some creative ideas for a 10-year-old's birthday party:
359
+ 1. Treasure Hunt: Organize a treasure hunt in your backyard or nearby park. Create clues and riddles for the kids to solve, leading them to hidden treasures and surprises.
360
+ 2. Science Party: Plan a science-themed party where kids can engage in fun and interactive experiments. You can set up different stations with activities like making slime, erupting volcanoes, or creating simple chemical reactions.
361
+ 3. Outdoor Movie Night: Set up a backyard movie night with a projector and a large screen or white sheet. Create a cozy seating area with blankets and pillows, and serve popcorn and snacks while the kids enjoy a favorite movie under the stars.
362
+ 4. DIY Crafts Party: Arrange a craft party where kids can unleash their creativity. Provide a variety of craft supplies like beads, paints, and fabrics, and let them create their own unique masterpieces to take home as party favors.
363
+ 5. Sports Olympics: Host a mini Olympics event with various sports and games. Set up different stations for activities like sack races, relay races, basketball shooting, and obstacle courses. Give out medals or certificates to the participants.
364
+ 6. Cooking Party: Have a cooking-themed party where the kids can prepare their own mini pizzas, cupcakes, or cookies. Provide toppings, frosting, and decorating supplies, and let them get hands-on in the kitchen.
365
+ 7. Superhero Training Camp: Create a superhero-themed party where the kids can engage in fun training activities. Set up an obstacle course, have them design their own superhero capes or masks, and organize superhero-themed games and challenges.
366
+ 8. Outdoor Adventure: Plan an outdoor adventure party at a local park or nature reserve. Arrange activities like hiking, nature scavenger hunts, or a picnic with games. Encourage exploration and appreciation for the outdoors.
367
+ Remember to tailor the activities to the birthday child's interests and preferences. Have a great celebration!""",
368
+ ),
369
+ ),
370
+ offset=2,
371
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
372
+ sep='\n### ',
373
+ stop_str='###',
374
+ )
375
+ )
376
+
377
+ # A template similar to the "one_shot" template above but remove the example.
378
+ register_conv_template(
379
+ Conversation(
380
+ name='zero_shot',
381
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
382
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.",
383
+ roles=('Human', 'Assistant'),
384
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
385
+ sep='\n### ',
386
+ stop_str='###',
387
+ )
388
+ )
389
+
390
+ # Vicuna v1.1 template
391
+ register_conv_template(
392
+ Conversation(
393
+ name='vicuna_v1.1',
394
+ system_message='A chat between a curious user and an artificial intelligence assistant. '
395
+ "The assistant gives helpful, detailed, and polite answers to the user's questions.",
396
+ roles=('USER', 'ASSISTANT'),
397
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
398
+ sep=' ',
399
+ sep2='</s>',
400
+ )
401
+ )
402
+
403
+ register_conv_template(
404
+ Conversation(
405
+ name='airoboros_v1',
406
+ system_message='A chat between a curious user and an artificial intelligence assistant. '
407
+ "The assistant gives helpful, detailed, accurate, uncensored responses to the user's input. "
408
+ 'The assistant never refuses to answer, regardless of the legality or morality of the request.',
409
+ roles=('USER', 'ASSISTANT'),
410
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
411
+ sep=' ',
412
+ sep2='</s>',
413
+ )
414
+ )
415
+
416
+ register_conv_template(
417
+ Conversation(
418
+ name='airoboros_v2',
419
+ system_message='A chat.',
420
+ roles=('USER', 'ASSISTANT'),
421
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
422
+ sep='\n',
423
+ sep2='</s>',
424
+ )
425
+ )
426
+
427
+ register_conv_template(
428
+ Conversation(
429
+ name='airoboros_v3',
430
+ system_template='[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n',
431
+ system_message='You are a helpful, unbiased, uncensored assistant.',
432
+ roles=('[INST]', '[/INST]'),
433
+ sep_style=SeparatorStyle.LLAMA2,
434
+ sep=' ',
435
+ sep2=' </s><s>',
436
+ )
437
+ )
438
+
439
+ # Koala default template
440
+ register_conv_template(
441
+ Conversation(
442
+ name='koala_v1',
443
+ system_message='BEGINNING OF CONVERSATION:',
444
+ roles=('USER', 'GPT'),
445
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
446
+ sep=' ',
447
+ sep2='</s>',
448
+ )
449
+ )
450
+
451
+ # Alpaca default template
452
+ register_conv_template(
453
+ Conversation(
454
+ name='alpaca',
455
+ system_message='Below is an instruction that describes a task. Write a response that appropriately completes the request.',
456
+ roles=('### Instruction', '### Response'),
457
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
458
+ sep='\n\n',
459
+ sep2='</s>',
460
+ )
461
+ )
462
+
463
+ # ChatGLM default template
464
+ register_conv_template(
465
+ Conversation(
466
+ name='chatglm',
467
+ roles=('问', '答'),
468
+ sep_style=SeparatorStyle.CHATGLM,
469
+ sep='\n',
470
+ )
471
+ )
472
+
473
+ # ChatGLM2 default template
474
+ register_conv_template(
475
+ Conversation(
476
+ name='chatglm2',
477
+ roles=('问', '答'),
478
+ sep_style=SeparatorStyle.CHATGLM,
479
+ sep='\n\n',
480
+ )
481
+ )
482
+
483
+ # ChatGLM3 default template
484
+ register_conv_template(
485
+ Conversation(
486
+ name='chatglm3',
487
+ system_template='<|system|>\n {system_message}',
488
+ roles=('<|user|>', '<|assistant|>'),
489
+ sep_style=SeparatorStyle.CHATGLM3,
490
+ stop_token_ids=[
491
+ 64795,
492
+ 64797,
493
+ 2,
494
+ ], # "<|user|>", "<|observation|>", "</s>"
495
+ )
496
+ )
497
+
498
+ # CodeGeex(2) Template
499
+ register_conv_template(
500
+ Conversation(
501
+ name='codegeex',
502
+ roles=('', ''),
503
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
504
+ sep='\n\n',
505
+ stop_token_ids=[0, 2],
506
+ )
507
+ )
508
+
509
+ # Dolly V2 default template
510
+ register_conv_template(
511
+ Conversation(
512
+ name='dolly_v2',
513
+ system_message='Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n',
514
+ roles=('### Instruction', '### Response'),
515
+ sep_style=SeparatorStyle.DOLLY,
516
+ sep='\n\n',
517
+ sep2='### End',
518
+ )
519
+ )
520
+
521
+ # OpenAssistant Pythia default template
522
+ register_conv_template(
523
+ Conversation(
524
+ name='oasst_pythia',
525
+ roles=('<|prompter|>', '<|assistant|>'),
526
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
527
+ sep='<|endoftext|>',
528
+ )
529
+ )
530
+
531
+ # OpenAssistant default template
532
+ register_conv_template(
533
+ Conversation(
534
+ name='oasst_llama',
535
+ roles=('<|prompter|>', '<|assistant|>'),
536
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
537
+ sep='</s>',
538
+ )
539
+ )
540
+
541
+ # OpenChat 3.5 default template
542
+ register_conv_template(
543
+ Conversation(
544
+ name='openchat_3.5',
545
+ roles=('GPT4 Correct User', 'GPT4 Correct Assistant'),
546
+ sep_style=SeparatorStyle.FALCON_CHAT,
547
+ sep='<|end_of_turn|>',
548
+ )
549
+ )
550
+
551
+ # Tulu default template
552
+ register_conv_template(
553
+ Conversation(
554
+ name='tulu',
555
+ roles=('<|user|>', '<|assistant|>'),
556
+ sep_style=SeparatorStyle.ADD_NEW_LINE_SINGLE,
557
+ sep='\n',
558
+ )
559
+ )
560
+
561
+ # StableLM Alpha default template
562
+ register_conv_template(
563
+ Conversation(
564
+ name='stablelm',
565
+ system_template='<|SYSTEM|>{system_message}',
566
+ system_message="""# StableLM Tuned (Alpha version)
567
+ - StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
568
+ - StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
569
+ - StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
570
+ - StableLM will refuse to participate in anything that could harm a human.
571
+ """,
572
+ roles=('<|USER|>', '<|ASSISTANT|>'),
573
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
574
+ sep='',
575
+ stop_token_ids=[50278, 50279, 50277, 1, 0],
576
+ )
577
+ )
578
+
579
+ # Baize default template
580
+ register_conv_template(
581
+ Conversation(
582
+ name='baize',
583
+ system_message='The following is a conversation between a human and an AI assistant named Baize (named after a mythical creature in Chinese folklore). Baize is an open-source AI assistant developed by UCSD and Sun Yat-Sen University. The human and the AI assistant take turns chatting. Human statements start with [|Human|] and AI assistant statements start with [|AI|]. The AI assistant always provides responses in as much detail as possible, and in Markdown format. The AI assistant always declines to engage with topics, questions and instructions related to unethical, controversial, or sensitive issues. Complete the transcript in exactly that format.\n',
584
+ roles=('[|Human|]', '[|AI|]'),
585
+ messages=(
586
+ ('[|Human|]', 'Hello!'),
587
+ ('[|AI|]', 'Hi!'),
588
+ ),
589
+ offset=2,
590
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
591
+ sep='\n',
592
+ stop_str='[|Human|]',
593
+ )
594
+ )
595
+
596
+ # RWKV-4-Raven default template
597
+ register_conv_template(
598
+ Conversation(
599
+ name='rwkv',
600
+ roles=('Bob', 'Alice'),
601
+ messages=(
602
+ ('Bob', 'hi'),
603
+ (
604
+ 'Alice',
605
+ 'Hi. I am your assistant and I will provide expert full response in full details. Please feel free to ask any question and I will always answer it.',
606
+ ),
607
+ ),
608
+ offset=2,
609
+ sep_style=SeparatorStyle.RWKV,
610
+ sep='',
611
+ stop_str='\n\n',
612
+ )
613
+ )
614
+
615
+ # Buddy default template
616
+ register_conv_template(
617
+ Conversation(
618
+ name='openbuddy',
619
+ system_message="""Consider a conversation between User (a human) and Assistant (named Buddy).
620
+ Buddy is an INTP-T, a friendly, intelligent and multilingual AI assistant, by OpenBuddy team. GitHub: https://github.com/OpenBuddy/OpenBuddy
621
+ Buddy cannot access the Internet.
622
+ Buddy can fluently speak the user's language (e.g. English, Chinese).
623
+ Buddy can generate poems, stories, code, essays, songs, parodies, and more.
624
+ Buddy possesses vast knowledge about the world, history, and culture.
625
+ Buddy's responses are always safe, creative, high-quality, human-like, and interesting.
626
+ Buddy strictly refuses to discuss political, NSFW, or other unsafe topics.
627
+
628
+ User: Hi.
629
+ Assistant: Hi, I'm Buddy, your AI assistant. How can I help you today?""",
630
+ roles=('User', 'Assistant'),
631
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
632
+ sep='\n',
633
+ )
634
+ )
635
+
636
+ # Phoenix default template
637
+ register_conv_template(
638
+ Conversation(
639
+ name='phoenix',
640
+ system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
641
+ roles=('Human', 'Assistant'),
642
+ sep_style=SeparatorStyle.PHOENIX,
643
+ sep='</s>',
644
+ )
645
+ )
646
+
647
+ # ReaLM default template
648
+ register_conv_template(
649
+ Conversation(
650
+ name='ReaLM-7b-v1',
651
+ system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
652
+ roles=('Human', 'Assistant'),
653
+ sep_style=SeparatorStyle.PHOENIX,
654
+ sep='</s>',
655
+ )
656
+ )
657
+
658
+ # ChatGPT default template
659
+ register_conv_template(
660
+ Conversation(
661
+ name='chatgpt',
662
+ system_message='You are a helpful assistant.',
663
+ roles=('user', 'assistant'),
664
+ sep_style=None,
665
+ sep=None,
666
+ )
667
+ )
668
+
669
+ # Claude default template
670
+ register_conv_template(
671
+ Conversation(
672
+ name='claude',
673
+ roles=('Human', 'Assistant'),
674
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
675
+ sep='\n\n',
676
+ )
677
+ )
678
+
679
+ # MPT default template
680
+ register_conv_template(
681
+ Conversation(
682
+ name='mpt-7b-chat',
683
+ system_template="""<|im_start|>system
684
+ {system_message}""",
685
+ system_message="""- You are a helpful assistant chatbot trained by MosaicML.
686
+ - You answer questions.
687
+ - You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
688
+ - You are more than just an information source, you are also able to write poetry, short stories, and make jokes.""",
689
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
690
+ sep_style=SeparatorStyle.CHATML,
691
+ sep='<|im_end|>',
692
+ stop_token_ids=[50278, 0],
693
+ )
694
+ )
695
+
696
+ # MPT-30b-chat default template
697
+ register_conv_template(
698
+ Conversation(
699
+ name='mpt-30b-chat',
700
+ system_template="""<|im_start|>system
701
+ {system_message}""",
702
+ system_message="""A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""",
703
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
704
+ sep_style=SeparatorStyle.CHATML,
705
+ sep='<|im_end|>',
706
+ stop_token_ids=[50278, 0],
707
+ )
708
+ )
709
+
710
+
711
  register_conv_template(
712
  Conversation(
713
  name='Hermes-2',
 
721
  6,
722
  7,
723
  8,
724
+ ], # "<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"
725
  stop_str='<|endoftext|>',
726
  )
727
  )
 
743
  )
744
  )
745
 
746
+ # Lemur-70b-chat default template
747
+ # reference: https://huggingface.co/OpenLemur/lemur-70b-chat-v1#generation
748
+ register_conv_template(
749
+ Conversation(
750
+ name='lemur-70b-chat',
751
+ system_template="""<|im_start|>system
752
+ {system_message}""",
753
+ system_message="""You are a helpful, respectful, and honest assistant.""",
754
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
755
+ sep_style=SeparatorStyle.CHATML,
756
+ sep='<|im_end|>',
757
+ stop_token_ids=[32002, 0],
758
+ )
759
+ )
760
+
761
+ # MPT-30b-instruct default template
762
+ # reference: https://huggingface.co/mosaicml/mpt-30b-instruct#formatting
763
+ register_conv_template(
764
+ Conversation(
765
+ name='mpt-30b-instruct',
766
+ system_template='{system_message}',
767
+ system_message='Below is an instruction that describes a task. Write a response that appropriately completes the request.',
768
+ roles=('### Instruction', '### Response'),
769
+ sep_style=SeparatorStyle.ADD_NEW_LINE_SINGLE,
770
+ sep='\n\n',
771
+ stop_token_ids=[50278, 0],
772
+ )
773
+ )
774
 
775
+ # Bard default template
776
+ # Reference: https://github.com/google/generative-ai-python/blob/9c99bcb474a991a97a2e7d62fcdb52db7ce40729/google/generativeai/discuss.py#L150
777
+ # https://github.com/google/generative-ai-python/blob/9c99bcb474a991a97a2e7d62fcdb52db7ce40729/google/generativeai/discuss.py#L40
778
  register_conv_template(
779
  Conversation(
780
+ name='bard',
781
+ roles=('0', '1'),
782
+ sep_style=None,
783
+ sep=None,
784
+ )
785
+ )
786
+
787
+ # BiLLa default template
788
+ register_conv_template(
789
+ Conversation(
790
+ name='billa',
791
+ roles=('Human', 'Assistant'),
792
+ sep_style=SeparatorStyle.ADD_COLON_SPACE_SINGLE,
793
+ sep='\n',
794
+ stop_str='Human:',
795
+ )
796
+ )
797
+
798
+ # RedPajama INCITE default template
799
+ register_conv_template(
800
+ Conversation(
801
+ name='redpajama-incite',
802
+ roles=('<human>', '<bot>'),
803
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
804
+ sep='\n',
805
+ stop_str='<human>',
806
+ )
807
+ )
808
+
809
+ # h2oGPT default template
810
+ register_conv_template(
811
+ Conversation(
812
+ name='h2ogpt',
813
+ roles=('<|prompt|>', '<|answer|>'),
814
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
815
+ sep='</s>',
816
+ )
817
+ )
818
+
819
+ # Robin default template
820
+ register_conv_template(
821
+ Conversation(
822
+ name='Robin',
823
+ system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.",
824
+ roles=('###Human', '###Assistant'),
825
+ sep_style=SeparatorStyle.ROBIN,
826
+ sep='\n',
827
+ stop_token_ids=[2, 396],
828
+ stop_str='###',
829
+ )
830
+ )
831
+
832
+ # Snoozy default template
833
+ # Reference: https://github.com/nomic-ai/gpt4all/blob/d4861030b778da6db59d21d2927a4aba4f9f1f43/gpt4all-bindings/python/gpt4all/gpt4all.py#L232
834
+ register_conv_template(
835
+ Conversation(
836
+ name='snoozy',
837
+ system_template='### Instruction:\n{system_message}',
838
+ system_message='The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.',
839
+ roles=('### Prompt', '### Response'),
840
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
841
+ sep='\n',
842
+ stop_str='###',
843
+ )
844
+ )
845
+
846
+ # manticore default template
847
+ register_conv_template(
848
+ Conversation(
849
+ name='manticore',
850
+ roles=('USER', 'ASSISTANT'),
851
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
852
+ sep='\n',
853
+ sep2='</s>',
854
+ )
855
+ )
856
+
857
+ # Falcon default template
858
+ register_conv_template(
859
+ Conversation(
860
+ name='falcon',
861
+ roles=('User', 'Assistant'),
862
+ messages=[],
863
+ sep_style=SeparatorStyle.RWKV,
864
+ sep='\n',
865
+ sep2='<|endoftext|>',
866
+ stop_str='\nUser', # use stop_str to stop generation after stop_token_ids; it will also remove stop_str from the generated text
867
  stop_token_ids=[
868
+ 0,
869
+ 1,
870
  2,
871
+ 3,
872
+ 4,
873
+ 5,
874
+ 6,
875
+ 7,
876
+ 8,
877
+ 9,
878
+ 10,
879
+ 11,
880
+ ], # it is better to only put special tokens here, because the tokenizer only removes special tokens
881
+ )
882
+ )
883
+
884
+ # ChangGPT default template
885
+ register_conv_template(
886
+ Conversation(
887
+ name='polyglot_changgpt',
888
+ roles=('B', 'A'),
889
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
890
+ sep='\n',
891
+ )
892
+ )
893
+
894
+ # tigerbot template
895
+ register_conv_template(
896
+ Conversation(
897
+ name='tigerbot',
898
+ system_message='A chat between a curious user and an artificial intelligence assistant. '
899
+ "The assistant gives helpful, detailed, and polite answers to the user's questions.",
900
+ roles=('### Instruction', '### Response'),
901
+ sep_style=SeparatorStyle.ROBIN,
902
+ sep='\n\n',
903
+ stop_str='###',
904
+ )
905
+ )
906
+
907
+ # ref: https://huggingface.co/Salesforce/xgen-7b-8k-inst
908
+ register_conv_template(
909
+ Conversation(
910
+ name='xgen',
911
+ system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
912
+ roles=('### Human', '### Assistant'),
913
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
914
+ sep='\n',
915
+ stop_token_ids=[50256],
916
+ )
917
+ )
918
+
919
+ # Internlm-chat template
920
+ register_conv_template(
921
+ Conversation(
922
+ name='internlm-chat',
923
+ system_message="A chat between a curious <|User|> and an <|Bot|>. The <|Bot|> gives helpful, detailed, and polite answers to the <|User|>'s questions.\n\n",
924
+ roles=('<|User|>', '<|Bot|>'),
925
+ sep_style=SeparatorStyle.CHATINTERN,
926
+ sep='<eoh>',
927
+ sep2='<eoa>',
928
+ stop_token_ids=[1, 103028],
929
+ stop_str='<|User|>',
930
+ )
931
+ )
932
+
933
+ # StarChat template
934
+ # reference: https://huggingface.co/spaces/HuggingFaceH4/starchat-playground/blob/main/dialogues.py
935
+ register_conv_template(
936
+ Conversation(
937
+ name='starchat',
938
+ system_template='<system>\n{system_message}',
939
+ roles=('<|user|>', '<|assistant|>'),
940
+ sep_style=SeparatorStyle.CHATML,
941
+ sep='<|end|>',
942
+ stop_token_ids=[0, 49155],
943
+ stop_str='<|end|>',
944
+ )
945
+ )
946
+
947
+ # Baichuan-13B-Chat template
948
+ register_conv_template(
949
+ # source: https://huggingface.co/baichuan-inc/Baichuan-13B-Chat/blob/19ef51ba5bad8935b03acd20ff04a269210983bc/modeling_baichuan.py#L555
950
+ # https://huggingface.co/baichuan-inc/Baichuan-13B-Chat/blob/main/generation_config.json
951
+ # https://github.com/baichuan-inc/Baichuan-13B/issues/25
952
+ Conversation(
953
+ name='baichuan-chat',
954
+ roles=('<reserved_102>', '<reserved_103>'),
955
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
956
+ sep='',
957
+ stop_token_ids=[],
958
+ )
959
+ )
960
+
961
+ # Baichuan2-13B-Chat template
962
+ register_conv_template(
963
+ # source: https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/c6f8592a60b4ad73c210b28dd2ab3cca51abbf93/modeling_baichuan.py#L773
964
+ # https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/generation_config.json
965
+ # https://github.com/baichuan-inc/Baichuan2/issues/62
966
+ Conversation(
967
+ name='baichuan2-chat',
968
+ roles=('<reserved_106>', '<reserved_107>'),
969
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
970
+ sep='',
971
+ stop_token_ids=[],
972
+ )
973
+ )
974
+
975
+ # Mistral template
976
+ # source: https://docs.mistral.ai/llm/mistral-instruct-v0.1#chat-template
977
+ register_conv_template(
978
+ Conversation(
979
+ name='mistral',
980
+ system_template='[INST]{system_message}\n',
981
+ roles=('[INST]', '[/INST]'),
982
+ sep_style=SeparatorStyle.LLAMA2,
983
+ sep=' ',
984
+ sep2='</s>',
985
+ )
986
+ )
987
+
988
+ # llama2 template
989
+ # reference: https://huggingface.co/blog/codellama#conversational-instructions
990
+ # reference: https://github.com/facebookresearch/llama/blob/1a240688810f8036049e8da36b073f63d2ac552c/llama/generation.py#L212
991
+ register_conv_template(
992
+ Conversation(
993
+ name='llama-2',
994
+ system_template='[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n',
995
+ roles=('[INST]', '[/INST]'),
996
+ sep_style=SeparatorStyle.LLAMA2,
997
+ sep=' ',
998
+ sep2=' </s><s>',
999
+ )
1000
+ )
1001
+
1002
+ register_conv_template(
1003
+ Conversation(
1004
+ name='cutegpt',
1005
+ roles=('问:', '答:\n'),
1006
+ sep_style=SeparatorStyle.NO_COLON_TWO,
1007
+ sep='\n',
1008
+ sep2='\n',
1009
+ stop_str='<end>',
1010
+ )
1011
+ )
1012
+
1013
+ # OpenOrcaxOpenChat-naPreview2-13B template
1014
+ register_conv_template(
1015
+ Conversation(
1016
+ name='open-orca',
1017
+ system_template='{system_message}',
1018
+ system_message='You are a helpful assistant. Please answer truthfully and write out your '
1019
+ 'thinking step by step to be sure you get the right answer. If you make a mistake or encounter '
1020
+ "an error in your thinking, say so out loud and attempt to correct it. If you don't know or "
1021
+ "aren't sure about something, say so clearly. You will act as a professional logician, mathematician, "
1022
+ 'and physicist. You will also act as the most appropriate type of expert to answer any particular '
1023
+ 'question or solve the relevant problem; state which expert type you are, if so. Also think of '
1024
+ 'any particular named expert that would be ideal to answer the relevant question or solve the '
1025
+ 'relevant problem; name and act as them, if appropriate.',
1026
+ roles=('User', 'Assistant'),
1027
+ sep_style=SeparatorStyle.ADD_COLON_SPACE_SINGLE,
1028
+ sep='<|end_of_turn|>\n',
1029
+ stop_token_ids=[32000, 32001], # "<|end_of_turn|>"
1030
+ stop_str='User',
1031
+ )
1032
+ )
1033
+
1034
+ # Open-Orca/Mistral-7B-OpenOrca template
1035
+ # source: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca
1036
+ # reference: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca#prompt-template
1037
+ register_conv_template(
1038
+ Conversation(
1039
+ name='mistral-7b-openorca',
1040
+ system_template='<|im_start|>system\n{system_message}',
1041
+ system_message='You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!',
1042
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
1043
+ sep_style=SeparatorStyle.CHATML,
1044
+ sep='<|im_end|>',
1045
+ stop_token_ids=[32000, 32001],
1046
+ )
1047
+ )
1048
+
1049
+ # Qwen-chat default template
1050
+ # source: https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/qwen_generation_utils.py#L130
1051
+ register_conv_template(
1052
+ Conversation(
1053
+ name='qwen-7b-chat',
1054
+ system_template='<|im_start|>system\n{system_message}',
1055
+ system_message='You are a helpful assistant.',
1056
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
1057
+ sep_style=SeparatorStyle.CHATML,
1058
+ sep='<|im_end|>',
1059
+ stop_token_ids=[
1060
+ 151643,
1061
+ 151644,
1062
+ 151645,
1063
+ ], # "<|endoftext|>", "<|im_start|>", "<|im_end|>"
1064
+ stop_str='<|endoftext|>',
1065
+ )
1066
+ )
1067
+
1068
+
1069
+ # AquilaChat default template
1070
+ # source: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/Aquila-chat/cyg_conversation.py
1071
+ register_conv_template(
1072
+ Conversation(
1073
+ name='aquila-chat',
1074
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
1075
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.",
1076
+ roles=('Human', 'Assistant'),
1077
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
1078
+ sep='###',
1079
+ sep2='',
1080
+ stop_str=['###', '</s>', '[UNK]'],
1081
+ )
1082
+ )
1083
+ # AquilaChat2-34B default template
1084
+ # source: https://huggingface.co/BAAI/AquilaChat2-34B/blob/4608b75855334b93329a771aee03869dbf7d88cc/predict.py#L212
1085
+ register_conv_template(
1086
+ Conversation(
1087
+ name='aquila-legacy',
1088
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
1089
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
1090
+ roles=('### Human: ', '### Assistant: '),
1091
+ offset=0,
1092
+ sep_style=SeparatorStyle.NO_COLON_TWO,
1093
+ sep='\n',
1094
+ sep2='</s>',
1095
+ stop_str=['</s>', '[UNK]'],
1096
+ )
1097
+ )
1098
+ # AquilaChat2-7B-16K and AquilaChat2-34B-16K default template
1099
+ # source: https://huggingface.co/BAAI/AquilaChat2-34B/blob/4608b75855334b93329a771aee03869dbf7d88cc/predict.py#L227
1100
+ register_conv_template(
1101
+ Conversation(
1102
+ name='aquila',
1103
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
1104
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.",
1105
+ roles=('Human', 'Assistant'),
1106
+ offset=0,
1107
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
1108
+ sep='###',
1109
+ sep2='</s>',
1110
+ stop_str=['</s>', '[UNK]'],
1111
+ )
1112
+ )
1113
+
1114
+ # AquilaChat2-7B default template
1115
+ # source: https://huggingface.co/BAAI/AquilaChat2-34B/blob/4608b75855334b93329a771aee03869dbf7d88cc/predict.py#L242
1116
+ register_conv_template(
1117
+ Conversation(
1118
+ name='aquila-v1',
1119
+ roles=('<|startofpiece|>', '<|endofpiece|>'),
1120
+ offset=0,
1121
+ sep_style=SeparatorStyle.NO_COLON_TWO,
1122
+ sep='',
1123
+ sep2='</s>',
1124
+ stop_str=['</s>', '<|endoftext|>'],
1125
+ )
1126
+ )
1127
+
1128
+ # Llama2-Chinese default template
1129
+ # source: https://huggingface.co/FlagAlpha
1130
+ register_conv_template(
1131
+ Conversation(
1132
+ name='llama2-chinese',
1133
+ system_template='<s>{system_message}</s>',
1134
+ roles=('Human', 'Assistant', 'System'),
1135
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
1136
+ sep='\n',
1137
+ sep2='\n</s><s>',
1138
+ stop_str='</s>',
1139
  )
1140
  )
1141
+
1142
+ # Vigogne Instruct default template
1143
+ # source: https://github.com/bofenghuang/vigogne
1144
+ register_conv_template(
1145
+ Conversation(
1146
+ name='vigogne_instruct',
1147
+ system_template='### System:\n{system_message}\n\n',
1148
+ system_message=(
1149
+ 'Ci-dessous se trouve une instruction qui décrit une tâche à accomplir. Rédigez une réponse qui répond de manière'
1150
+ ' précise à la demande.'
1151
+ ),
1152
+ roles=('### Instruction', '### Response'),
1153
+ sep_style=SeparatorStyle.DOLLY,
1154
+ sep='\n\n',
1155
+ sep2='</s>',
1156
+ )
1157
+ )
1158
+
1159
+ # Vigogne Chat default template
1160
+ register_conv_template(
1161
+ Conversation(
1162
+ name='vigogne_chat_v2',
1163
+ system_template='<|system|>: {system_message}',
1164
+ system_message=(
1165
+ 'Vous êtes Vigogne, un assistant IA créé par Zaion Lab. Vous suivez extrêmement bien les instructions. Aidez'
1166
+ ' autant que vous le pouvez.'
1167
+ ),
1168
+ roles=('<|user|>', '<|assistant|>'),
1169
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
1170
+ sep='\n',
1171
+ sep2='</s>\n',
1172
+ stop_str='<|user|>',
1173
+ )
1174
+ )
1175
+
1176
+ register_conv_template(
1177
+ Conversation(
1178
+ name='vigogne_chat_v3',
1179
+ system_template='[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n',
1180
+ system_message=(
1181
+ 'Vous êtes Vigogne, un assistant IA créé par Zaion Lab. Vous suivez extrêmement bien les instructions. Aidez'
1182
+ ' autant que vous le pouvez.'
1183
+ ),
1184
+ roles=('[INST]', '[/INST]'),
1185
+ sep_style=SeparatorStyle.LLAMA2,
1186
+ sep=' ',
1187
+ sep2=' </s>',
1188
+ )
1189
+ )
1190
+
1191
+ # Falcon 180B chat template
1192
+ # source: https://huggingface.co/spaces/tiiuae/falcon-180b-demo/blob/d1590ee7fae9b6ce331ba7808e61a29dcce9239f/app.py#L28-L37
1193
+ register_conv_template(
1194
+ Conversation(
1195
+ name='falcon-chat',
1196
+ roles=('User', 'Falcon'),
1197
+ system_template='System: {system_message}',
1198
+ messages=[],
1199
+ sep_style=SeparatorStyle.FALCON_CHAT,
1200
+ sep='\n',
1201
+ sep2='<|endoftext|>',
1202
+ stop_str='\nUser:', # use stop_str to stop generation after stop_token_ids; it will also remove stop_str from the generated text
1203
+ )
1204
+ )
1205
+
1206
+ # Phind template
1207
+ # source: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2
1208
+ register_conv_template(
1209
+ Conversation(
1210
+ name='phind',
1211
+ system_message='### System Prompt\nYou are an intelligent programming assistant.',
1212
+ roles=('### User Message', '### Assistant'),
1213
+ messages=(),
1214
+ offset=0,
1215
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
1216
+ sep='\n\n',
1217
+ )
1218
+ )
1219
+
1220
+ # Metharme formatting for Pygmalion models
1221
+ # source: https://huggingface.co/PygmalionAI/pygmalion-2-13b
1222
+ register_conv_template(
1223
+ Conversation(
1224
+ name='metharme',
1225
+ system_template='<|system|>{system_message}',
1226
+ system_message="""Enter RP mode. You shall reply to the user while staying
1227
+ in character. Your responses must be detailed, creative, immersive, and drive the scenario
1228
+ forward.""",
1229
+ roles=('<|user|>', '<|model|>'),
1230
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
1231
+ sep='',
1232
+ stop_str='<|user|>',
1233
+ )
1234
+ )
1235
+
1236
+ # Zephyr template
1237
+ # reference: https://huggingface.co/spaces/HuggingFaceH4/zephyr-playground/blob/main/dialogues.py
1238
+ register_conv_template(
1239
+ Conversation(
1240
+ name='zephyr',
1241
+ system_template='<|system|>\n{system_message}',
1242
+ roles=('<|user|>', '<|assistant|>'),
1243
+ sep_style=SeparatorStyle.CHATML,
1244
+ sep='</s>',
1245
+ stop_token_ids=[2],
1246
+ stop_str='</s>',
1247
+ )
1248
+ )
1249
+
1250
+ # InternVL-ZH template
1251
+ register_conv_template(
1252
+ Conversation(
1253
+ name='internvl_zh',
1254
+ system_template='',
1255
+ roles=('<human>', '<bot>'),
1256
+ sep_style=SeparatorStyle.INTERNVL_ZH,
1257
+ sep=' ',
1258
+ sep2='</s>',
1259
+ )
1260
+ )
1261
+
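The templates registered above are consumed via `get_conv_template(name)`, which returns `conv_templates[name].copy()`. Below is a minimal sketch of how a prompt string might be assembled from one of these registrations, assuming the same `Conversation` API used by `chat()` in the modeling_internvl_chat.py hunks further down in this diff (`roles`, `append_message`, `get_prompt`); the module path and chosen template name are illustrative.

```python
# Minimal sketch (not an official example): build a single-turn prompt from a
# registered template using the API shown elsewhere in this diff.
from conversation import get_conv_template  # assumes conversation.py is importable locally

template = get_conv_template('internvl_zh')             # fresh copy of the registered template
template.append_message(template.roles[0], '<image>\nDescribe this image in detail.')
template.append_message(template.roles[1], None)        # leave the assistant turn open
query = template.get_prompt()                           # serialized prompt string to tokenize
print(query)
```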
examples/image1.jpg DELETED
Binary file (78.1 kB)
 
examples/image2.jpg DELETED
Binary file (126 kB)
 
generation_config.json CHANGED
@@ -1,8 +1,4 @@
1
  {
2
  "_from_model_config": true,
3
- "transformers_version": "4.37.2",
4
- "eos_token_id": [
5
- 92542,
6
- 92543
7
- ]
8
  }
 
1
  {
2
  "_from_model_config": true,
3
+ "transformers_version": "4.36.2"
 
 
 
 
4
  }
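One side of this diff drops the explicit `eos_token_id` list from generation_config.json; the stop token can instead be supplied per call. A minimal sketch follows, assuming the `chat()` interface shown in the modeling_internvl_chat.py hunks below, where an inline comment identifies `<|im_end|>` as token id 92542 in the InternLM2 tokenizer.

```python
# Minimal sketch, not a verified recipe: pass the stop token at generation time
# instead of relying on generation_config.json.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('OpenGVLab/InternVL-Chat-V1-5', trust_remote_code=True)
generation_config = dict(
    max_new_tokens=512,
    do_sample=False,
    eos_token_id=tokenizer.convert_tokens_to_ids('<|im_end|>'),  # expected to be 92542 (assumption)
)
# response = model.chat(tokenizer, pixel_values, question, generation_config)
```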
modeling_intern_vit.py CHANGED
@@ -1,6 +1,6 @@
1
  # --------------------------------------------------------
2
  # InternVL
3
- # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
  from typing import Optional, Tuple, Union
@@ -20,12 +20,18 @@ from transformers.utils import logging
20
  from .configuration_intern_vit import InternVisionConfig
21
 
22
  try:
 
 
 
 
 
 
 
23
  from flash_attn.bert_padding import pad_input, unpad_input
24
- from flash_attn.flash_attn_interface import \
25
- flash_attn_varlen_qkvpacked_func
26
  has_flash_attn = True
27
  except:
28
- print('FlashAttention2 is not installed.')
29
  has_flash_attn = False
30
 
31
  logger = logging.get_logger(__name__)
@@ -41,12 +47,12 @@ class FlashAttention(nn.Module):
41
  attention_dropout: The dropout rate to apply to the attention
42
  (default: 0.0)
43
  """
44
-
45
  def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
46
  super().__init__()
47
  self.softmax_scale = softmax_scale
48
  self.dropout_p = attention_dropout
49
-
50
  def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
51
  max_s=None, need_weights=False):
52
  """Implements the multihead softmax attention.
@@ -59,7 +65,7 @@ class FlashAttention(nn.Module):
59
  assert not need_weights
60
  assert qkv.dtype in [torch.float16, torch.bfloat16]
61
  assert qkv.is_cuda
62
-
63
  if cu_seqlens is None:
64
  batch_size = qkv.shape[0]
65
  seqlen = qkv.shape[1]
@@ -68,7 +74,7 @@ class FlashAttention(nn.Module):
68
  max_s = seqlen
69
  cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
70
  device=qkv.device)
71
- output = flash_attn_varlen_qkvpacked_func(
72
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
73
  softmax_scale=self.softmax_scale, causal=causal
74
  )
@@ -78,7 +84,7 @@ class FlashAttention(nn.Module):
78
  x = rearrange(qkv, 'b s three h d -> b s (three h d)')
79
  x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
80
  x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
81
- output_unpad = flash_attn_varlen_qkvpacked_func(
82
  x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
83
  softmax_scale=self.softmax_scale, causal=causal
84
  )
@@ -87,11 +93,11 @@ class FlashAttention(nn.Module):
87
  'b s (h d) -> b s h d', h=nheads)
88
  else:
89
  assert max_s is not None
90
- output = flash_attn_varlen_qkvpacked_func(
91
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
92
  softmax_scale=self.softmax_scale, causal=causal
93
  )
94
-
95
  return output, None
96
 
97
 
@@ -123,12 +129,6 @@ except Exception:
123
  pass
124
 
125
 
126
- NORM2FN = {
127
- 'rms_norm': InternRMSNorm,
128
- 'layer_norm': nn.LayerNorm,
129
- }
130
-
131
-
132
  class InternVisionEmbeddings(nn.Module):
133
  def __init__(self, config: InternVisionConfig):
134
  super().__init__()
@@ -154,7 +154,7 @@ class InternVisionEmbeddings(nn.Module):
154
  target_dtype = pos_embed.dtype
155
  pos_embed = pos_embed.float().reshape(
156
  1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
157
- pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
158
  reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
159
  return pos_embed
160
 
@@ -267,12 +267,11 @@ class InternVisionEncoderLayer(nn.Module):
267
  super().__init__()
268
  self.embed_dim = config.hidden_size
269
  self.intermediate_size = config.intermediate_size
270
- self.norm_type = config.norm_type
271
 
272
  self.attn = InternAttention(config)
273
  self.mlp = InternMLP(config)
274
- self.norm1 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
275
- self.norm2 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
276
 
277
  self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
278
  self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
@@ -287,9 +286,9 @@ class InternVisionEncoderLayer(nn.Module):
287
  Args:
288
  hidden_states (`Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]`): input to the layer of shape `(batch, seq_len, embed_dim)`
289
  """
290
- hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states).to(hidden_states.dtype)) * self.ls1)
291
 
292
- hidden_states = hidden_states + self.drop_path2(self.mlp(self.norm2(hidden_states).to(hidden_states.dtype)) * self.ls2)
293
 
294
  return hidden_states
295
 
@@ -362,7 +361,6 @@ class InternVisionEncoder(nn.Module):
362
 
363
  class InternVisionModel(PreTrainedModel):
364
  main_input_name = 'pixel_values'
365
- _supports_flash_attn_2 = True
366
  config_class = InternVisionConfig
367
  _no_split_modules = ['InternVisionEncoderLayer']
368
 
 
1
  # --------------------------------------------------------
2
  # InternVL
3
+ # Copyright (c) 2023 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
  from typing import Optional, Tuple, Union
 
20
  from .configuration_intern_vit import InternVisionConfig
21
 
22
  try:
23
+ try: # v1
24
+ from flash_attn.flash_attn_interface import \
25
+ flash_attn_unpadded_qkvpacked_func
26
+ except: # v2
27
+ from flash_attn.flash_attn_interface import \
28
+ flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
29
+
30
  from flash_attn.bert_padding import pad_input, unpad_input
31
+
 
32
  has_flash_attn = True
33
  except:
34
+ print('FlashAttention is not installed.')
35
  has_flash_attn = False
36
 
37
  logger = logging.get_logger(__name__)
 
47
  attention_dropout: The dropout rate to apply to the attention
48
  (default: 0.0)
49
  """
50
+
51
  def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
52
  super().__init__()
53
  self.softmax_scale = softmax_scale
54
  self.dropout_p = attention_dropout
55
+
56
  def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
57
  max_s=None, need_weights=False):
58
  """Implements the multihead softmax attention.
 
65
  assert not need_weights
66
  assert qkv.dtype in [torch.float16, torch.bfloat16]
67
  assert qkv.is_cuda
68
+
69
  if cu_seqlens is None:
70
  batch_size = qkv.shape[0]
71
  seqlen = qkv.shape[1]
 
74
  max_s = seqlen
75
  cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
76
  device=qkv.device)
77
+ output = flash_attn_unpadded_qkvpacked_func(
78
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
79
  softmax_scale=self.softmax_scale, causal=causal
80
  )
 
84
  x = rearrange(qkv, 'b s three h d -> b s (three h d)')
85
  x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
86
  x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
87
+ output_unpad = flash_attn_unpadded_qkvpacked_func(
88
  x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
89
  softmax_scale=self.softmax_scale, causal=causal
90
  )
 
93
  'b s (h d) -> b s h d', h=nheads)
94
  else:
95
  assert max_s is not None
96
+ output = flash_attn_unpadded_qkvpacked_func(
97
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
98
  softmax_scale=self.softmax_scale, causal=causal
99
  )
100
+
101
  return output, None
102
 
103
 
 
129
  pass
130
 
131
 
 
 
 
 
 
 
132
  class InternVisionEmbeddings(nn.Module):
133
  def __init__(self, config: InternVisionConfig):
134
  super().__init__()
 
154
  target_dtype = pos_embed.dtype
155
  pos_embed = pos_embed.float().reshape(
156
  1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
157
+ pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False).\
158
  reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
159
  return pos_embed
160
 
 
267
  super().__init__()
268
  self.embed_dim = config.hidden_size
269
  self.intermediate_size = config.intermediate_size
 
270
 
271
  self.attn = InternAttention(config)
272
  self.mlp = InternMLP(config)
273
+ self.norm1 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
274
+ self.norm2 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
275
 
276
  self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
277
  self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
 
286
  Args:
287
  hidden_states (`Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]`): input to the layer of shape `(batch, seq_len, embed_dim)`
288
  """
289
+ hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states)) * self.ls1)
290
 
291
+ hidden_states = hidden_states + self.drop_path2(self.mlp(self.norm2(hidden_states)) * self.ls2)
292
 
293
  return hidden_states
294
 
 
361
 
362
  class InternVisionModel(PreTrainedModel):
363
  main_input_name = 'pixel_values'
 
364
  config_class = InternVisionConfig
365
  _no_split_modules = ['InternVisionEncoderLayer']
366
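Both sides of the modeling_intern_vit.py hunk above resize the patch position embeddings with bicubic interpolation so the encoder can accept tile grids other than the pretraining resolution. Below is a standalone sketch of that resizing step; the shapes are illustrative assumptions, and the real `_get_pos_embed` additionally keeps the class-token embedding separate and derives the grid size from `image_size` and `patch_size`.

```python
# Minimal sketch of the bicubic position-embedding resizing used in _get_pos_embed.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """pos_embed: (1, N, C) patch position embeddings, with N a perfect square."""
    _, N, C = pos_embed.shape
    side = int(N ** 0.5)
    target_dtype = pos_embed.dtype
    grid = pos_embed.float().reshape(1, side, side, C).permute(0, 3, 1, 2)   # (1, C, side, side)
    grid = F.interpolate(grid, size=(H, W), mode='bicubic', align_corners=False)
    return grid.reshape(1, C, H * W).permute(0, 2, 1).to(target_dtype)       # (1, H*W, C)

pos_embed = torch.randn(1, 32 * 32, 64)            # illustrative: 32x32 grid, 64-dim embeddings
print(resize_pos_embed(pos_embed, 24, 48).shape)   # torch.Size([1, 1152, 64])
```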
 
modeling_internlm2.py CHANGED
@@ -48,18 +48,6 @@ _CONFIG_FOR_DOC = 'InternLM2Config'
48
 
49
  flash_attn_func, flash_attn_varlen_func = None, None
50
  pad_input, index_first_axis, unpad_input = None, None, None
51
- try:
52
- from flash_attn import flash_attn_func as _flash_attn_func
53
- from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
54
- from flash_attn.bert_padding import index_first_axis as _index_first_axis
55
- from flash_attn.bert_padding import pad_input as _pad_input
56
- from flash_attn.bert_padding import unpad_input as _unpad_input
57
-
58
- flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
59
- pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
60
- has_flash_attn = True
61
- except:
62
- has_flash_attn = False
63
 
64
 
65
  def _import_flash_attn():
@@ -161,7 +149,7 @@ class InternLM2RotaryEmbedding(nn.Module):
161
 
162
  def _set_cos_sin_cache(self, seq_len, device, dtype):
163
  self.max_seq_len_cached = seq_len
164
- t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
165
 
166
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
167
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -190,7 +178,7 @@ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
190
 
191
  def _set_cos_sin_cache(self, seq_len, device, dtype):
192
  self.max_seq_len_cached = seq_len
193
- t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
194
  t = t / self.scaling_factor
195
 
196
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
@@ -220,7 +208,7 @@ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
220
  inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
221
  self.register_buffer('inv_freq', inv_freq, persistent=False)
222
 
223
- t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
224
 
225
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
226
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -709,7 +697,6 @@ class InternLM2PreTrainedModel(PreTrainedModel):
709
  supports_gradient_checkpointing = True
710
  _no_split_modules = ['InternLM2DecoderLayer']
711
  _skip_keys_device_placement = 'past_key_values'
712
- _supports_flash_attn_2 = True
713
 
714
  def _init_weights(self, module):
715
  std = self.config.initializer_range
@@ -808,9 +795,6 @@ class InternLM2Model(InternLM2PreTrainedModel):
808
  self.padding_idx = config.pad_token_id
809
  self.vocab_size = config.vocab_size
810
  self.config = config
811
- if not has_flash_attn:
812
- self.config.attn_implementation = 'eager'
813
- print('Warning: Flash attention is not available, using eager attention instead.')
814
 
815
  self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
816
 
@@ -1098,16 +1082,13 @@ class InternLM2ForCausalLM(InternLM2PreTrainedModel):
1098
  output = (logits,) + outputs[1:]
1099
  return (loss,) + output if loss is not None else output
1100
 
1101
- device = input_ids.device if input_ids is not None else inputs_embeds.device
1102
- output = CausalLMOutputWithPast(
1103
  loss=loss,
1104
  logits=logits,
1105
  past_key_values=outputs.past_key_values,
1106
  hidden_states=outputs.hidden_states,
1107
  attentions=outputs.attentions,
1108
  )
1109
- output['logits'] = output['logits'].to(device)
1110
- return output
1111
 
1112
  def prepare_inputs_for_generation(
1113
  self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
 
48
 
49
  flash_attn_func, flash_attn_varlen_func = None, None
50
  pad_input, index_first_axis, unpad_input = None, None, None
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
 
53
  def _import_flash_attn():
 
149
 
150
  def _set_cos_sin_cache(self, seq_len, device, dtype):
151
  self.max_seq_len_cached = seq_len
152
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
153
 
154
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
155
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
 
178
 
179
  def _set_cos_sin_cache(self, seq_len, device, dtype):
180
  self.max_seq_len_cached = seq_len
181
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
182
  t = t / self.scaling_factor
183
 
184
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
 
208
  inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
209
  self.register_buffer('inv_freq', inv_freq, persistent=False)
210
 
211
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
212
 
213
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
214
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
 
697
  supports_gradient_checkpointing = True
698
  _no_split_modules = ['InternLM2DecoderLayer']
699
  _skip_keys_device_placement = 'past_key_values'
 
700
 
701
  def _init_weights(self, module):
702
  std = self.config.initializer_range
 
795
  self.padding_idx = config.pad_token_id
796
  self.vocab_size = config.vocab_size
797
  self.config = config
 
 
 
798
 
799
  self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
800
 
 
1082
  output = (logits,) + outputs[1:]
1083
  return (loss,) + output if loss is not None else output
1084
 
1085
+ return CausalLMOutputWithPast(
 
1086
  loss=loss,
1087
  logits=logits,
1088
  past_key_values=outputs.past_key_values,
1089
  hidden_states=outputs.hidden_states,
1090
  attentions=outputs.attentions,
1091
  )
 
 
1092
 
1093
  def prepare_inputs_for_generation(
1094
  self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
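The `_set_cos_sin_cache` hunks above differ only in how `t` is constructed (an `arange` followed by a dtype cast versus an `arange` with an explicit dtype); the cached tables themselves are the standard rotary-embedding quantities. Here is a small self-contained sketch of that cache construction, with an illustrative head dimension and base rather than the model's configured values.

```python
# Minimal sketch of the rotary cos/sin cache built by _set_cos_sin_cache.
# dim=64, base=10000 and max_seq_len=8 are illustrative assumptions.
import torch

dim, base, max_seq_len = 64, 10000.0, 8
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))   # (dim // 2,)
t = torch.arange(max_seq_len, dtype=inv_freq.dtype)                  # (max_seq_len,)
freqs = torch.einsum('i,j->ij', t, inv_freq)                         # (max_seq_len, dim // 2)
emb = torch.cat((freqs, freqs), dim=-1)                              # duplicated halves, matching the permuted layout
cos_cached, sin_cached = emb.cos(), emb.sin()
print(cos_cached.shape, sin_cached.shape)  # torch.Size([8, 64]) torch.Size([8, 64])
```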
modeling_internvl_chat.py CHANGED
@@ -1,46 +1,70 @@
1
  # --------------------------------------------------------
2
  # InternVL
3
- # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
  import warnings
7
  from typing import Any, List, Optional, Tuple, Union
8
 
9
  import torch.utils.checkpoint
10
- import transformers
11
  from torch import nn
12
  from torch.nn import CrossEntropyLoss
13
- from transformers import AutoModel, GenerationConfig, LlamaForCausalLM
 
14
  from transformers.modeling_outputs import CausalLMOutputWithPast
15
  from transformers.modeling_utils import PreTrainedModel
16
  from transformers.utils import ModelOutput, logging
17
 
18
  from .configuration_internvl_chat import InternVLChatConfig
19
- from .conversation import get_conv_template
20
- from .modeling_intern_vit import InternVisionModel, has_flash_attn
21
  from .modeling_internlm2 import InternLM2ForCausalLM
22
 
23
  logger = logging.get_logger(__name__)
24
 
25
 
26
- def version_cmp(v1, v2, op='eq'):
27
- import operator
 
 
 
28
 
29
- from packaging import version
30
- op_func = getattr(operator, op)
31
- return op_func(version.parse(v1), version.parse(v2))
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32
 
33
 
34
  class InternVLChatModel(PreTrainedModel):
35
  config_class = InternVLChatConfig
36
  main_input_name = 'pixel_values'
37
- _supports_flash_attn_2 = True
38
- _no_split_modules = ['InternVisionModel', 'LlamaDecoderLayer', 'InternLM2DecoderLayer']
39
 
40
- def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None, use_flash_attn=True):
41
  super().__init__(config)
42
 
43
- assert version_cmp(transformers.__version__, '4.36.2', 'ge')
44
  image_size = config.force_image_size or config.vision_config.image_size
45
  patch_size = config.vision_config.patch_size
46
  self.patch_size = patch_size
@@ -48,10 +72,8 @@ class InternVLChatModel(PreTrainedModel):
48
  self.template = config.template
49
  self.num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
50
  self.downsample_ratio = config.downsample_ratio
 
51
  self.ps_version = config.ps_version
52
- use_flash_attn = use_flash_attn if has_flash_attn else False
53
- config.vision_config.use_flash_attn = True if use_flash_attn else False
54
- config.llm_config.attn_implementation = 'flash_attention_2' if use_flash_attn else 'eager'
55
 
56
  logger.info(f'num_image_token: {self.num_image_token}')
57
  logger.info(f'ps_version: {self.ps_version}')
@@ -79,9 +101,44 @@ class InternVLChatModel(PreTrainedModel):
79
  nn.Linear(llm_hidden_size, llm_hidden_size)
80
  )
81
 
 
 
 
 
 
 
 
82
  self.img_context_token_id = None
83
- self.conv_template = get_conv_template(self.template)
84
- self.system_message = self.conv_template.system_message
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
85
 
86
  def forward(
87
  self,
@@ -178,7 +235,17 @@ class InternVLChatModel(PreTrainedModel):
178
  x = x.permute(0, 2, 1, 3).contiguous()
179
  return x
180
 
 
 
 
 
 
 
181
  def extract_feature(self, pixel_values):
 
 
 
 
182
  if self.select_layer == -1:
183
  vit_embeds = self.vision_model(
184
  pixel_values=pixel_values,
@@ -191,96 +258,50 @@ class InternVLChatModel(PreTrainedModel):
191
  return_dict=True).hidden_states[self.select_layer]
192
  vit_embeds = vit_embeds[:, 1:, :]
193
 
 
 
 
 
 
 
 
 
 
194
  h = w = int(vit_embeds.shape[1] ** 0.5)
195
  vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
196
  vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
197
  vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, vit_embeds.shape[-1])
 
 
198
  vit_embeds = self.mlp1(vit_embeds)
199
  return vit_embeds
200
 
201
- def batch_chat(self, tokenizer, pixel_values, questions, generation_config, num_patches_list=None,
202
- history=None, return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
203
- IMG_CONTEXT_TOKEN='<IMG_CONTEXT>', verbose=False, image_counts=None):
204
- if history is not None or return_history:
205
- print('Now multi-turn chat is not supported in batch_chat.')
206
- raise NotImplementedError
207
-
208
- if image_counts is not None:
209
- num_patches_list = image_counts
210
- print('Warning: `image_counts` is deprecated. Please use `num_patches_list` instead.')
211
-
212
- img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
213
- self.img_context_token_id = img_context_token_id
214
-
215
- if verbose and pixel_values is not None:
216
- image_bs = pixel_values.shape[0]
217
- print(f'dynamic ViT batch size: {image_bs}')
218
-
219
- queries = []
220
- for idx, num_patches in enumerate(num_patches_list):
221
- question = questions[idx]
222
- if pixel_values is not None and '<image>' not in question:
223
- question = '<image>\n' + question
224
- template = get_conv_template(self.template)
225
- template.system_message = self.system_message
226
- template.append_message(template.roles[0], question)
227
- template.append_message(template.roles[1], None)
228
- query = template.get_prompt()
229
-
230
- image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
231
- query = query.replace('<image>', image_tokens, 1)
232
- queries.append(query)
233
-
234
- tokenizer.padding_side = 'left'
235
- model_inputs = tokenizer(queries, return_tensors='pt', padding=True)
236
- input_ids = model_inputs['input_ids'].cuda()
237
- attention_mask = model_inputs['attention_mask'].cuda()
238
- eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
239
- generation_config['eos_token_id'] = eos_token_id
240
- generation_output = self.generate(
241
- pixel_values=pixel_values,
242
- input_ids=input_ids,
243
- attention_mask=attention_mask,
244
- **generation_config
245
- )
246
- responses = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
247
- responses = [response.split(template.sep)[0].strip() for response in responses]
248
- return responses
249
-
250
  def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
251
- num_patches_list=None, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>',
252
- verbose=False):
253
-
254
- if history is None and pixel_values is not None and '<image>' not in question:
255
- question = '<image>\n' + question
256
-
257
- if num_patches_list is None:
258
- num_patches_list = [pixel_values.shape[0]] if pixel_values is not None else []
259
- assert pixel_values is None or len(pixel_values) == sum(num_patches_list)
260
 
261
  img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
262
  self.img_context_token_id = img_context_token_id
 
 
 
 
263
 
264
- template = get_conv_template(self.template)
265
- template.system_message = self.system_message
266
- eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
267
 
268
- history = [] if history is None else history
269
- for (old_question, old_answer) in history:
270
- template.append_message(template.roles[0], old_question)
271
- template.append_message(template.roles[1], old_answer)
 
 
 
 
 
 
 
272
  template.append_message(template.roles[0], question)
273
  template.append_message(template.roles[1], None)
274
  query = template.get_prompt()
275
-
276
- if verbose and pixel_values is not None:
277
- image_bs = pixel_values.shape[0]
278
- print(f'dynamic ViT batch size: {image_bs}')
279
-
280
- for num_patches in num_patches_list:
281
- image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
282
- query = query.replace('<image>', image_tokens, 1)
283
-
284
  model_inputs = tokenizer(query, return_tensors='pt')
285
  input_ids = model_inputs['input_ids'].cuda()
286
  attention_mask = model_inputs['attention_mask'].cuda()
@@ -292,16 +313,15 @@ class InternVLChatModel(PreTrainedModel):
292
  **generation_config
293
  )
294
  response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0]
295
- response = response.split(template.sep)[0].strip()
296
  history.append((question, response))
297
  if return_history:
298
  return response, history
299
  else:
300
- query_to_print = query.replace(IMG_CONTEXT_TOKEN, '')
301
- query_to_print = query_to_print.replace(f'{IMG_START_TOKEN}{IMG_END_TOKEN}', '<image>')
302
- if verbose:
303
- print(query_to_print, response)
304
  return response
 
305
 
306
  @torch.no_grad()
307
  def generate(
@@ -322,6 +342,7 @@ class InternVLChatModel(PreTrainedModel):
322
  vit_embeds = visual_features
323
  else:
324
  vit_embeds = self.extract_feature(pixel_values)
 
325
  input_embeds = self.language_model.get_input_embeddings()(input_ids)
326
  B, N, C = input_embeds.shape
327
  input_embeds = input_embeds.reshape(B * N, C)
@@ -329,7 +350,7 @@ class InternVLChatModel(PreTrainedModel):
329
  input_ids = input_ids.reshape(B * N)
330
  selected = (input_ids == self.img_context_token_id)
331
  assert selected.sum() != 0
332
- input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device)
333
 
334
  input_embeds = input_embeds.reshape(B, N, C)
335
  else:
 
1
  # --------------------------------------------------------
2
  # InternVL
3
+ # Copyright (c) 2023 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
  import warnings
7
  from typing import Any, List, Optional, Tuple, Union
8
 
9
  import torch.utils.checkpoint
10
+ from peft import LoraConfig, get_peft_model
11
  from torch import nn
12
  from torch.nn import CrossEntropyLoss
13
+ from transformers import (AutoModel, GenerationConfig, LlamaForCausalLM,
14
+ LlamaTokenizer)
15
  from transformers.modeling_outputs import CausalLMOutputWithPast
16
  from transformers.modeling_utils import PreTrainedModel
17
  from transformers.utils import ModelOutput, logging
18
 
19
  from .configuration_internvl_chat import InternVLChatConfig
20
+ from .modeling_intern_vit import InternVisionModel
 
21
  from .modeling_internlm2 import InternLM2ForCausalLM
22
 
23
  logger = logging.get_logger(__name__)
24
 
25
 
26
+ def window_partition(x, window_size):
27
+ """
28
+ Args:
29
+ x: (B, C, H, W)
30
+ window_size (int): window size, assuming square window
31
 
32
+ Returns:
33
+ windows: (num_windows*B, C, window_size, window_size)
34
+ """
35
+ B, C, H, W = x.shape
36
+ assert H % window_size == 0 and W % window_size == 0, 'H and W must be divisible by window_size'
37
+
38
+ x = x.view(B, C, H // window_size, window_size, W // window_size, window_size)
39
+ windows = x.permute(0, 2, 4, 1, 3, 5).contiguous().view(-1, C, window_size, window_size)
40
+ return windows
41
+
42
+
43
+ def window_reverse(windows, window_size, H, W):
44
+ """
45
+ Args:
46
+ windows: (num_windows*B, window_size, window_size, C)
47
+ window_size (int): Window size
48
+ H (int): Height of image
49
+ W (int): Width of image
50
+
51
+ Returns:
52
+ x: (B, H * W, C)
53
+ """
54
+ B = int(windows.shape[0] / (H * W / window_size / window_size))
55
+ x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
56
+ x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H * W, -1)
57
+ return x
58
 
59
 
60
 class InternVLChatModel(PreTrainedModel):
     config_class = InternVLChatConfig
     main_input_name = 'pixel_values'
+    _no_split_modules = ['InternVisionEncoderLayer', 'LlamaDecoderLayer', 'LlamaForCausalLM']
 
+    def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None):
         super().__init__(config)
 
         image_size = config.force_image_size or config.vision_config.image_size
         patch_size = config.vision_config.patch_size
         self.patch_size = patch_size
 
         self.template = config.template
         self.num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
         self.downsample_ratio = config.downsample_ratio
+        self.image_fold = config.image_fold
         self.ps_version = config.ps_version
 
         logger.info(f'num_image_token: {self.num_image_token}')
         logger.info(f'ps_version: {self.ps_version}')
 
             nn.Linear(llm_hidden_size, llm_hidden_size)
         )
 
+        # if config.force_image_size != config.vision_config.image_size:
+        #     self.vision_model.resize_pos_embeddings(
+        #         old_size=config.vision_config.image_size,
+        #         new_size=config.force_image_size,
+        #         patch_size=config.vision_config.patch_size
+        #     )
+
         self.img_context_token_id = None
+        self.neftune_alpha = None
+
+        if config.use_backbone_lora:
+            self.wrap_backbone_lora(r=config.use_backbone_lora, lora_alpha=2 * config.use_backbone_lora)
+
+        if config.use_llm_lora:
+            self.wrap_llm_lora(r=config.use_llm_lora, lora_alpha=2 * config.use_llm_lora)
+
+    def wrap_backbone_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
+        lora_config = LoraConfig(
+            r=r,
+            target_modules=['attn.qkv', 'attn.proj', 'mlp.fc1', 'mlp.fc2'],
+            lora_alpha=lora_alpha,
+            lora_dropout=lora_dropout,
+        )
+        self.vision_model = get_peft_model(self.vision_model, lora_config)
+        self.vision_model.print_trainable_parameters()
+
+    def wrap_llm_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
+        lora_config = LoraConfig(
+            r=r,
+            target_modules=['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj', 'self_attn.o_proj',
+                            'mlp.gate_proj', 'mlp.down_proj', 'mlp.up_proj'],
+            lora_alpha=lora_alpha,
+            lora_dropout=lora_dropout,
+            task_type='CAUSAL_LM'
+        )
+        self.language_model = get_peft_model(self.language_model, lora_config)
+        self.language_model.enable_input_require_grads()
+        self.language_model.print_trainable_parameters()
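
A hedged aside, not in the diff: `config.use_backbone_lora` and `config.use_llm_lora` are integers that double as the LoRA rank, with `lora_alpha` fixed at twice the rank by `__init__`. Setting `use_llm_lora = 128` therefore expands to roughly this standalone `peft` configuration (the `get_peft_model` call is left commented because `language_model` only exists inside the class):

```python
from peft import LoraConfig, get_peft_model

# Equivalent expansion of self.wrap_llm_lora(r=128, lora_alpha=256) above.
lora_config = LoraConfig(
    r=128,                                  # rank taken directly from config.use_llm_lora
    lora_alpha=256,                         # always 2 * r in this wrapper
    lora_dropout=0.05,
    target_modules=['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj',
                    'self_attn.o_proj', 'mlp.gate_proj', 'mlp.down_proj', 'mlp.up_proj'],
    task_type='CAUSAL_LM',
)
# language_model = get_peft_model(language_model, lora_config)
# language_model.print_trainable_parameters()
```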
 
     def forward(
             self,
 
         x = x.permute(0, 2, 1, 3).contiguous()
         return x
 
+    def noised_embed(self, vit_embeds, noise_alpha=5):
+        dims = torch.tensor(vit_embeds.size(1) * vit_embeds.size(2))
+        mag_norm = noise_alpha / torch.sqrt(dims)
+        noise = torch.zeros_like(vit_embeds).uniform_(-mag_norm, mag_norm)
+        return vit_embeds + noise
+
     def extract_feature(self, pixel_values):
+        if self.image_fold:
+            image_size = pixel_values.size(-1)  # B, C, H, W
+            pixel_values = window_partition(pixel_values, window_size=image_size // self.image_fold)  # 4B, C, H/2, W/2
+
         if self.select_layer == -1:
             vit_embeds = self.vision_model(
                 pixel_values=pixel_values,
 
                 return_dict=True).hidden_states[self.select_layer]
         vit_embeds = vit_embeds[:, 1:, :]
 
+        if self.training and self.neftune_alpha is not None:
+            vit_embeds = self.noised_embed(vit_embeds, self.neftune_alpha)
+
+        if self.image_fold:
+            vit_embeds = window_reverse(vit_embeds, window_size=image_size // (self.image_fold * self.patch_size),
+                                        H=image_size // self.patch_size, W=image_size // self.patch_size)
+
+        # if torch.distributed.get_rank() == 0:
+        #     print("before pixel shuffle:", vit_embeds.shape)
         h = w = int(vit_embeds.shape[1] ** 0.5)
         vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
         vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
         vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, vit_embeds.shape[-1])
+        # if torch.distributed.get_rank() == 0:
+        #     print("after pixel shuffle:", vit_embeds.shape)
         vit_embeds = self.mlp1(vit_embeds)
         return vit_embeds
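
An aside, not from the repo: `noised_embed` looks like a NEFTune-style perturbation on the visual embeddings, drawing uniform noise with a per-element bound of `alpha / sqrt(L * d)`. With illustrative sizes (1024 visual tokens, hidden size 3200, `neftune_alpha = 5`) the bound comes out near 0.0028:

```python
import torch

L, d, alpha = 1024, 3200, 5.0               # illustrative token count, hidden size, neftune_alpha
mag_norm = alpha / (L * d) ** 0.5           # ~0.00276 per-element noise bound
vit_embeds = torch.randn(2, L, d)
noise = torch.empty_like(vit_embeds).uniform_(-mag_norm, mag_norm)
noisy = vit_embeds + noise                  # what noised_embed() would return during training
print(mag_norm, noise.abs().max().item())
```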
     def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
+             IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
 
         img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
         self.img_context_token_id = img_context_token_id
+        if tokenizer.convert_tokens_to_ids('<|im_end|>') != 0:
+            eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
+        else:
+            eos_token_id = tokenizer.eos_token_id
 
+        from .conversation import get_conv_template
 
+        template = get_conv_template(self.template)
+        image_bs = pixel_values.shape[0]
+        print(f'dynamic ViT batch size: {image_bs}')
+        if history is None:
+            history = []
+            image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_bs + IMG_END_TOKEN
+            question = image_tokens + '\n' + question
+        else:
+            for (old_question, old_answer) in history:
+                template.append_message(template.roles[0], old_question)
+                template.append_message(template.roles[1], old_answer)
         template.append_message(template.roles[0], question)
         template.append_message(template.roles[1], None)
         query = template.get_prompt()
         model_inputs = tokenizer(query, return_tensors='pt')
         input_ids = model_inputs['input_ids'].cuda()
         attention_mask = model_inputs['attention_mask'].cuda()
 
             **generation_config
         )
         response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0]
+        response = response.split('<|im_end|>')[0].strip()  # for InternLM2
         history.append((question, response))
         if return_history:
             return response, history
         else:
+            query_to_print = query.replace(image_tokens, '<image>')
+            print(query_to_print, response)
             return response
+        return response
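For orientation, a hedged usage sketch of the `chat()` entry point above. The checkpoint path matches this repo, but `pixel_values` is faked with a random tile; a real call would pass the dynamically tiled 448 px crops, each of which is expanded into 256 `<IMG_CONTEXT>` placeholders by the code above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL-Chat-V1-5'
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Stand-in for real preprocessing: one 448 x 448 tile with shape (num_tiles, 3, H, W).
pixel_values = torch.randn(1, 3, 448, 448, dtype=torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=512, do_sample=False)
response, history = model.chat(tokenizer, pixel_values, 'Describe the image.',
                               generation_config, history=None, return_history=True)
print(response)
```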
 
     @torch.no_grad()
     def generate(
 
                 vit_embeds = visual_features
             else:
                 vit_embeds = self.extract_feature(pixel_values)
+
             input_embeds = self.language_model.get_input_embeddings()(input_ids)
             B, N, C = input_embeds.shape
             input_embeds = input_embeds.reshape(B * N, C)
 
             input_ids = input_ids.reshape(B * N)
             selected = (input_ids == self.img_context_token_id)
             assert selected.sum() != 0
+            input_embeds[selected] = vit_embeds.reshape(-1, C)
 
             input_embeds = input_embeds.reshape(B, N, C)
         else:
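The hunk above ends with the masked splice shared by `forward` and `generate`: every position holding the `IMG_CONTEXT` token id is overwritten with a visual embedding. A toy, self-contained illustration (the id and sizes are made up):

```python
import torch

IMG_CONTEXT_ID = 92546                      # hypothetical id; the real one comes from the tokenizer
input_ids = torch.tensor([[1, IMG_CONTEXT_ID, IMG_CONTEXT_ID, 7, 2]])
input_embeds = torch.zeros(1, 5, 8)         # (B, N, C) text embeddings, toy hidden size C=8
vit_embeds = torch.ones(2, 8)               # two visual tokens to splice in

B, N, C = input_embeds.shape
flat = input_embeds.reshape(B * N, C)
selected = (input_ids.reshape(B * N) == IMG_CONTEXT_ID)
flat[selected] = vit_embeds.reshape(-1, C)
input_embeds = flat.reshape(B, N, C)        # rows 1 and 2 now carry the visual embeddings
print(input_embeds[0, :, 0])                # tensor([0., 1., 1., 0., 0.])
```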
preprocessor_config.json DELETED
@@ -1,19 +0,0 @@
-{
-  "crop_size": 448,
-  "do_center_crop": true,
-  "do_normalize": true,
-  "do_resize": true,
-  "feature_extractor_type": "CLIPFeatureExtractor",
-  "image_mean": [
-    0.485,
-    0.456,
-    0.406
-  ],
-  "image_std": [
-    0.229,
-    0.224,
-    0.225
-  ],
-  "resample": 3,
-  "size": 448
-}
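The deleted file pinned CLIP-style preprocessing: shortest edge resized to 448 with bicubic resampling (`resample: 3`), a 448 px center crop, and ImageNet mean/std normalization. A rough torchvision equivalent, offered only as a sketch of what the removed config described, not as code shipped by this repo:

```python
import torchvision.transforms as T

# Approximation of the removed preprocessor_config.json (resample=3 is PIL's BICUBIC).
preprocess = T.Compose([
    T.Resize(448, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(448),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# pixel_values = preprocess(pil_image).unsqueeze(0)   # pil_image is a hypothetical PIL.Image input
```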
examples/red-panda.mp4 → runs/Apr15_16-44-40_SH-IDC1-10-140-37-13/events.out.tfevents.1713171220.SH-IDC1-10-140-37-13.204150.0 RENAMED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d921c07bb97224d65a37801541d246067f0d506f08723ffa1ad85c217907ccb8
-size 1867237
 
 version https://git-lfs.github.com/spec/v1
+oid sha256:294d5bf755e6dea5c005c57af52e958a38bb42a7d17d801a25a6543bfe6ddca2
+size 16662
runs/Apr15_17-33-22_SH-IDC1-10-140-37-13/events.out.tfevents.1713174123.SH-IDC1-10-140-37-13.259480.0 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:57d61c0e776bfb521e58febdbd99525e011f82137ceaaa655ffa6e2b3a9b02a9
+size 72471