czczup committed on
Commit
0a46f73
1 Parent(s): fb211a0

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,51 +1,56 @@
1
  ---
2
  license: mit
3
  datasets:
4
- - laion/laion2B-en
5
- - laion/laion-coco
6
- - laion/laion2B-multi
7
- - kakaobrain/coyo-700m
8
- - conceptual_captions
9
- - wanng/wukong100m
10
  pipeline_tag: visual-question-answering
11
  ---
12
 
13
- # Model Card for InternVL-Chat-V1.5-Int8
 
14
  <p align="center">
15
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
16
  </p>
17
 
18
  > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
19
 
20
- \[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
21
 
22
  We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
23
- We introduce three simple designs:
24
- 1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs.
25
- 2. Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448 &times; 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input.
26
- 3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
27
 
 
 
 
28
 
29
  ## Model Details
 
30
  - **Model Type:** multimodal large language model (MLLM)
 
31
  - **Model Stats:**
 
32
  - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
33
  - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
34
  - Params: 25.5B
35
 
36
  - **Training Strategy:**
 
37
  - Learnable component in the pretraining stage: ViT + MLP
38
  - Learnable component in the finetuning stage: ViT + MLP + LLM
39
  - For more details on training hyperparameters, take a look at our code: [pretrain](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_pretrain.sh) | [finetune](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_finetune.sh)
40
 
41
  ## Released Models
42
 
43
- | Model | Vision Foundation Model | Release Date |Note |
44
- | :---------------------------------------------------------:|:--------------------------------------------------------------------------: |:----------------------:| :---------------------------------- |
45
- | InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) |2024.04.18 | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new)|
46
- | InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |2024.02.21 | more SFT data and stronger |
47
- | InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |2024.02.11 | scaling up LLM to 34B |
48
- | InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) |InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) |2024.01.24 | support Chinese and stronger OCR |
49
 
50
  ## Architecture
51
 
@@ -70,7 +75,7 @@ We introduce three simple designs:
70
 
71
  We provide an example code to run InternVL-Chat-V1.5 using `transformers`.
72
 
73
- You also can use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
74
 
75
  > Please use transformers==4.37.2 to ensure the model works normally.
76
 
@@ -231,7 +236,6 @@ responses = model.batch_chat(tokenizer, pixel_values,
231
  for question, response in zip(questions, responses):
232
  print(question)
233
  print(response)
234
-
235
  ```
236
 
237
  ## Citation
@@ -245,12 +249,18 @@ If you find this project useful in your research, please consider citing:
245
  journal={arXiv preprint arXiv:2312.14238},
246
  year={2023}
247
  }
 
 
 
 
 
 
248
  ```
249
 
250
  ## License
251
 
252
- This project is released under the MIT license.
253
 
254
  ## Acknowledgement
255
 
256
- InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
 
1
  ---
2
  license: mit
3
  datasets:
4
+ - laion/laion2B-en
5
+ - laion/laion-coco
6
+ - laion/laion2B-multi
7
+ - kakaobrain/coyo-700m
8
+ - conceptual_captions
9
+ - wanng/wukong100m
10
  pipeline_tag: visual-question-answering
11
  ---
12
 
13
+ # Model Card for InternVL-Chat-V1-5-Int8
14
+
15
  <p align="center">
16
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
17
  </p>
18
 
19
  > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
20
 
21
+ \[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]
22
 
23
  We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
24
+ We introduce three simple designs:
 
 
 
25
 
26
+ 1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs.
27
+ 2. Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448 × 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input.
28
+ 3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
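Reviewer note: the Dynamic High-Resolution item above says an image is split into 1 to 40 tiles of 448 × 448 pixels according to its aspect ratio and resolution. The sketch below is one simplified way to pick such a grid; it is not the repository's preprocessing code. The function name `pick_tile_grid` and the tie-breaking rule (prefer the grid whose pixel area is closest to the image's) are assumptions made for illustration.

```python
from typing import Tuple

TILE = 448  # tile side length in pixels, as stated in the README


def pick_tile_grid(width: int, height: int, max_tiles: int = 40) -> Tuple[int, int]:
    """Return (cols, rows) for a grid of TILE x TILE tiles.

    Illustrative rule (an assumption, not the project's exact logic):
    pick the grid whose aspect ratio is closest to the image's, and break
    ties by how closely the grid's pixel area matches the image's area.
    """
    aspect = width / height
    # Every grid shape that uses between 1 and max_tiles tiles.
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]

    def score(grid: Tuple[int, int]) -> Tuple[float, int]:
        cols, rows = grid
        ratio_gap = abs(aspect - cols / rows)
        area_gap = abs(cols * rows * TILE * TILE - width * height)
        return (ratio_gap, area_gap)

    return min(candidates, key=score)


if __name__ == "__main__":
    for size in [(448, 448), (1792, 896), (3840, 2160)]:
        cols, rows = pick_tile_grid(*size)
        print(size, "->", cols, "x", rows, "tiles")
```

Under this rule a square 448 × 448 image maps to a single tile, while wider or larger inputs get proportionally more columns and rows, always within the 40-tile budget.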
29
 
30
  ## Model Details
31
+
32
  - **Model Type:** multimodal large language model (MLLM)
33
+
34
  - **Model Stats:**
35
+
36
  - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
37
  - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
38
  - Params: 25.5B
39
 
40
  - **Training Strategy:**
41
+
42
  - Learnable component in the pretraining stage: ViT + MLP
43
  - Learnable component in the finetuning stage: ViT + MLP + LLM
44
  - For more details on training hyperparameters, take a look at our code: [pretrain](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_pretrain.sh) | [finetune](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_finetune.sh)
45
 
46
  ## Released Models
47
 
48
+ | Model | Vision Foundation Model | Release Date | Note |
49
+ | :----------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------------: | :----------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
50
+ | InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
51
+ | InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger |
52
+ | InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scaling up LLM to 34B |
53
+ | InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | support Chinese and stronger OCR |
54
 
55
  ## Architecture
56
 
 
75
 
76
  We provide an example code to run InternVL-Chat-V1.5 using `transformers`.
77
 
78
+ You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
79
 
80
  > Please use transformers==4.37.2 to ensure the model works normally.
81
 
 
236
  for question, response in zip(questions, responses):
237
  print(question)
238
  print(response)
 
239
  ```
240
 
241
  ## Citation
 
249
  journal={arXiv preprint arXiv:2312.14238},
250
  year={2023}
251
  }
252
+ @article{chen2024far,
253
+ title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
254
+ author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
255
+ journal={arXiv preprint arXiv:2404.16821},
256
+ year={2024}
257
+ }
258
  ```
259
 
260
  ## License
261
 
262
+ This project is released under the MIT license.
263
 
264
  ## Acknowledgement
265
 
266
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
config.json CHANGED
@@ -6,7 +6,8 @@
6
  ],
7
  "auto_map": {
8
  "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
9
- "AutoModel": "modeling_internvl_chat.InternVLChatModel"
 
10
  },
11
  "downsample_ratio": 0.5,
12
  "dynamic_image_size": true,
 
6
  ],
7
  "auto_map": {
8
  "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
9
+ "AutoModel": "modeling_internvl_chat.InternVLChatModel",
10
+ "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
11
  },
12
  "downsample_ratio": 0.5,
13
  "dynamic_image_size": true,
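Reviewer note: the change above adds an `AutoModelForCausalLM` entry to `auto_map`, pointing at the same class as `AutoModel`. Each `auto_map` value is a `module.ClassName` string. The sketch below only parses such a fragment to show how an entry splits into its two parts; `resolve_auto_class` is a hypothetical helper written for this note, not a `transformers` function.

```python
import json

# Fragment of the updated config.json, copied from the diff above.
CONFIG_FRAGMENT = """
{
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  }
}
"""


def resolve_auto_class(config: dict, auto_class_name: str) -> tuple:
    """Hypothetical helper: split an auto_map entry into (module, class)."""
    target = config["auto_map"][auto_class_name]
    module_name, class_name = target.rsplit(".", 1)
    return module_name, class_name


config = json.loads(CONFIG_FRAGMENT)

if __name__ == "__main__":
    for name in ("AutoModel", "AutoModelForCausalLM"):
        print(name, "->", resolve_auto_class(config, name))
```

With the new entry in place, loading through `AutoModelForCausalLM` with `trust_remote_code=True` should resolve to `InternVLChatModel`, the same class `AutoModel` already used.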
configuration_intern_vit.py CHANGED
@@ -73,6 +73,7 @@ class InternVisionConfig(PretrainedConfig):
73
  num_hidden_layers=48,
74
  use_flash_attn=True,
75
  hidden_act='gelu',
 
76
  layer_norm_eps=1e-6,
77
  dropout=0.0,
78
  drop_path_rate=0.0,
@@ -97,6 +98,7 @@ class InternVisionConfig(PretrainedConfig):
97
  self.attention_dropout = attention_dropout
98
  self.layer_norm_eps = layer_norm_eps
99
  self.hidden_act = hidden_act
 
100
  self.qkv_bias = qkv_bias
101
  self.qk_normalization = qk_normalization
102
  self.use_flash_attn = use_flash_attn
 
73
  num_hidden_layers=48,
74
  use_flash_attn=True,
75
  hidden_act='gelu',
76
+ norm_type='rms_norm',
77
  layer_norm_eps=1e-6,
78
  dropout=0.0,
79
  drop_path_rate=0.0,
 
98
  self.attention_dropout = attention_dropout
99
  self.layer_norm_eps = layer_norm_eps
100
  self.hidden_act = hidden_act
101
+ self.norm_type = norm_type
102
  self.qkv_bias = qkv_bias
103
  self.qk_normalization = qk_normalization
104
  self.use_flash_attn = use_flash_attn
conversation.py CHANGED
@@ -1258,4 +1258,3 @@ register_conv_template(
1258
  sep2='</s>',
1259
  )
1260
  )
1261
-
 
1258
  sep2='</s>',
1259
  )
1260
  )
 
modeling_intern_vit.py CHANGED
@@ -26,9 +26,9 @@ try:
26
  except: # v2
27
  from flash_attn.flash_attn_interface import \
28
  flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
29
-
30
  from flash_attn.bert_padding import pad_input, unpad_input
31
-
32
  has_flash_attn = True
33
  except:
34
  print('FlashAttention is not installed.')
@@ -47,12 +47,12 @@ class FlashAttention(nn.Module):
47
  attention_dropout: The dropout rate to apply to the attention
48
  (default: 0.0)
49
  """
50
-
51
  def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
52
  super().__init__()
53
  self.softmax_scale = softmax_scale
54
  self.dropout_p = attention_dropout
55
-
56
  def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
57
  max_s=None, need_weights=False):
58
  """Implements the multihead softmax attention.
@@ -65,7 +65,7 @@ class FlashAttention(nn.Module):
65
  assert not need_weights
66
  assert qkv.dtype in [torch.float16, torch.bfloat16]
67
  assert qkv.is_cuda
68
-
69
  if cu_seqlens is None:
70
  batch_size = qkv.shape[0]
71
  seqlen = qkv.shape[1]
@@ -97,7 +97,7 @@ class FlashAttention(nn.Module):
97
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
98
  softmax_scale=self.softmax_scale, causal=causal
99
  )
100
-
101
  return output, None
102
 
103
 
@@ -129,6 +129,12 @@ except Exception:
129
  pass
130
 
131
 
 
 
 
 
 
 
132
  class InternVisionEmbeddings(nn.Module):
133
  def __init__(self, config: InternVisionConfig):
134
  super().__init__()
@@ -154,7 +160,7 @@ class InternVisionEmbeddings(nn.Module):
154
  target_dtype = pos_embed.dtype
155
  pos_embed = pos_embed.float().reshape(
156
  1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
157
- pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False).\
158
  reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
159
  return pos_embed
160
 
@@ -267,11 +273,12 @@ class InternVisionEncoderLayer(nn.Module):
267
  super().__init__()
268
  self.embed_dim = config.hidden_size
269
  self.intermediate_size = config.intermediate_size
 
270
 
271
  self.attn = InternAttention(config)
272
  self.mlp = InternMLP(config)
273
- self.norm1 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
274
- self.norm2 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
275
 
276
  self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
277
  self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
 
26
  except: # v2
27
  from flash_attn.flash_attn_interface import \
28
  flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
29
+
30
  from flash_attn.bert_padding import pad_input, unpad_input
31
+
32
  has_flash_attn = True
33
  except:
34
  print('FlashAttention is not installed.')
 
47
  attention_dropout: The dropout rate to apply to the attention
48
  (default: 0.0)
49
  """
50
+
51
  def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
52
  super().__init__()
53
  self.softmax_scale = softmax_scale
54
  self.dropout_p = attention_dropout
55
+
56
  def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
57
  max_s=None, need_weights=False):
58
  """Implements the multihead softmax attention.
 
65
  assert not need_weights
66
  assert qkv.dtype in [torch.float16, torch.bfloat16]
67
  assert qkv.is_cuda
68
+
69
  if cu_seqlens is None:
70
  batch_size = qkv.shape[0]
71
  seqlen = qkv.shape[1]
 
97
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
98
  softmax_scale=self.softmax_scale, causal=causal
99
  )
100
+
101
  return output, None
102
 
103
 
 
129
  pass
130
 
131
 
132
+ NORM2FN = {
133
+ 'rms_norm': InternRMSNorm,
134
+ 'layer_norm': nn.LayerNorm,
135
+ }
136
+
137
+
138
  class InternVisionEmbeddings(nn.Module):
139
  def __init__(self, config: InternVisionConfig):
140
  super().__init__()
 
160
  target_dtype = pos_embed.dtype
161
  pos_embed = pos_embed.float().reshape(
162
  1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
163
+ pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
164
  reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
165
  return pos_embed
166
 
 
273
  super().__init__()
274
  self.embed_dim = config.hidden_size
275
  self.intermediate_size = config.intermediate_size
276
+ self.norm_type = config.norm_type
277
 
278
  self.attn = InternAttention(config)
279
  self.mlp = InternMLP(config)
280
+ self.norm1 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
281
+ self.norm2 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
282
 
283
  self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
284
  self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
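Reviewer note: this file now selects the normalization layer by name through the `NORM2FN` dictionary, keyed by the new `norm_type` config field (default `'rms_norm'`). The sketch below shows the same name-to-callable dispatch in plain Python, together with the arithmetic difference between the two options. The functions here are stand-ins with no learnable weights, and the names `NORM_FUNCTIONS` and `apply_norm` are assumptions made for illustration; the real code uses `InternRMSNorm` and `nn.LayerNorm`.

```python
import math
from typing import Callable, Dict, List


def rms_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Divide by the root mean square of x; the mean is not subtracted."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def layer_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * scale for v in x]


# Same idea as NORM2FN in the diff: a config string selects the implementation.
NORM_FUNCTIONS: Dict[str, Callable[..., List[float]]] = {
    "rms_norm": rms_norm,
    "layer_norm": layer_norm,
}


def apply_norm(norm_type: str, x: List[float], eps: float = 1e-6) -> List[float]:
    """Look up the normalization by name and apply it."""
    if norm_type not in NORM_FUNCTIONS:
        raise ValueError(
            f"unknown norm_type {norm_type!r}; expected one of {sorted(NORM_FUNCTIONS)}"
        )
    return NORM_FUNCTIONS[norm_type](x, eps=eps)


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0]
    print("rms_norm:  ", apply_norm("rms_norm", sample))
    print("layer_norm:", apply_norm("layer_norm", sample))
```

One difference from the diff: a plain `NORM2FN[self.norm_type]` lookup raises `KeyError` on an unrecognized name, whereas this sketch raises a `ValueError` that lists the accepted values.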
modeling_internlm2.py CHANGED
@@ -48,6 +48,18 @@ _CONFIG_FOR_DOC = 'InternLM2Config'
48
 
49
  flash_attn_func, flash_attn_varlen_func = None, None
50
  pad_input, index_first_axis, unpad_input = None, None, None
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
 
53
  def _import_flash_attn():
@@ -149,7 +161,7 @@ class InternLM2RotaryEmbedding(nn.Module):
149
 
150
  def _set_cos_sin_cache(self, seq_len, device, dtype):
151
  self.max_seq_len_cached = seq_len
152
- t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
153
 
154
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
155
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -178,7 +190,7 @@ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
178
 
179
  def _set_cos_sin_cache(self, seq_len, device, dtype):
180
  self.max_seq_len_cached = seq_len
181
- t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
182
  t = t / self.scaling_factor
183
 
184
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
@@ -208,7 +220,7 @@ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
208
  inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
209
  self.register_buffer('inv_freq', inv_freq, persistent=False)
210
 
211
- t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
212
 
213
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
214
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -795,6 +807,9 @@ class InternLM2Model(InternLM2PreTrainedModel):
795
  self.padding_idx = config.pad_token_id
796
  self.vocab_size = config.vocab_size
797
  self.config = config
 
 
 
798
 
799
  self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
800
 
 
48
 
49
  flash_attn_func, flash_attn_varlen_func = None, None
50
  pad_input, index_first_axis, unpad_input = None, None, None
51
+ try:
52
+ from flash_attn import flash_attn_func as _flash_attn_func
53
+ from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
54
+ from flash_attn.bert_padding import index_first_axis as _index_first_axis
55
+ from flash_attn.bert_padding import pad_input as _pad_input
56
+ from flash_attn.bert_padding import unpad_input as _unpad_input
57
+
58
+ flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
59
+ pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
60
+ has_flash_attn = True
61
+ except:
62
+ has_flash_attn = False
63
 
64
 
65
  def _import_flash_attn():
 
161
 
162
  def _set_cos_sin_cache(self, seq_len, device, dtype):
163
  self.max_seq_len_cached = seq_len
164
+ t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
165
 
166
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
167
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
 
190
 
191
  def _set_cos_sin_cache(self, seq_len, device, dtype):
192
  self.max_seq_len_cached = seq_len
193
+ t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
194
  t = t / self.scaling_factor
195
 
196
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
 
220
  inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
221
  self.register_buffer('inv_freq', inv_freq, persistent=False)
222
 
223
+ t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
224
 
225
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
226
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
 
807
  self.padding_idx = config.pad_token_id
808
  self.vocab_size = config.vocab_size
809
  self.config = config
810
+ if not has_flash_attn:
811
+ self.config.attn_implementation = 'eager'
812
+ print('Warning: Flash attention is not available, using eager attention instead.')
813
 
814
  self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
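Reviewer note: the hunks above try to import `flash_attn` at module load, record the result in `has_flash_attn`, and switch `config.attn_implementation` to `'eager'` when the import failed. The sketch below reproduces that pattern with the standard library only. `try_import` and `resolve_attn_implementation` are hypothetical helpers, and the string `"flash_attention_2"` is an assumed placeholder for whatever value the config requested.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def try_import(module_name: str) -> Tuple[Optional[ModuleType], bool]:
    """Import a module if possible; return (module, True) or (None, False).

    Catching Exception rather than only ImportError is deliberate here:
    a compiled extension can also fail at import time for other reasons.
    """
    try:
        return importlib.import_module(module_name), True
    except Exception:
        return None, False


def resolve_attn_implementation(requested: str, has_flash_attn: bool) -> str:
    """Mirror the fallback in the diff: without flash attention, use 'eager'."""
    if not has_flash_attn:
        return "eager"
    return requested


if __name__ == "__main__":
    _, has_flash_attn = try_import("flash_attn")
    chosen = resolve_attn_implementation("flash_attention_2", has_flash_attn)
    print(f"flash_attn available: {has_flash_attn}; attention implementation: {chosen}")
```

The diff uses a bare `except:`, which also swallows `KeyboardInterrupt` and `SystemExit`; `except Exception:` keeps the same fallback behavior for import failures without hiding those.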
815
 
modeling_internvl_chat.py CHANGED
@@ -233,7 +233,7 @@ class InternVLChatModel(PreTrainedModel):
233
  return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
234
  IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
235
  if history is not None or return_history:
236
- print("Now multi-turn chat is not supported in batch_chat.")
237
  raise NotImplementedError
238
  img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
239
  self.img_context_token_id = img_context_token_id
@@ -241,12 +241,12 @@ class InternVLChatModel(PreTrainedModel):
241
  eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>') # 92542, InternLM2
242
  else:
243
  eos_token_id = tokenizer.eos_token_id
244
-
245
  from .conversation import get_conv_template
246
-
247
  queries = []
248
  image_bs = pixel_values.shape[0]
249
- print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
250
  for idx, image_count in enumerate(image_counts):
251
  image_token = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_count + IMG_END_TOKEN
252
  question = image_token + '\n' + questions[idx]
@@ -260,7 +260,7 @@ class InternVLChatModel(PreTrainedModel):
260
  input_ids = model_inputs['input_ids'].cuda()
261
  attention_mask = model_inputs['attention_mask'].cuda()
262
  generation_config['eos_token_id'] = eos_token_id
263
-
264
  generation_output = self.generate(
265
  pixel_values=pixel_values,
266
  input_ids=input_ids,
 
233
  return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
234
  IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
235
  if history is not None or return_history:
236
+ print('Now multi-turn chat is not supported in batch_chat.')
237
  raise NotImplementedError
238
  img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
239
  self.img_context_token_id = img_context_token_id
 
241
  eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>') # 92542, InternLM2
242
  else:
243
  eos_token_id = tokenizer.eos_token_id
244
+
245
  from .conversation import get_conv_template
246
+
247
  queries = []
248
  image_bs = pixel_values.shape[0]
249
+ # print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
250
  for idx, image_count in enumerate(image_counts):
251
  image_token = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_count + IMG_END_TOKEN
252
  question = image_token + '\n' + questions[idx]
 
260
  input_ids = model_inputs['input_ids'].cuda()
261
  attention_mask = model_inputs['attention_mask'].cuda()
262
  generation_config['eos_token_id'] = eos_token_id
263
+
264
  generation_output = self.generate(
265
  pixel_values=pixel_values,
266
  input_ids=input_ids,
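Reviewer note: `batch_chat` builds each query by wrapping a run of `<IMG_CONTEXT>` placeholders in `<img>` and `</img>`, repeating the placeholder `num_image_token * image_count` times, and prepending the result to the question. The sketch below isolates that string construction. `build_image_question` is a hypothetical name, and the per-tile token count is derived under the assumption of a ViT patch size of 14, which does not appear in this diff (the 448 tile size and `downsample_ratio` of 0.5 do).

```python
IMG_START_TOKEN = "<img>"
IMG_END_TOKEN = "</img>"
IMG_CONTEXT_TOKEN = "<IMG_CONTEXT>"

# Assumption: patch size 14 (not shown in this diff). With 448-pixel tiles and
# downsample_ratio 0.5, that gives (448 // 14) ** 2 * 0.5 ** 2 = 256 tokens per tile.
ASSUMED_TOKENS_PER_TILE = int((448 // 14) ** 2 * 0.5 ** 2)


def build_image_question(question: str, image_count: int, num_image_token: int) -> str:
    """Prefix a question with the image placeholder block, as batch_chat does."""
    image_token = (
        IMG_START_TOKEN
        + IMG_CONTEXT_TOKEN * num_image_token * image_count
        + IMG_END_TOKEN
    )
    return image_token + "\n" + question


if __name__ == "__main__":
    prompt = build_image_question("Describe the image.", image_count=2, num_image_token=3)
    print(prompt)
    print("placeholders per tile (assumed):", ASSUMED_TOKENS_PER_TILE)
```

Here `image_count` is the number of tiles belonging to one sample, so a sample split into more tiles gets a proportionally longer placeholder run.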
preprocessor_config.json CHANGED
@@ -16,4 +16,4 @@
16
  ],
17
  "resample": 3,
18
  "size": 448
19
- }
 
16
  ],
17
  "resample": 3,
18
  "size": 448
19
+ }