Upload folder using huggingface_hub
- README.md +33 -23
- config.json +2 -1
- configuration_intern_vit.py +2 -0
- conversation.py +0 -1
- modeling_intern_vit.py +16 -9
- modeling_internlm2.py +18 -3
- modeling_internvl_chat.py +4 -4
- preprocessor_config.json +1 -1
- runs/Apr15_16-44-40_SH-IDC1-10-140-37-13/index.html +15 -0
- runs/Apr15_17-33-22_SH-IDC1-10-140-37-13/index.html +15 -0
- runs/Apr15_22-00-14_SH-IDC1-10-140-37-13/index.html +15 -0
- runs/index.html +17 -0
README.md
CHANGED
@@ -1,51 +1,56 @@
 ---
 license: mit
 datasets:
-- laion/laion2B-en
-- laion/laion-coco
-- laion/laion2B-multi
-- kakaobrain/coyo-700m
-- conceptual_captions
-- wanng/wukong100m
+- laion/laion2B-en
+- laion/laion-coco
+- laion/laion2B-multi
+- kakaobrain/coyo-700m
+- conceptual_captions
+- wanng/wukong100m
 pipeline_tag: visual-question-answering
 ---
 
-# Model Card for InternVL-Chat-V1
+# Model Card for InternVL-Chat-V1-5
+
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
 </p>
 
 > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
 
-\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
+\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]
 
 We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
-We introduce three simple designs:
-1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities, so that it can be transferred and reused in different LLMs.
-2. Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448 × 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K-resolution input.
-3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
+We introduce three simple designs:
 
+1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities, so that it can be transferred and reused in different LLMs.
+2. Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448 × 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K-resolution input.
+3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
 
 ## Model Details
+
 - **Model Type:** multimodal large language model (MLLM)
+
 - **Model Stats:**
+
   - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
   - Image size: dynamic resolution, up to 40 tiles of 448 x 448 (4K resolution).
   - Params: 25.5B
 
 - **Training Strategy:**
+
   - Learnable component in the pretraining stage: ViT + MLP
   - Learnable component in the finetuning stage: ViT + MLP + LLM
   - For more details on training hyperparameters, take a look at our code: [pretrain](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_pretrain.sh) | [finetune](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_finetune.sh)
-
+
 ## Released Models
 
-| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2))
+| Model | Vision Foundation Model | Release Date | Note |
+| :---: | :---: | :---: | :--- |
+| InternVL-Chat-V1.5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
+| InternVL-Chat-V1.2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and a stronger model |
+| InternVL-Chat-V1.2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scaling up the LLM to 34B |
+| InternVL-Chat-V1.1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese and stronger OCR |
 
 ## Architecture
 
@@ -70,7 +75,7 @@ We introduce three simple designs:
 
 We provide an example code to run InternVL-Chat-V1.5 using `transformers`.
 
-You also
+You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
 
 > Please use transformers==4.37.2 to ensure the model works normally.
 
@@ -161,7 +166,6 @@ def load_image(image_file, input_size=448, max_num=6):
     pixel_values = torch.stack(pixel_values)
     return pixel_values
 
-
 path = "OpenGVLab/InternVL-Chat-V1-5"
 # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
 model = AutoModel.from_pretrained(
@@ -243,12 +247,18 @@ If you find this project useful in your research, please consider citing:
   journal={arXiv preprint arXiv:2312.14238},
   year={2023}
 }
+@article{chen2024far,
+  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
+  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+  journal={arXiv preprint arXiv:2404.16821},
+  year={2024}
+}
 ```
 
 ## License
 
-This project is released under the MIT license.
+This project is released under the MIT license.
 
 ## Acknowledgement
 
-InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
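The dynamic high-resolution scheme described in the model card above splits each input into between 1 and 40 tiles of 448 × 448 pixels, chosen from the aspect ratio and resolution of the image. The sketch below shows one way such a tile grid could be selected; the helper name `pick_tile_grid` and the closest-aspect-ratio rule are illustrative assumptions, not necessarily the repository's own preprocessing code.

```python
from math import inf


def pick_tile_grid(width: int, height: int, tile: int = 448,
                   min_tiles: int = 1, max_tiles: int = 40) -> tuple[int, int]:
    """Illustrative: choose a (cols, rows) grid whose aspect ratio is closest
    to the input image, with cols * rows between min_tiles and max_tiles."""
    aspect = width / height
    best, best_diff = (1, 1), inf
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles + 1):
            n = rows * cols
            if not (min_tiles <= n <= max_tiles):
                continue
            diff = abs(aspect - cols / rows)
            # Prefer the closest aspect ratio; break ties with more tiles.
            if diff < best_diff or (diff == best_diff and n > best[0] * best[1]):
                best, best_diff = (cols, rows), diff
    return best


# A 4032 x 3024 photo would be resized to cols*448 x rows*448 and cut into tiles.
print(pick_tile_grid(4032, 3024))
```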
config.json
CHANGED
@@ -6,7 +6,8 @@
   ],
   "auto_map": {
     "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
-    "AutoModel": "modeling_internvl_chat.InternVLChatModel"
+    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
+    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
   },
   "downsample_ratio": 0.5,
   "dynamic_image_size": true,
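With the extra `AutoModelForCausalLM` entry in `auto_map`, the checkpoint resolves to `InternVLChatModel` through either auto class when remote code is trusted. A minimal loading sketch follows; the dtype and memory flags are common choices for a 25.5B model, not values mandated by this config.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

path = 'OpenGVLab/InternVL-Chat-V1-5'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Both auto classes now map to modeling_internvl_chat.InternVLChatModel.
model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval()

# Equivalent route via the original mapping:
# model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
#                                   low_cpu_mem_usage=True, trust_remote_code=True).eval()
```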
configuration_intern_vit.py
CHANGED
@@ -73,6 +73,7 @@ class InternVisionConfig(PretrainedConfig):
             num_hidden_layers=48,
             use_flash_attn=True,
             hidden_act='gelu',
+            norm_type='rms_norm',
             layer_norm_eps=1e-6,
             dropout=0.0,
             drop_path_rate=0.0,
@@ -97,6 +98,7 @@ class InternVisionConfig(PretrainedConfig):
         self.attention_dropout = attention_dropout
         self.layer_norm_eps = layer_norm_eps
         self.hidden_act = hidden_act
+        self.norm_type = norm_type
         self.qkv_bias = qkv_bias
         self.qk_normalization = qk_normalization
         self.use_flash_attn = use_flash_attn
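The new `norm_type` field defaults to `'rms_norm'` and is stored on the config, so the vision encoder's normalization layer can be switched without touching the modeling code. A hedged sketch of setting the option when building the config, assuming `configuration_intern_vit.py` is importable (for example when the repo is loaded as remote code); the keyword values are illustrative.

```python
# Assumption: the repository file configuration_intern_vit.py is on the import path.
from configuration_intern_vit import InternVisionConfig

rms_cfg = InternVisionConfig()                       # norm_type='rms_norm' (default)
ln_cfg = InternVisionConfig(norm_type='layer_norm')  # opt into nn.LayerNorm instead
print(rms_cfg.norm_type, ln_cfg.norm_type)
```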
conversation.py
CHANGED
@@ -1258,4 +1258,3 @@ register_conv_template(
         sep2='</s>',
     )
 )
-
modeling_intern_vit.py
CHANGED
@@ -26,9 +26,9 @@ try:
 except:  # v2
     from flash_attn.flash_attn_interface import \
         flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
-
+
     from flash_attn.bert_padding import pad_input, unpad_input
-
+
     has_flash_attn = True
 except:
     print('FlashAttention is not installed.')
@@ -47,12 +47,12 @@ class FlashAttention(nn.Module):
         attention_dropout: The dropout rate to apply to the attention
                            (default: 0.0)
     """
-
+
     def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
         super().__init__()
         self.softmax_scale = softmax_scale
         self.dropout_p = attention_dropout
-
+
     def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
                 max_s=None, need_weights=False):
         """Implements the multihead softmax attention.
@@ -65,7 +65,7 @@ class FlashAttention(nn.Module):
         assert not need_weights
         assert qkv.dtype in [torch.float16, torch.bfloat16]
         assert qkv.is_cuda
-
+
         if cu_seqlens is None:
             batch_size = qkv.shape[0]
             seqlen = qkv.shape[1]
@@ -97,7 +97,7 @@ class FlashAttention(nn.Module):
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
-
+
         return output, None
 
 
@@ -129,6 +129,12 @@ except Exception:
     pass
 
 
+NORM2FN = {
+    'rms_norm': InternRMSNorm,
+    'layer_norm': nn.LayerNorm,
+}
+
+
 class InternVisionEmbeddings(nn.Module):
     def __init__(self, config: InternVisionConfig):
         super().__init__()
@@ -154,7 +160,7 @@ class InternVisionEmbeddings(nn.Module):
         target_dtype = pos_embed.dtype
         pos_embed = pos_embed.float().reshape(
             1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
-        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False)
+        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
             reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
         return pos_embed
 
@@ -267,11 +273,12 @@ class InternVisionEncoderLayer(nn.Module):
         super().__init__()
         self.embed_dim = config.hidden_size
         self.intermediate_size = config.intermediate_size
+        self.norm_type = config.norm_type
 
         self.attn = InternAttention(config)
         self.mlp = InternMLP(config)
-        self.norm1 =
-        self.norm2 =
+        self.norm1 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
+        self.norm2 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
 
         self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
         self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
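The `NORM2FN` table added above lets each `InternVisionEncoderLayer` pick its normalization class from `config.norm_type`. The standalone sketch below mirrors that dispatch; `SimpleRMSNorm` is a simplified stand-in for the module's `InternRMSNorm`, and the hidden size of 3200 is only an illustrative value.

```python
import torch
import torch.nn as nn


class SimpleRMSNorm(nn.Module):
    """Simplified stand-in for InternRMSNorm (LLaMA-style RMS normalization)."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        variance = x.float().pow(2).mean(-1, keepdim=True)
        return self.weight * (x.float() * torch.rsqrt(variance + self.eps)).to(x.dtype)


# Same shape as the NORM2FN dict in the diff: a string key selects the norm class.
NORM2FN = {'rms_norm': SimpleRMSNorm, 'layer_norm': nn.LayerNorm}


def build_norm(norm_type: str, embed_dim: int, eps: float = 1e-6) -> nn.Module:
    return NORM2FN[norm_type](embed_dim, eps=eps)


x = torch.randn(2, 16, 3200)  # (batch, tokens, hidden); 3200 is illustrative
print(build_norm('rms_norm', 3200)(x).shape)
print(build_norm('layer_norm', 3200)(x).shape)
```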
modeling_internlm2.py
CHANGED
@@ -48,6 +48,18 @@ _CONFIG_FOR_DOC = 'InternLM2Config'
 
 flash_attn_func, flash_attn_varlen_func = None, None
 pad_input, index_first_axis, unpad_input = None, None, None
+try:
+    from flash_attn import flash_attn_func as _flash_attn_func
+    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
+    from flash_attn.bert_padding import index_first_axis as _index_first_axis
+    from flash_attn.bert_padding import pad_input as _pad_input
+    from flash_attn.bert_padding import unpad_input as _unpad_input
+
+    flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
+    pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
+    has_flash_attn = True
+except:
+    has_flash_attn = False
 
 
 def _import_flash_attn():
@@ -149,7 +161,7 @@ class InternLM2RotaryEmbedding(nn.Module):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -178,7 +190,7 @@ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
         t = t / self.scaling_factor
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
@@ -208,7 +220,7 @@ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
             self.register_buffer('inv_freq', inv_freq, persistent=False)
 
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -795,6 +807,9 @@ class InternLM2Model(InternLM2PreTrainedModel):
         self.padding_idx = config.pad_token_id
         self.vocab_size = config.vocab_size
         self.config = config
+        if not has_flash_attn:
+            self.config.attn_implementation = 'eager'
+            print('Warning: Flash attention is not available, using eager attention instead.')
 
         self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
 
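The guarded import added above makes flash-attn optional: when the package is missing, `has_flash_attn` stays False and `InternLM2Model` downgrades `config.attn_implementation` to `'eager'` instead of failing at import time. A minimal sketch of the same pattern; the helper name `resolve_attn_implementation` is illustrative and not part of the repository.

```python
has_flash_attn = False
try:
    # Import succeeds only when the optional flash-attn package is installed.
    from flash_attn import flash_attn_func  # noqa: F401
    has_flash_attn = True
except ImportError:
    pass


def resolve_attn_implementation(requested: str = 'flash_attention_2') -> str:
    """Downgrade gracefully to eager attention when flash-attn is unavailable."""
    if requested == 'flash_attention_2' and not has_flash_attn:
        print('Warning: Flash attention is not available, using eager attention instead.')
        return 'eager'
    return requested


print(resolve_attn_implementation())
```

The rotary-embedding hunks make a separate small change: the position index `t` is now cast to `self.inv_freq.dtype` with `.to(...)` after `torch.arange`, rather than being created in that dtype directly.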
modeling_internvl_chat.py
CHANGED
@@ -233,7 +233,7 @@ class InternVLChatModel(PreTrainedModel):
                    return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
                    IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
         if history is not None or return_history:
-            print(
+            print('Now multi-turn chat is not supported in batch_chat.')
             raise NotImplementedError
         img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
         self.img_context_token_id = img_context_token_id
@@ -241,9 +241,9 @@ class InternVLChatModel(PreTrainedModel):
             eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
         else:
             eos_token_id = tokenizer.eos_token_id
-
+
         from .conversation import get_conv_template
-
+
         queries = []
         image_bs = pixel_values.shape[0]
         # print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
@@ -260,7 +260,7 @@ class InternVLChatModel(PreTrainedModel):
         input_ids = model_inputs['input_ids'].cuda()
         attention_mask = model_inputs['attention_mask'].cuda()
         generation_config['eos_token_id'] = eos_token_id
-
+
         generation_output = self.generate(
             pixel_values=pixel_values,
             input_ids=input_ids,
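The `batch_chat` change above turns the multi-turn path into an explicit failure: a message is printed and `NotImplementedError` is raised whenever `history` or `return_history` is passed. A small standalone sketch of the same guard (the function name is illustrative); attaching the message to the exception itself would be an equally valid variant.

```python
def check_batch_chat_args(history=None, return_history=False) -> None:
    """Reject multi-turn arguments up front, mirroring the guard in batch_chat()."""
    if history is not None or return_history:
        print('Now multi-turn chat is not supported in batch_chat.')
        raise NotImplementedError


check_batch_chat_args()               # fine: single-turn batch call
# check_batch_chat_args(history=[])   # would print the message and raise
```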
preprocessor_config.json
CHANGED
@@ -16,4 +16,4 @@
   ],
   "resample": 3,
   "size": 448
-}
+}
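In the preprocessor config, `"size": 448` matches the 448 × 448 tile size used throughout the card, and `"resample": 3` is the PIL code for bicubic resampling. A hedged sketch of an equivalent single-tile transform; the normalization statistics are the standard ImageNet values and are an assumption here, not read from this file.

```python
from PIL import Image
import torchvision.transforms as T

IMAGENET_MEAN = (0.485, 0.456, 0.406)  # assumed ImageNet statistics
IMAGENET_STD = (0.229, 0.224, 0.225)

transform = T.Compose([
    T.Lambda(lambda img: img.convert('RGB')),
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),  # resample=3
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

pixel_values = transform(Image.open('example.jpg')).unsqueeze(0)  # (1, 3, 448, 448)
```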
runs/Apr15_16-44-40_SH-IDC1-10-140-37-13/index.html
ADDED
@@ -0,0 +1,15 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+<title>Directory listing for /InternVL-Chat-V1-5/runs/Apr15_16-44-40_SH-IDC1-10-140-37-13/</title>
+</head>
+<body>
+<h1>Directory listing for /InternVL-Chat-V1-5/runs/Apr15_16-44-40_SH-IDC1-10-140-37-13/</h1>
+<hr>
+<ul>
+<li><a href="events.out.tfevents.1713171220.SH-IDC1-10-140-37-13.204150.0">events.out.tfevents.1713171220.SH-IDC1-10-140-37-13.204150.0</a></li>
+</ul>
+<hr>
+</body>
+</html>
runs/Apr15_17-33-22_SH-IDC1-10-140-37-13/index.html
ADDED
@@ -0,0 +1,15 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+<title>Directory listing for /InternVL-Chat-V1-5/runs/Apr15_17-33-22_SH-IDC1-10-140-37-13/</title>
+</head>
+<body>
+<h1>Directory listing for /InternVL-Chat-V1-5/runs/Apr15_17-33-22_SH-IDC1-10-140-37-13/</h1>
+<hr>
+<ul>
+<li><a href="events.out.tfevents.1713174123.SH-IDC1-10-140-37-13.259480.0">events.out.tfevents.1713174123.SH-IDC1-10-140-37-13.259480.0</a></li>
+</ul>
+<hr>
+</body>
+</html>
runs/Apr15_22-00-14_SH-IDC1-10-140-37-13/index.html
ADDED
@@ -0,0 +1,15 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+<title>Directory listing for /InternVL-Chat-V1-5/runs/Apr15_22-00-14_SH-IDC1-10-140-37-13/</title>
+</head>
+<body>
+<h1>Directory listing for /InternVL-Chat-V1-5/runs/Apr15_22-00-14_SH-IDC1-10-140-37-13/</h1>
+<hr>
+<ul>
+<li><a href="events.out.tfevents.1713190241.SH-IDC1-10-140-37-13.10620.0">events.out.tfevents.1713190241.SH-IDC1-10-140-37-13.10620.0</a></li>
+</ul>
+<hr>
+</body>
+</html>
runs/index.html
ADDED
@@ -0,0 +1,17 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
+<html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+<title>Directory listing for /InternVL-Chat-V1-5/runs/</title>
+</head>
+<body>
+<h1>Directory listing for /InternVL-Chat-V1-5/runs/</h1>
+<hr>
+<ul>
+<li><a href="Apr15_16-44-40_SH-IDC1-10-140-37-13/">Apr15_16-44-40_SH-IDC1-10-140-37-13/</a></li>
+<li><a href="Apr15_17-33-22_SH-IDC1-10-140-37-13/">Apr15_17-33-22_SH-IDC1-10-140-37-13/</a></li>
+<li><a href="Apr15_22-00-14_SH-IDC1-10-140-37-13/">Apr15_22-00-14_SH-IDC1-10-140-37-13/</a></li>
+</ul>
+<hr>
+</body>
+</html>