Upload folder using huggingface_hub
- README.md +29 -22
- config.json +10 -5
- conversation.py +0 -1
- modeling_intern_vit.py +7 -7
- modeling_internlm2.py +10 -10
- modeling_internvl_chat.py +4 -4
- preprocessor_config.json +1 -1
README.md
CHANGED
@@ -1,55 +1,57 @@
 ---
 license: mit
 datasets:
 - laion/laion2B-en
 - laion/laion-coco
 - laion/laion2B-multi
 - kakaobrain/coyo-700m
 - conceptual_captions
 - wanng/wukong100m
 pipeline_tag: visual-question-answering
 ---
 
 # Model Card for Mini-InternVL-Chat-2B-V1-5
+
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
 </p>
 
 > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
 
-\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
+\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]
 
 You can run multimodal large models using a 1080Ti now.
 
 We are delighted to introduce the Mini-InternVL-Chat series. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) down to 300M and used [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b) or [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) as our language model. This resulted in a small multimodal model with excellent performance.
 
-
 As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B / Phi-3-mini-128k-instruct. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
 
-
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/rDyoe66Sqev44T0wsP5Z7.png)
 
-
 ## Model Details
+
 - **Model Type:** multimodal large language model (MLLM)
+
 - **Model Stats:**
+
   - Architecture: [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
   - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
   - Params: 2.2B
 
 - **Training Strategy:**
+
   - Learnable component in the pretraining stage: ViT + MLP
   - Learnable component in the finetuning stage: ViT + MLP + LLM
-  - For more details on training hyperparameters, take a look at our code: [pretrain]() | [finetune]()
+  - For more details on training hyperparameters, take a look at our code: [pretrain](<>) | [finetune](<>)
 
 ## Released Models
 
-| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2))
+| Model | Vision Foundation Model | Release Date | Note |
+| :---: | :---: | :---: | :--- |
+| InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
+| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger |
+| InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scaling up LLM to 34B |
+| InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | support Chinese and stronger OCR |
 
 ## Performance
 
@@ -59,7 +61,7 @@ As shown in the figure below, we adopted the same model architecture as InternVL
 
 We provide an example code to run Mini-InternVL-Chat-2B-V1.5 using `transformers`.
 
-You can also use our [online demo](https://internvl.opengvlab.com/)
+You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
 
 > Please use transformers==4.37.2 to ensure the model works normally.
 
@@ -150,7 +152,6 @@ def load_image(image_file, input_size=448, max_num=6):
     pixel_values = torch.stack(pixel_values)
     return pixel_values
 
-
 path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"
 model = AutoModel.from_pretrained(
     path,
@@ -222,12 +223,18 @@ If you find this project useful in your research, please consider citing:
   journal={arXiv preprint arXiv:2312.14238},
   year={2023}
 }
+@article{chen2024far,
+  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
+  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+  journal={arXiv preprint arXiv:2404.16821},
+  year={2024}
+}
 ```
 
 ## License
 
 This project is released under the MIT license.
 
 ## Acknowledgement
 
 InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
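The updated README loads the checkpoint through `AutoModel.from_pretrained` with the custom code shipped in this repo. As a quick orientation, here is a minimal sketch of that flow; the single-tile `load_one_tile` helper, the example image path, the generation settings, and the `model.chat` call follow the README's usage section rather than anything changed in this commit, so treat them as illustrative assumptions.

```python
# Minimal sketch (assumes transformers==4.37.2, a CUDA GPU, and some local image file).
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

def load_one_tile(image_file, input_size=448):
    # Simplified stand-in for the README's dynamic-resolution `load_image`:
    # a single 448x448 tile, normalized with the ImageNet statistics used by InternViT.
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB')),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    return transform(Image.open(image_file)).unsqueeze(0)

path = 'OpenGVLab/Mini-InternVL-Chat-2B-V1-5'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # executes the modeling_*.py files changed in this commit
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

pixel_values = load_one_tile('./example.jpg').to(torch.bfloat16).cuda()  # hypothetical image path
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, 'Please describe the image.', generation_config)
print(response)
```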
config.json
CHANGED
@@ -6,13 +6,14 @@
   ],
   "auto_map": {
     "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
-    "AutoModel": "modeling_internvl_chat.InternVLChatModel"
+    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
+    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
   },
   "downsample_ratio": 0.5,
   "dynamic_image_size": true,
   "force_image_size": 448,
   "llm_config": {
-    "_name_or_path": "
+    "_name_or_path": "pretrained/internlm2-chat-1_8b",
     "add_cross_attention": false,
     "architectures": [
       "InternLM2ForCausalLM"
@@ -113,12 +114,16 @@
   "use_llm_lora": 0,
   "use_thumbnail": true,
   "vision_config": {
-    "_name_or_path": "",
+    "_name_or_path": "OpenGVLab/InternViT-300M-448px",
     "add_cross_attention": false,
     "architectures": [
       "InternVisionModel"
     ],
     "attention_dropout": 0.0,
+    "auto_map": {
+      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
+      "AutoModel": "modeling_intern_vit.InternVisionModel"
+    },
     "bad_words_ids": null,
     "begin_suppress_tokens": null,
     "bos_token_id": null,
@@ -189,11 +194,11 @@
     "tokenizer_class": null,
     "top_k": 50,
     "top_p": 1.0,
-    "torch_dtype": "
+    "torch_dtype": "bfloat16",
     "torchscript": false,
     "transformers_version": "4.36.2",
     "typical_p": 1.0,
-    "use_bfloat16": 
+    "use_bfloat16": true,
     "use_flash_attn": true
   }
 }
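The expanded `auto_map` is the functional core of this config change: `AutoModelForCausalLM` now resolves to the same custom `InternVLChatModel`, and the vision sub-config points at the `InternVisionConfig`/`InternVisionModel` classes shipped alongside it. A short sketch of what that enables (nothing below is defined by the commit itself; it only exercises the mapping):

```python
# Sketch: both Auto entry points resolve to the custom class registered in `auto_map`.
# trust_remote_code=True is required so transformers imports modeling_internvl_chat.py from the repo.
from transformers import AutoConfig, AutoModel, AutoModelForCausalLM

path = 'OpenGVLab/Mini-InternVL-Chat-2B-V1-5'
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
print(type(config).__name__)  # InternVLChatConfig, via the existing AutoConfig mapping

model_a = AutoModel.from_pretrained(path, trust_remote_code=True)
model_b = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)  # newly mapped in this commit
assert type(model_a).__name__ == type(model_b).__name__ == 'InternVLChatModel'
```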
conversation.py
CHANGED
@@ -1258,4 +1258,3 @@ register_conv_template(
         sep2='</s>',
     )
 )
-
modeling_intern_vit.py
CHANGED
@@ -26,9 +26,9 @@ try:
     except:  # v2
         from flash_attn.flash_attn_interface import \
             flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
 
     from flash_attn.bert_padding import pad_input, unpad_input
 
     has_flash_attn = True
 except:
     print('FlashAttention is not installed.')
@@ -47,12 +47,12 @@ class FlashAttention(nn.Module):
         attention_dropout: The dropout rate to apply to the attention
                            (default: 0.0)
     """
 
     def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
         super().__init__()
         self.softmax_scale = softmax_scale
         self.dropout_p = attention_dropout
 
     def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
                 max_s=None, need_weights=False):
         """Implements the multihead softmax attention.
@@ -65,7 +65,7 @@ class FlashAttention(nn.Module):
         assert not need_weights
         assert qkv.dtype in [torch.float16, torch.bfloat16]
         assert qkv.is_cuda
 
         if cu_seqlens is None:
             batch_size = qkv.shape[0]
             seqlen = qkv.shape[1]
@@ -97,7 +97,7 @@ class FlashAttention(nn.Module):
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
 
         return output, None
 
 
@@ -160,7 +160,7 @@ class InternVisionEmbeddings(nn.Module):
         target_dtype = pos_embed.dtype
         pos_embed = pos_embed.float().reshape(
             1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
-        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False)
+        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
             reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
         return pos_embed
 
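The substantive change in this file restores the line continuation in the position-embedding path, which resamples the ViT position embeddings to the current feature-map size before they are added to the patch embeddings. Below is a free-standing sketch of that computation; the function name, grid sizes, and channel count are illustrative, not part of the module's API.

```python
# Standalone sketch of the interpolation the fixed line performs (illustrative names and shapes).
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_h, new_w):
    # pos_embed: (1, old_grid*old_grid, C) patch position embeddings (the CLS token is handled separately).
    target_dtype = pos_embed.dtype
    # (1, N, C) -> (1, C, old_grid, old_grid) so F.interpolate can resample spatially.
    pos_embed = pos_embed.float().reshape(1, old_grid, old_grid, -1).permute(0, 3, 1, 2)
    # Bicubic resampling to the new grid, then back to (1, new_h*new_w, C) in the original dtype,
    # chained the same way the patched line does via a backslash continuation.
    pos_embed = F.interpolate(pos_embed, size=(new_h, new_w), mode='bicubic', align_corners=False). \
        reshape(1, -1, new_h * new_w).permute(0, 2, 1).to(target_dtype)
    return pos_embed

# 448px / 14px patches gives a 32x32 grid; resample it for a hypothetical 64x64 feature map.
print(resize_pos_embed(torch.randn(1, 32 * 32, 1024), 32, 64, 64).shape)  # torch.Size([1, 4096, 1024])
```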
modeling_internlm2.py
CHANGED
@@ -48,16 +48,13 @@ _CONFIG_FOR_DOC = 'InternLM2Config'
 
 flash_attn_func, flash_attn_varlen_func = None, None
 pad_input, index_first_axis, unpad_input = None, None, None
-
 try:
     from flash_attn import flash_attn_func as _flash_attn_func
-    from flash_attn import \
-        flash_attn_varlen_func as _flash_attn_varlen_func
-    from flash_attn.bert_padding import \
-        index_first_axis as _index_first_axis
+    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
+    from flash_attn.bert_padding import index_first_axis as _index_first_axis
     from flash_attn.bert_padding import pad_input as _pad_input
     from flash_attn.bert_padding import unpad_input as _unpad_input
 
     flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
     pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
     has_flash_attn = True
@@ -164,7 +161,7 @@ class InternLM2RotaryEmbedding(nn.Module):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -193,7 +190,7 @@ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
         t = t / self.scaling_factor
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
@@ -223,7 +220,7 @@ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
             self.register_buffer('inv_freq', inv_freq, persistent=False)
 
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -810,6 +807,9 @@ class InternLM2Model(InternLM2PreTrainedModel):
         self.padding_idx = config.pad_token_id
         self.vocab_size = config.vocab_size
         self.config = config
+        if not has_flash_attn:
+            self.config.attn_implementation = 'eager'
+            print('Warning: Flash attention is not available, using eager attention instead.')
 
         self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
 
@@ -870,7 +870,7 @@ class InternLM2Model(InternLM2PreTrainedModel):
 
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 
-        if self.config.attn_implementation == 'flash_attention_2'
+        if self.config.attn_implementation == 'flash_attention_2':
             _import_flash_attn()
 
         # retrieve input_ids and inputs_embeds
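Two things change in this file: the rotary-embedding caches now build the position vector with a plain `arange` followed by an explicit cast to the `inv_freq` dtype, and the model falls back to eager attention with a warning when flash-attn is unavailable. A simplified, free-standing sketch of the cache computation follows; the `dim` and `base` defaults are illustrative and the function is not part of the module.

```python
# Simplified sketch of what the patched `_set_cos_sin_cache` computes.
import torch

def build_rope_cache(seq_len, dim=64, base=10000.0, device='cpu', dtype=torch.bfloat16):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
    # The commit's form: integer positions from arange, then a cast to the inv_freq dtype,
    # rather than passing dtype= to arange directly.
    t = torch.arange(seq_len, device=device).to(dtype=inv_freq.dtype)
    freqs = torch.einsum('i,j->ij', t, inv_freq)  # (seq_len, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)       # concatenated, not interleaved, per the code's comment
    return emb.cos().to(dtype), emb.sin().to(dtype)

cos, sin = build_rope_cache(seq_len=8, dim=8)
print(cos.shape, sin.shape)  # torch.Size([8, 8]) torch.Size([8, 8])
```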
modeling_internvl_chat.py
CHANGED
@@ -233,7 +233,7 @@ class InternVLChatModel(PreTrainedModel):
                    return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
                    IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
         if history is not None or return_history:
-            print(
+            print('Now multi-turn chat is not supported in batch_chat.')
             raise NotImplementedError
         img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
         self.img_context_token_id = img_context_token_id
@@ -241,9 +241,9 @@ class InternVLChatModel(PreTrainedModel):
             eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
         else:
             eos_token_id = tokenizer.eos_token_id
 
         from .conversation import get_conv_template
 
         queries = []
         image_bs = pixel_values.shape[0]
         # print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
@@ -260,7 +260,7 @@ class InternVLChatModel(PreTrainedModel):
         input_ids = model_inputs['input_ids'].cuda()
         attention_mask = model_inputs['attention_mask'].cuda()
         generation_config['eos_token_id'] = eos_token_id
 
         generation_output = self.generate(
             pixel_values=pixel_values,
             input_ids=input_ids,
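Besides whitespace, this change gives `batch_chat` an explicit message before raising `NotImplementedError` when a history is passed. The surrounding hunk also shows how the stop token is resolved for the InternLM2 chat template before calling `generate`; here is a small sketch of that step, with the tokenizer loading added purely for illustration.

```python
# Sketch: resolving the end-of-turn token the way the hunk does before generation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('OpenGVLab/Mini-InternVL-Chat-2B-V1-5', trust_remote_code=True)
eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542 for the InternLM2 template

generation_config = dict(max_new_tokens=512, do_sample=False)
generation_config['eos_token_id'] = eos_token_id  # generation stops at the chat end-of-turn marker
print(eos_token_id)
```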
preprocessor_config.json
CHANGED
@@ -16,4 +16,4 @@
   ],
   "resample": 3,
   "size": 448
-}
+}