Upload folder using huggingface_hub

Files changed (8)

README.md +32 -22
config.json +2 -1
configuration_intern_vit.py +2 -0
conversation.py +0 -1
modeling_intern_vit.py +16 -9
modeling_internlm2.py +18 -3
modeling_internvl_chat.py +5 -5
preprocessor_config.json +1 -1

README.md CHANGED

@@ -1,51 +1,56 @@
 ---
 license: mit
 datasets:
-- laion/laion2B-en
-- laion/laion-coco
-- laion/laion2B-multi
-- kakaobrain/coyo-700m
-- conceptual_captions
-- wanng/wukong100m
 pipeline_tag: visual-question-answering
 ---
-# Model Card for InternVL-Chat-V1.5-Int8
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
 </p>
 > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
-\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\]  \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\]  \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
 We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
-We introduce three simple designs:
-1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs.
-2. Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448 &times; 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input.
-3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
 ## Model Details
 - **Model Type:** multimodal large language model (MLLM)
 - **Model Stats:**
   - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
   - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
   - Params: 25.5B
 - **Training Strategy:**
   - Learnable component in the pretraining stage: ViT + MLP
   - Learnable component in the finetuning stage: ViT + MLP + LLM
   - For more details on training hyperparameters, take a look at our code: [pretrain](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_pretrain.sh) | [finetune](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_finetune.sh)
 ## Released Models
-| Model                                                      | Vision Foundation Model                                                     | Release Date           |Note                                |
-| :---------------------------------------------------------:|:--------------------------------------------------------------------------: |:----------------------:| :---------------------------------- |
-| InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5))      | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5))    |2024.04.18       |          support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new)|
-| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2))    |2024.02.21     |        more SFT data and stronger  |
-| InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) )      |InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2))     |2024.02.11       |             scaling up LLM to 34B       |
-| InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1))      |InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0))    |2024.01.24         |   support Chinese and stronger OCR   |
 ## Architecture
@@ -70,7 +75,7 @@ We introduce three simple designs:
 We provide an example code to run InternVL-Chat-V1.5 using `transformers`.
-You also can use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
 > Please use transformers==4.37.2 to ensure the model works normally.
@@ -231,7 +236,6 @@ responses = model.batch_chat(tokenizer, pixel_values,
 for question, response in zip(questions, responses):
     print(question)
     print(response)
 ```
 ## Citation
@@ -245,12 +249,18 @@ If you find this project useful in your research, please consider citing:
   journal={arXiv preprint arXiv:2312.14238},
   year={2023}
 }
 ```
 ## License
-This project is released under the MIT license.
 ## Acknowledgement
-InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!

 ---
 license: mit
 datasets:
+  - laion/laion2B-en
+  - laion/laion-coco
+  - laion/laion2B-multi
+  - kakaobrain/coyo-700m
+  - conceptual_captions
+  - wanng/wukong100m
 pipeline_tag: visual-question-answering
 ---
+# Model Card for InternVL-Chat-V1-5-Int8
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
 </p>
 > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
+\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\]  \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\]  \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]
 We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
+We introduce three simple designs:
+1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs.
+2. Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448 × 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input.
+3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
 ## Model Details
 - **Model Type:** multimodal large language model (MLLM)
 - **Model Stats:**
   - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
   - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
   - Params: 25.5B
 - **Training Strategy:**
   - Learnable component in the pretraining stage: ViT + MLP
   - Learnable component in the finetuning stage: ViT + MLP + LLM
   - For more details on training hyperparameters, take a look at our code: [pretrain](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_pretrain.sh) | [finetune](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_finetune.sh)
 ## Released Models
+|                                              Model                                               |                                     Vision Foundation Model                                     | Release Date | Note                                                                                                                                                               |
+| :----------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------------: | :----------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+|      InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5))       | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) |  2024.04.18  | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
+| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |  2024.02.21  | more SFT data and stronger                                                                                                                                         |
+|      InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) )      | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |  2024.02.11  | scaling up LLM to 34B                                                                                                                                              |
+|      InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1))       | InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) |  2024.01.24  | support Chinese and stronger OCR                                                                                                                                   |
 ## Architecture
 We provide an example code to run InternVL-Chat-V1.5 using `transformers`.
+You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
 > Please use transformers==4.37.2 to ensure the model works normally.
 for question, response in zip(questions, responses):
     print(question)
     print(response)
 ```
 ## Citation
   journal={arXiv preprint arXiv:2312.14238},
   year={2023}
 }
+@article{chen2024far,
+  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
+  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+  journal={arXiv preprint arXiv:2404.16821},
+  year={2024}
+}
 ```
 ## License
+This project is released under the MIT license.
 ## Acknowledgement
+InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!

config.json CHANGED

@@ -6,7 +6,8 @@
   ],
   "auto_map": {
     "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
-    "AutoModel": "modeling_internvl_chat.InternVLChatModel"
   },
   "downsample_ratio": 0.5,
   "dynamic_image_size": true,

   ],
   "auto_map": {
     "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
+    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
+    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
   },
   "downsample_ratio": 0.5,
   "dynamic_image_size": true,
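
Note on the `auto_map` change above: the new entry registers the same class, `InternVLChatModel`, under a second Auto class. The sketch below only illustrates how such `module.ClassName` references can be split into a module file and a class name. `resolve_auto_map` is a hypothetical helper, not part of `transformers`; the library does the actual resolution internally when a repo is loaded with `trust_remote_code=True`.

```python
import json

# Fragment mirroring the updated config.json (only auto_map is shown).
CONFIG_TEXT = """
{
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  }
}
"""


def resolve_auto_map(config_text):
    """Hypothetical helper: map each Auto class to (module file, class name)."""
    auto_map = json.loads(config_text)["auto_map"]
    resolved = {}
    for auto_class, reference in auto_map.items():
        # "module.ClassName" -> ("module.py", "ClassName")
        module_name, class_name = reference.rsplit(".", 1)
        resolved[auto_class] = (module_name + ".py", class_name)
    return resolved


resolved = resolve_auto_map(CONFIG_TEXT)
for auto_class, (module_file, class_name) in resolved.items():
    print(f"{auto_class}: {class_name} from {module_file}")
```

With this change, `AutoModel` and `AutoModelForCausalLM` both point at the same class in `modeling_internvl_chat.py`.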

configuration_intern_vit.py CHANGED

@@ -73,6 +73,7 @@ class InternVisionConfig(PretrainedConfig):
             num_hidden_layers=48,
             use_flash_attn=True,
             hidden_act='gelu',
             layer_norm_eps=1e-6,
             dropout=0.0,
             drop_path_rate=0.0,
@@ -97,6 +98,7 @@ class InternVisionConfig(PretrainedConfig):
         self.attention_dropout = attention_dropout
         self.layer_norm_eps = layer_norm_eps
         self.hidden_act = hidden_act
         self.qkv_bias = qkv_bias
         self.qk_normalization = qk_normalization
         self.use_flash_attn = use_flash_attn

             num_hidden_layers=48,
             use_flash_attn=True,
             hidden_act='gelu',
+            norm_type='rms_norm',
             layer_norm_eps=1e-6,
             dropout=0.0,
             drop_path_rate=0.0,
         self.attention_dropout = attention_dropout
         self.layer_norm_eps = layer_norm_eps
         self.hidden_act = hidden_act
+        self.norm_type = norm_type
         self.qkv_bias = qkv_bias
         self.qk_normalization = qk_normalization
         self.use_flash_attn = use_flash_attn

conversation.py CHANGED

@@ -1258,4 +1258,3 @@ register_conv_template(
         sep2='</s>',
     )
 )

         sep2='</s>',
     )
 )

modeling_intern_vit.py CHANGED

@@ -26,9 +26,9 @@ try:
     except:  # v2
         from flash_attn.flash_attn_interface import \
             flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
     from flash_attn.bert_padding import pad_input, unpad_input
     has_flash_attn = True
 except:
     print('FlashAttention is not installed.')
@@ -47,12 +47,12 @@ class FlashAttention(nn.Module):
         attention_dropout: The dropout rate to apply to the attention
                            (default: 0.0)
     """
     def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
         super().__init__()
         self.softmax_scale = softmax_scale
         self.dropout_p = attention_dropout
     def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
                 max_s=None, need_weights=False):
         """Implements the multihead softmax attention.
@@ -65,7 +65,7 @@ class FlashAttention(nn.Module):
         assert not need_weights
         assert qkv.dtype in [torch.float16, torch.bfloat16]
         assert qkv.is_cuda
         if cu_seqlens is None:
             batch_size = qkv.shape[0]
             seqlen = qkv.shape[1]
@@ -97,7 +97,7 @@ class FlashAttention(nn.Module):
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
         return output, None
@@ -129,6 +129,12 @@ except Exception:
     pass
 class InternVisionEmbeddings(nn.Module):
     def __init__(self, config: InternVisionConfig):
         super().__init__()
@@ -154,7 +160,7 @@ class InternVisionEmbeddings(nn.Module):
         target_dtype = pos_embed.dtype
         pos_embed = pos_embed.float().reshape(
             1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
-        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False).\
             reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
         return pos_embed
@@ -267,11 +273,12 @@ class InternVisionEncoderLayer(nn.Module):
         super().__init__()
         self.embed_dim = config.hidden_size
         self.intermediate_size = config.intermediate_size
         self.attn = InternAttention(config)
         self.mlp = InternMLP(config)
-        self.norm1 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
-        self.norm2 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
         self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
         self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))

     except:  # v2
         from flash_attn.flash_attn_interface import \
             flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
     from flash_attn.bert_padding import pad_input, unpad_input
     has_flash_attn = True
 except:
     print('FlashAttention is not installed.')
         attention_dropout: The dropout rate to apply to the attention
                            (default: 0.0)
     """
     def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
         super().__init__()
         self.softmax_scale = softmax_scale
         self.dropout_p = attention_dropout
     def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
                 max_s=None, need_weights=False):
         """Implements the multihead softmax attention.
         assert not need_weights
         assert qkv.dtype in [torch.float16, torch.bfloat16]
         assert qkv.is_cuda
         if cu_seqlens is None:
             batch_size = qkv.shape[0]
             seqlen = qkv.shape[1]
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
         return output, None
     pass
+NORM2FN = {
+    'rms_norm': InternRMSNorm,
+    'layer_norm': nn.LayerNorm,
+}
 class InternVisionEmbeddings(nn.Module):
     def __init__(self, config: InternVisionConfig):
         super().__init__()
         target_dtype = pos_embed.dtype
         pos_embed = pos_embed.float().reshape(
             1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
+        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
             reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
         return pos_embed
         super().__init__()
         self.embed_dim = config.hidden_size
         self.intermediate_size = config.intermediate_size
+        self.norm_type = config.norm_type
         self.attn = InternAttention(config)
         self.mlp = InternMLP(config)
+        self.norm1 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
+        self.norm2 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
         self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
         self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
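
Note on the `NORM2FN` change above: the encoder layer now looks up its normalization class from `config.norm_type` instead of hard-coding `InternRMSNorm`. Because the config default is `'rms_norm'`, existing checkpoints keep the previous behavior. The sketch below shows the same dispatch-table pattern in plain Python, with list-based stand-ins for the two norms. The function names and `NORM2FN_SKETCH` are illustrative assumptions, and the learnable scale and bias of the real modules are omitted.

```python
import math


def rms_norm(x, eps=1e-6):
    """Divide x by its root-mean-square; the mean is not subtracted."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Same shape as NORM2FN in the diff: config string -> normalization.
NORM2FN_SKETCH = {
    'rms_norm': rms_norm,
    'layer_norm': layer_norm,
}


def apply_norm(norm_type, x, eps=1e-6):
    """Look up the norm by name, as the encoder layer does with config.norm_type."""
    return NORM2FN_SKETCH[norm_type](x, eps=eps)


x = [1.0, 2.0, 3.0, 4.0]
rms_out = apply_norm('rms_norm', x)
ln_out = apply_norm('layer_norm', x)
print(rms_out)
print(ln_out)
```

An unknown `norm_type` raises `KeyError` at construction time in this pattern, which surfaces a misconfigured value early.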

modeling_internlm2.py CHANGED

@@ -48,6 +48,18 @@ _CONFIG_FOR_DOC = 'InternLM2Config'
 flash_attn_func, flash_attn_varlen_func = None, None
 pad_input, index_first_axis, unpad_input = None, None, None
 def _import_flash_attn():
@@ -149,7 +161,7 @@ class InternLM2RotaryEmbedding(nn.Module):
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -178,7 +190,7 @@ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
         t = t / self.scaling_factor
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
@@ -208,7 +220,7 @@ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
             self.register_buffer('inv_freq', inv_freq, persistent=False)
-        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -795,6 +807,9 @@ class InternLM2Model(InternLM2PreTrainedModel):
         self.padding_idx = config.pad_token_id
         self.vocab_size = config.vocab_size
         self.config = config
         self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)

 flash_attn_func, flash_attn_varlen_func = None, None
 pad_input, index_first_axis, unpad_input = None, None, None
+try:
+    from flash_attn import flash_attn_func as _flash_attn_func
+    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
+    from flash_attn.bert_padding import index_first_axis as _index_first_axis
+    from flash_attn.bert_padding import pad_input as _pad_input
+    from flash_attn.bert_padding import unpad_input as _unpad_input
+    flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
+    pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
+    has_flash_attn = True
+except:
+    has_flash_attn = False
 def _import_flash_attn():
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
         t = t / self.scaling_factor
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
             self.register_buffer('inv_freq', inv_freq, persistent=False)
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
         self.padding_idx = config.pad_token_id
         self.vocab_size = config.vocab_size
         self.config = config
+        if not has_flash_attn:
+            self.config.attn_implementation = 'eager'
+            print('Warning: Flash attention is not available, using eager attention instead.')
         self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
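
Note on the flash-attention fallback above: the module now records at import time whether `flash_attn` is usable and switches the config to `'eager'` when it is not. A minimal sketch of a similar check follows. `resolve_attn_implementation` is a hypothetical helper, and the `'flash_attention_2'` string in the usage lines is a placeholder value. Unlike the `try`/`except` in the diff, `importlib.util.find_spec` only tests whether the package can be located, so it would not catch a package that is installed but fails while importing.

```python
import importlib.util


def resolve_attn_implementation(requested, package_name='flash_attn'):
    """Hypothetical helper: keep `requested` if the package can be found, else use 'eager'."""
    if importlib.util.find_spec(package_name) is None:
        print('Warning: Flash attention is not available, using eager attention instead.')
        return 'eager'
    return requested


# A standard-library module is always found, so the request is kept.
kept = resolve_attn_implementation('flash_attention_2', package_name='json')
# A package name that does not exist falls back to eager attention.
fallback = resolve_attn_implementation('flash_attention_2', package_name='no_such_package_xyz')
print(kept, fallback)
```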

modeling_internvl_chat.py CHANGED

@@ -233,7 +233,7 @@ class InternVLChatModel(PreTrainedModel):
                          return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
                          IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
         if history is not None or return_history:
-            print("Now multi-turn chat is not supported in batch_chat.")
             raise NotImplementedError
         img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
         self.img_context_token_id = img_context_token_id
@@ -241,12 +241,12 @@ class InternVLChatModel(PreTrainedModel):
             eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
         else:
             eos_token_id = tokenizer.eos_token_id
         from .conversation import get_conv_template
         queries = []
         image_bs = pixel_values.shape[0]
-        print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
         for idx, image_count in enumerate(image_counts):
             image_token = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_count + IMG_END_TOKEN
             question = image_token + '\n' + questions[idx]
@@ -260,7 +260,7 @@ class InternVLChatModel(PreTrainedModel):
         input_ids = model_inputs['input_ids'].cuda()
         attention_mask = model_inputs['attention_mask'].cuda()
         generation_config['eos_token_id'] = eos_token_id
         generation_output = self.generate(
             pixel_values=pixel_values,
             input_ids=input_ids,

                          return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
                          IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
         if history is not None or return_history:
+            print('Now multi-turn chat is not supported in batch_chat.')
             raise NotImplementedError
         img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
         self.img_context_token_id = img_context_token_id
             eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
         else:
             eos_token_id = tokenizer.eos_token_id
         from .conversation import get_conv_template
         queries = []
         image_bs = pixel_values.shape[0]
+        # print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
         for idx, image_count in enumerate(image_counts):
             image_token = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_count + IMG_END_TOKEN
             question = image_token + '\n' + questions[idx]
         input_ids = model_inputs['input_ids'].cuda()
         attention_mask = model_inputs['attention_mask'].cuda()
         generation_config['eos_token_id'] = eos_token_id
         generation_output = self.generate(
             pixel_values=pixel_values,
             input_ids=input_ids,
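
Note on `batch_chat` above: each question is prefixed with a run of image placeholder tokens whose length is `num_image_token * image_count`. The sketch below reproduces only that string construction. `build_question` is a hypothetical helper, and `NUM_IMAGE_TOKEN = 4` is an assumed small value chosen for readability, not the model's actual setting.

```python
IMG_START_TOKEN = '<img>'
IMG_END_TOKEN = '</img>'
IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'

NUM_IMAGE_TOKEN = 4  # assumption for illustration only


def build_question(question, image_count, num_image_token):
    """Prefix a question with the image placeholder span, as in the loop above."""
    image_token = (IMG_START_TOKEN
                   + IMG_CONTEXT_TOKEN * num_image_token * image_count
                   + IMG_END_TOKEN)
    return image_token + '\n' + question


questions = ['Describe the image.', 'What text is shown?']
image_counts = [1, 3]  # one entry per question, mirroring `image_counts` in the diff
prompts = [build_question(q, n, NUM_IMAGE_TOKEN)
           for q, n in zip(questions, image_counts)]

for prompt in prompts:
    print(prompt.count(IMG_CONTEXT_TOKEN), repr(prompt.split('\n', 1)[1]))
```

The placeholder count per prompt scales linearly with that sample's entry in `image_counts`, so samples in one batch can carry different numbers of placeholders.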

preprocessor_config.json CHANGED

@@ -16,4 +16,4 @@
   ],
   "resample": 3,
   "size": 448
-}

   ],
   "resample": 3,
   "size": 448
+}