Upload InternVideo2_Classification_test

Files changed (4)

README.md +199 -0
config.json +53 -0
model.safetensors +3 -0
modeling_videochat2_classification.py +420 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,53 @@

+{
+  "architectures": [
+    "InternVideo2_Classification_test"
+  ],
+  "auto_map": {
+    "AutoModel": "modeling_videochat2_classification.InternVideo2_Classification_test"
+  },
+  "bridge": {
+    "extra_num_query_token": 64,
+    "name": "qformer",
+    "num_query_token": 32,
+    "qformer_attention_probs_dropout_prob": 0.1,
+    "qformer_drop_path_rate": 0.2,
+    "qformer_hidden_dropout_prob": 0.1
+  },
+  "freeze_bridge": false,
+  "freeze_llm": false,
+  "freeze_vision_encoder": false,
+  "llm": {
+    "lora_alpha": 32,
+    "lora_dropout": 0.1,
+    "lora_r": 16,
+    "name": "mistral_7b",
+    "pretrained_llm_path": "mistralai/Mistral-7B-Instruct-v0.3",
+    "use_lora": true
+  },
+  "loss": {
+    "use_vision_regression_loss": false
+  },
+  "model_config": {},
+  "model_type": "InternVideo2_Classification_test",
+  "pretrained_paths": {},
+  "torch_dtype": "float32",
+  "transformers_version": "4.46.1",
+  "use_flash_attention": true,
+  "vision_encoder": {
+    "checkpoint_num": 48,
+    "d_model": 1408,
+    "encoder_embed_dim": 1408,
+    "img_size": 224,
+    "name": "internvideo2-1B",
+    "num_frames": 8,
+    "origin_num_frames": 4,
+    "patch_size": 14,
+    "pretrained": null,
+    "sep_image_video_pos_embed": true,
+    "tubelet_size": 1,
+    "use_checkpoint": true,
+    "vit_add_ln": true,
+    "x_vis_only": true,
+    "x_vis_return_idx": -2
+  }
+}
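
Review note: the `bridge` and `vision_encoder` sections above fix how many tokens flow through the model. The following is a minimal sketch, not part of the commit, that derives those counts from an inlined subset of this config. The helper name `visual_token_budget` is hypothetical, and the patch count ignores any class token the encoder may add.

```python
import json

# Subset of the config.json added in this commit, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
  "bridge": {"extra_num_query_token": 64, "name": "qformer", "num_query_token": 32},
  "vision_encoder": {"img_size": 224, "patch_size": 14, "num_frames": 8, "tubelet_size": 1}
}
"""


def visual_token_budget(cfg: dict) -> dict:
    """Hypothetical helper: derive token counts from the bridge and vision_encoder sections."""
    bridge = cfg["bridge"]
    vis = cfg["vision_encoder"]
    patches_per_side = vis["img_size"] // vis["patch_size"]
    patches_per_frame = patches_per_side ** 2
    temporal_steps = vis["num_frames"] // vis["tubelet_size"]
    return {
        # Tokens the Q-Former hands to the LLM per clip.
        "query_tokens": bridge["num_query_token"] + bridge["extra_num_query_token"],
        # Spatial patches per frame, excluding any class token (assumption).
        "patches_per_frame": patches_per_frame,
        "patch_tokens_per_clip": patches_per_frame * temporal_steps,
    }


budget = visual_token_budget(json.loads(CONFIG_TEXT))
print(budget)
```

The resulting `query_tokens` value matches the literal `96` that `build_input_ids` reserves for each placeholder in `modeling_videochat2_classification.py`, so changing either query-token setting in the config would also require changing that literal.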

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c2e92eec0623bf8e345a2310b4baff5fd2ecb0897a3b6eb94e5de89951a2de3c
+size 42488
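
Review note: the pointer above records a payload of only 42488 bytes, far too small for a vision encoder plus a 7B LLM. The sketch below, which is not part of the commit, parses the pointer and compares its size with the float32 parameter bytes of the two `nn.Conv2d` layers defined in `InternVideo2_Classification_test`. The helper names are hypothetical, and attributing the remaining bytes to the safetensors header is an assumption that has not been checked against the file itself.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


def conv2d_param_count(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a square-kernel Conv2d layer (groups=1, bias=True)."""
    return out_channels * in_channels * kernel * kernel + out_channels


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:c2e92eec0623bf8e345a2310b4baff5fd2ecb0897a3b6eb94e5de89951a2de3c
size 42488
"""

pointer = parse_lfs_pointer(POINTER)

# conv1 = nn.Conv2d(1, 20, 5) and conv2 = nn.Conv2d(20, 20, 5) in the test class.
n_params = conv2d_param_count(1, 20, 5) + conv2d_param_count(20, 20, 5)
tensor_bytes = n_params * 4  # config.json declares torch_dtype float32
overhead = pointer["size"] - tensor_bytes

print(n_params, tensor_bytes, overhead)
```

Under these assumptions the file holds just the two convolution layers of the test class, which is consistent with `architectures` in `config.json` naming `InternVideo2_Classification_test` rather than the full `InternVideo2_Classification` model.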

modeling_videochat2_classification.py ADDED

	@@ -0,0 +1,420 @@

+import os
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.cuda.amp import autocast as autocast
+from typing import Optional
+from modeling_internvideo2_vit import pretrain_internvideo2_giant_patch14_224_clean
+from modeling_qformer import build_qformer
+# from .flash_attention_class import FlashAttention
+from model_config import VideoChat2Config
+from transformers import AutoTokenizer, AutoModel, AutoConfig, PreTrainedModel, PretrainedConfig
+import logging
+logger = logging.getLogger(__name__)
+token = os.environ.get('HF_TOKEN')  # None when unset, instead of raising KeyError at import time
+IMG_TOKEN = "[<IMG_PLH>]"
+VID_TOKEN = "[<VID_PLH>]"
+DEFAULT_PAD_TOKEN = "[PAD]"
+DEFAULT_BOS_TOKEN = '<s>'
+DEFAULT_EOS_TOKEN = '</s>'
+DEFAULT_UNK_TOKEN = "<unk>"
+DEFAULT_IMAGE_TOKEN = "[IMAGETOKEN]"
+DEFAULT_VIDEO_TOKEN = "[VIDEOTOKEN]"
+DEFAULT_IMG_PLACEHOLDER = "[<IMG_PLH>]"
+DEFAULT_VID_PLACEHOLDER = "[<VID_PLH>]"
+def disabled_train(self, mode=True):
+    """Overwrite model.train with this function to make sure train/eval mode
+    does not change anymore."""
+    return self
+def freeze_module(module):
+    for _, param in module.named_parameters():
+        param.requires_grad = False
+    module = module.eval()
+    module.train = disabled_train
+    return module
+class InternVideo2_Classification(PreTrainedModel):
+    config_class = VideoChat2Config
+    def __init__(self, config):
+        self.model_config = config.model_config
+        # config.model_config = None
+        super().__init__(config)
+        self.build_vision_encoder()
+        self.build_llm()
+        self.build_bridge()
+        # NOTE place it after freeze llm
+        for n, p in self.named_parameters():
+            if p.requires_grad:
+                logger.info(f'{n} requires_grad')
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        image: Optional[torch.Tensor] = None,
+        video: Optional[torch.Tensor] = None,
+        instruction = None,
+        video_idx = None,
+        image_idx = None,
+    ):
+        if getattr(self, "use_vision_regression_loss", False):  # attribute is never set in __init__
+            text_embeds, visual, visual_idx = self.pad_text_embeds(input_ids=input_ids, image=image,video=video, return_visual=True, video_idx=video_idx, image_idx=image_idx, instruction = instruction)
+        else:
+            text_embeds = self.pad_text_embeds(input_ids=input_ids, image=image, video=video, return_visual=False, video_idx=video_idx, image_idx=image_idx,  instruction = instruction)
+        outputs = self.lm(
+            inputs_embeds=text_embeds,
+            attention_mask=attention_mask,
+            labels=labels,
+            output_hidden_states=True,
+            return_dict=True,
+        )
+        return outputs
+    def build_vision_encoder(self):
+        # Build the InternVideo2-1B vision encoder here (simplified: it takes no extra arguments).
+        # Note: the pretrained InternVideo2 weights have not been loaded.
+        if 'internvideo2' in self.model_config.vision_encoder.name.lower():
+            encoder_name = self.model_config.vision_encoder.name
+            logger.info(f"Build vision_encoder: {encoder_name}")
+            if encoder_name == 'internvideo2-1B':
+                self.vision_encoder = pretrain_internvideo2_giant_patch14_224_clean(self.model_config)
+            else:
+                raise ValueError(f"Not implemented: {encoder_name}")
+        else:
+            raise NotImplementedError(self.model_config.vision_encoder.name)
+        if self.model_config.vision_encoder.vit_add_ln:
+            self.vision_layernorm = nn.LayerNorm(self.model_config.vision_encoder.encoder_embed_dim, eps=1e-12)
+        else:
+            self.vision_layernorm = nn.Identity()
+        self.freeze_vision_encoder = self.model_config.get("freeze_vision_encoder", False)
+        if self.freeze_vision_encoder:
+            logger.info("freeze vision encoder")
+            freeze_module(self.vision_encoder)
+            freeze_module(self.vision_layernorm)
+    def build_bridge(self):
+        # Q-Former to LM: 768 (Q-Former hidden dim) -> LM hidden size
+        self.project_up = nn.Linear(768, self.lm.config.hidden_size)  # TODO: is the bias needed?
+        # LM to Q-Former: LM hidden size -> 768
+        self.project_down = nn.Linear(self.lm.config.hidden_size, 768)
+        if 'qformer' in self.model_config.bridge.name.lower():
+            from transformers import BertTokenizer
+            self.qformer_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", truncation_side="left")
+            self.qformer_tokenizer.add_special_tokens({"bos_token": "[DEC]"})
+            self.qformer_tokenizer.padding_side = "left"
+            if self.model_config.bridge.name == 'qformer':
+                self.qformer, self.query_tokens = build_qformer(
+                        self.model_config.bridge.num_query_token, self.model_config.vision_encoder.encoder_embed_dim,
+                        qformer_hidden_dropout_prob=self.model_config.bridge.qformer_hidden_dropout_prob,
+                        qformer_attention_probs_dropout_prob=self.model_config.bridge.qformer_attention_probs_dropout_prob,
+                        qformer_drop_path_rate=self.model_config.bridge.qformer_drop_path_rate,
+                )
+            self.qformer.resize_token_embeddings(len(self.qformer_tokenizer))
+            self.qformer.cls = None
+            self.extra_num_query_token = self.model_config.bridge.extra_num_query_token
+            if self.model_config.bridge.extra_num_query_token > 0:
+                logger.info(f"Add extra {self.model_config.bridge.extra_num_query_token} tokens in QFormer")
+                self.extra_query_tokens = nn.Parameter(
+                    torch.zeros(1, self.model_config.bridge.extra_num_query_token, self.query_tokens.shape[-1])
+                )
+            self.freeze_bridge = self.model_config.get("freeze_bridge", False)
+            if self.freeze_bridge:
+                logger.info("freeze bridge")
+                freeze_module(self.qformer)
+                self.query_tokens.requires_grad = False
+    def build_llm(self):
+        self.lm_name = self.model_config.llm.name
+        if self.model_config.llm.name == 'mistral_7b':
+            from transformers import AutoModelForSequenceClassification
+            config = AutoConfig.from_pretrained(
+                self.model_config.llm.pretrained_llm_path,
+                torch_dtype=torch.bfloat16,
+                token=token,
+                # attn_implementation="flash_attention_2",
+            )
+            self.lm = AutoModelForSequenceClassification.from_config(config)
+        elif self.model_config.llm.name == 'internlm_20b':
+            from transformers import AutoModelForSequenceClassification
+            self.lm = AutoModelForSequenceClassification.from_pretrained(
+                self.model_config.llm.pretrained_llm_path,
+                torch_dtype=torch.bfloat16,
+                trust_remote_code=True,
+            )
+            self.lm.gradient_checkpointing = True
+            self.lm._set_gradient_checkpointing()
+        elif self.model_config.llm.name == 'internlm2_5_7b':
+            from transformers import AutoModelForSequenceClassification
+            self.lm = AutoModelForSequenceClassification.from_pretrained(
+                self.model_config.llm.pretrained_llm_path,
+                torch_dtype=torch.bfloat16,
+                trust_remote_code=True,
+                local_files_only=True,
+            )
+        else:
+            raise NotImplementedError(self.model_config.llm.name)
+        self.freeze_llm = self.model_config.get("freeze_llm", True)
+        logger.info(f'freeze_llm: {self.freeze_llm}')
+        if self.freeze_llm:
+            logger.info("freeze llm")
+            freeze_module(self.lm)
+        if self.model_config.llm.use_lora:
+            self.use_lora = True
+            from peft import get_peft_model, LoraConfig, TaskType
+            logger.info("Use lora")
+            if self.model_config.llm.name == 'internlm_20b':
+                peft_config = LoraConfig(
+                    task_type=TaskType.CAUSAL_LM, inference_mode=False,
+                    r=self.model_config.llm.lora_r, lora_alpha=self.model_config.llm.lora_alpha, lora_dropout=self.model_config.llm.lora_dropout,
+                    target_modules=['wqkv', 'wo', 'w1', 'w2', 'w3', 'output']
+                )
+            else:
+                peft_config = LoraConfig(
+                    task_type=TaskType.CAUSAL_LM, inference_mode=False,
+                    r=self.model_config.llm.lora_r, lora_alpha=self.model_config.llm.lora_alpha, lora_dropout=self.model_config.llm.lora_dropout,
+                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                                    "gate_proj", "up_proj", "down_proj", "lm_head"]
+                )
+            self.lm = get_peft_model(self.lm, peft_config)
+            self.lm.enable_input_require_grads()
+            self.lm.print_trainable_parameters()
+        else:
+            self.use_lora = False
+    def build_conversation(self,instruction, user_prompt,media_type='video',msg=''):
+        conversation = ""
+        if instruction:
+            conversation += instruction
+        conversation += ("[INST]" + " ")
+        if media_type == 'image':
+            conversation +=( "<Image>" + IMG_TOKEN + "</Image>")#*ilen
+        else:
+            conversation += ("<Video>" + VID_TOKEN + "</Video>")#*ilen
+        conversation += (msg.rstrip() + "[/INST]")
+        conversation += (" [INST] " + user_prompt + " [/INST]")
+        conversation += ("")
+        return conversation
+    def pad_text_embeds(
+        self,
+        input_ids: torch.LongTensor = None,
+        image: Optional[torch.Tensor] = None,
+        video: Optional[torch.Tensor] = None,
+        image_idx = None,
+        video_idx = None,
+        return_visual: bool = False,
+        instruction = None,
+    ):
+        # text_embeds
+        text_embeds = self.lm.get_input_embeddings()(input_ids.long()).detach()
+        visual = None
+        visual_idx = None
+        if image is not None:
+            B, T, C, H, W = image.shape
+            image = image.permute(0, 2, 1, 3, 4)
+            prompt_image_embeds = self.encode_vision(image, instruction=instruction)
+            visual = prompt_image_embeds
+            prompt_image_embeds = self.project_up(prompt_image_embeds)
+            prompt_image_embeds = prompt_image_embeds.view(-1, prompt_image_embeds.shape[-1])
+            visual_idx = image_idx
+            text_embeds[image_idx == 1] = text_embeds[image_idx == 1] * 0 + prompt_image_embeds.to(text_embeds.device)
+        elif video is not None:
+            if len(video.shape) == 5:
+                B, T, C, H, W = video.shape
+                N = 1
+            else:
+                B, N, T, C, H, W = video.shape
+            video = video.reshape(B*N, T, C, H, W).permute(0, 2, 1, 3, 4)
+            prompt_video_embeds = self.encode_vision(video, instruction=instruction)
+            visual = prompt_video_embeds
+            prompt_video_embeds = self.project_up(prompt_video_embeds)
+            prompt_video_embeds = prompt_video_embeds.view(-1, prompt_video_embeds.shape[-1])
+            visual_idx = video_idx
+            text_embeds[video_idx == 1] = text_embeds[video_idx == 1] * 0 + prompt_video_embeds.to(text_embeds.device).to(text_embeds.dtype)
+        else:
+            logger.warning(f"No visual input received, input_ids: {input_ids}")
+        if return_visual:
+            return text_embeds, visual, visual_idx
+        return text_embeds
+    def encode_vision(
+        self,
+        image,
+        instruction
+    ):
+        device = image.device
+        B = image.shape[0]
+        T = image.shape[2]
+        use_image = T == 1
+        image_embeds = self.vision_encoder(image, use_image=use_image)
+        C = image_embeds.shape[-1]
+        image_embeds = image_embeds.reshape(B, -1, C)
+        image_embeds = self.vision_layernorm(image_embeds).to(device)  # [B, T*L, C]
+        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(device)
+        extra = [self.extra_query_tokens] if self.extra_num_query_token > 0 else []
+        query_tokens = torch.cat([self.query_tokens, *extra], dim=1)  # base tokens only when no extras are configured
+        query_tokens = query_tokens.expand(image_embeds.shape[0], -1, -1)
+        if instruction is not None:
+            text_Qformer = self.qformer_tokenizer(
+                instruction,
+                padding='longest',
+                truncation=True,
+                max_length=512,
+                return_tensors="pt",
+            ).to(image_embeds.device)
+            query_atts = torch.ones(query_tokens.size()[:-1], dtype=torch.long).to(image_embeds.device)
+            Qformer_atts = torch.cat([query_atts, text_Qformer.attention_mask], dim=1)
+            query_output = self.qformer.bert(
+                text_Qformer.input_ids,
+                attention_mask=Qformer_atts,
+                query_embeds=query_tokens,
+                encoder_hidden_states=image_embeds,
+                encoder_attention_mask=image_atts,
+                return_dict=True,
+            )
+        else:
+            query_output = self.qformer.bert(
+                query_embeds=query_tokens,
+                encoder_hidden_states=image_embeds,
+                encoder_attention_mask=image_atts,
+                return_dict=True,
+            )
+        return query_output.last_hidden_state[:, :query_tokens.size(1), :]
+    def build_input_ids(
+            self,
+            tokenizer,
+            conversation,
+            max_length,
+            add_special_tokens,
+            truncation,
+            image = None,
+            video = None,
+            padding = "longest",
+            return_tensors = "pt",
+            image_placeholder: str = DEFAULT_IMG_PLACEHOLDER,
+            video_placeholder: str = DEFAULT_VID_PLACEHOLDER,
+    ):
+        input_ids = []
+        indexs = []
+        attention_mask = []
+        start, total_len = 0, 0
+        while True:
+            index1 = conversation.find(image_placeholder, start)
+            index2 = conversation.find(video_placeholder, start)
+            if index1 == -1 and index2 == -1:
+                index = -1
+            elif index1 == -1:
+                index = index2
+            elif index2 == -1:
+                index = index1
+            else:
+                index = min(index1, index2)
+                assert index != -1
+            if index == -1:
+                inputs = tokenizer(conversation[start:], max_length=max_length-total_len, truncation=truncation, padding=padding, return_tensors=return_tensors)
+            else:
+                inputs = tokenizer(conversation[start:index], max_length=max_length,  truncation=truncation, padding='longest', return_tensors=return_tensors)
+            input_ids += inputs.input_ids
+            attention_mask += inputs.attention_mask
+            total_len += inputs.input_ids[0].shape[0]
+            indexs += torch.zeros_like(inputs.input_ids)
+            if index != -1:  # reserve 96 visual slots (32 + 64 query tokens in config.json)
+                input_ids += [torch.zeros(96).long()]
+                attention_mask += [torch.ones(96).long()]
+                indexs += [torch.ones(96)]
+            if index == -1:
+                return {
+                    'input_ids': torch.cat(input_ids),
+                    'attention_mask': torch.cat(attention_mask),
+                    'index': torch.cat(indexs).to(torch.bool),
+                }
+            start = index + len(image_placeholder if index == index1 else video_placeholder)
+    @property
+    def dtype(self):
+        return self.lm.dtype
+    @property
+    def device(self):
+        return self.lm.device
+class InternVideo2_Classification_test(PreTrainedModel):
+    config_class = VideoChat2Config
+    def __init__(self, config):
+        super().__init__(config)
+        self.conv1 = nn.Conv2d(1, 20, 5)
+        self.conv2 = nn.Conv2d(20, 20, 5)
+    def forward(self, x):
+        x = self.conv1(x)
+        return self.conv2(x)
+    def test_lol(self, x):
+        return x
+if __name__ == "__main__":
+    tokenizer =  AutoTokenizer.from_pretrained('OpenGVLab/InternVideo2-Chat-8B',trust_remote_code=True,use_fast=False)
+    config = AutoConfig.from_pretrained('OpenGVLab/InternVideo2-Chat-8B', torch_dtype=torch.bfloat16,trust_remote_code=True)
+    model = InternVideo2_Classification(config).cuda()
+    B, T, C, H, W = 1, 8, 3, 224, 224
+    video_tensor = torch.randn(B,T,C,H,W).cuda()
+    user_prompt = "this is a user prompt"
+    instruction = "this is an instruction"
+    conversation = model.build_conversation(instruction=instruction, user_prompt=user_prompt, media_type='video')
+    tokenized = model.build_input_ids(tokenizer,conversation,max_length=248,add_special_tokens=True,truncation=False,padding=False,return_tensors='pt')
+    input_ids = tokenized['input_ids'].unsqueeze(0).to(model.device)
+    attn_mask = tokenized['attention_mask'].unsqueeze(0).to(model.device)
+    indexes = tokenized['index'].unsqueeze(0)
+    text_embeds = model.pad_text_embeds(input_ids = input_ids,video = video_tensor,video_idx = indexes)
+    outputs = model.lm(inputs_embeds=text_embeds, attention_mask=attn_mask,output_hidden_states=True,return_dict=True)
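
Review note: `build_input_ids` splits the conversation at each media placeholder, tokenizes the text on either side, and reserves a block of zero ids whose mask marks where visual embeddings are written later in `pad_text_embeds`. The following is a simplified pure-Python sketch of that splitting logic, not the code from this commit. `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the real one, and plain lists replace tensors.

```python
IMG_PLH = "[<IMG_PLH>]"
VID_PLH = "[<VID_PLH>]"
NUM_VISUAL_TOKENS = 96  # 32 query tokens + 64 extra query tokens in config.json


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one nonzero id per word."""
    return [len(word) for word in text.split()]


def build_ids_and_mask(conversation, tokenize=toy_tokenize,
                       placeholders=(IMG_PLH, VID_PLH),
                       n_visual=NUM_VISUAL_TOKENS):
    """Return (input_ids, visual_mask); the mask is True on reserved visual slots."""
    input_ids, visual_mask = [], []
    start = 0
    while True:
        # Locate the earliest remaining placeholder of either kind.
        hits = [(conversation.find(p, start), p) for p in placeholders]
        hits = [(i, p) for i, p in hits if i != -1]
        end = min(hits)[0] if hits else len(conversation)

        # Tokenize the text segment that precedes the placeholder (or the tail).
        ids = tokenize(conversation[start:end])
        input_ids += ids
        visual_mask += [False] * len(ids)
        if not hits:
            return input_ids, visual_mask

        # Reserve slots that will be overwritten by projected visual embeddings.
        input_ids += [0] * n_visual
        visual_mask += [True] * n_visual

        # Advance past the placeholder that actually matched.
        start = end + len(min(hits)[1])


conversation = "[INST] <Video>" + VID_PLH + "</Video> [/INST] [INST] describe the clip [/INST]"
ids, mask = build_ids_and_mask(conversation)
print(len(ids), sum(mask))
```

Advancing by the length of the matched placeholder keeps the loop correct even if the image and video placeholders are given different lengths.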