Your Name committed on
Commit
1dd81d0
1 Parent(s): f4c9b39
Files changed (32)
  1. lora_and_projectors/llm_adapter/README.md +202 -0
  2. lora_and_projectors/llm_adapter/adapter_config.json +32 -0
  3. lora_and_projectors/llm_adapter/adapter_model.safetensors +3 -0
  4. lora_and_projectors/projector/config.json +17 -0
  5. lora_and_projectors/projector/configuration_projector.py +23 -0
  6. lora_and_projectors/projector/model.safetensors +3 -0
  7. lora_and_projectors/projector/modeling_projector.py +51 -0
  8. lora_and_projectors/visual_encoder_adapter/README.md +202 -0
  9. lora_and_projectors/visual_encoder_adapter/adapter_config.json +35 -0
  10. lora_and_projectors/visual_encoder_adapter/adapter_model.safetensors +3 -0
  11. lora_and_projectors/xtuner_config.py +222 -0
  12. mmbench_results/20240307_093448/args.json +19 -0
  13. mmbench_results/20240307_093841/args.json +19 -0
  14. mmbench_results/20240307_094246/args.json +19 -0
  15. mmbench_results/20240307_100202/args.json +19 -0
  16. mmbench_results/20240307_100202/mmbench_result.json +9 -0
  17. mmbench_results/20240307_100202/mmbench_result.xlsx +0 -0
  18. mmbench_results/20240307_100541/args.json +19 -0
  19. mmbench_results/20240307_100541/mmbench_result.xlsx +0 -0
  20. mmbench_results/20240307_101151/args.json +19 -0
  21. mmbench_results/20240307_101151/mmbench_result.json +9 -0
  22. mmbench_results/20240307_101151/mmbench_result.xlsx +0 -0
  23. mmbench_results/20240307_101718/args.json +19 -0
  24. mmbench_results/20240307_101718/mmbench_result.xlsx +0 -0
  25. mmbench_results/20240307_102207/args.json +19 -0
  26. mmbench_results/20240307_102207/mmbench_result.json +10 -0
  27. mmbench_results/20240307_102207/mmbench_result.xlsx +0 -0
  28. modified_transformers/src/transformers/models/siglip/modeling_siglip.py +1299 -0
  29. modified_xtuner/xtuner/dataset/huggingface.py +316 -0
  30. modified_xtuner/xtuner/dataset/llava.py +88 -0
  31. modified_xtuner/xtuner/tools/chat.py +491 -0
  32. modified_xtuner/xtuner/tools/mmbench.py +510 -0
lora_and_projectors/llm_adapter/README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ library_name: peft
3
+ base_model: internlm/internlm2-chat-1_8b
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
200
+ ### Framework versions
201
+
202
+ - PEFT 0.9.1.dev0
lora_and_projectors/llm_adapter/adapter_config.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "internlm/internlm2-chat-1_8b",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 256,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 512,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "w3",
+     "output",
+     "wo",
+     "w2",
+     "wqkv",
+     "w1"
+   ],
+   "task_type": "CAUSAL_LM",
+   "use_dora": false,
+   "use_rslora": false
+ }
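The adapter above is a plain PEFT LoRA checkpoint (rank 512, alpha 256, targeting the InternLM2 attention and MLP projections wqkv/wo/w1/w2/w3/output). A minimal sketch of attaching it to the base LLM with PEFT; paths are relative to this repository, and the full LLaVA-style pipeline additionally needs the projector and visual-encoder adapter further down:

```python
# Minimal sketch: attach the LLM LoRA adapter from this repo to its base model.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-chat-1_8b",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "internlm/internlm2-chat-1_8b", trust_remote_code=True
)

# adapter_config.json above describes the rank-512 LoRA over the attention and
# MLP projections; the path is repo-relative.
llm = PeftModel.from_pretrained(base, "lora_and_projectors/llm_adapter")
llm = llm.merge_and_unload()  # optional: fold the LoRA weights into the base model
```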
lora_and_projectors/llm_adapter/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0990facba0e58ce5084c2b2425c0bd39160f1ddd06a74d2e267fd9e20ac15c07
+ size 1103527968
lora_and_projectors/projector/config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "architectures": [
+     "ProjectorModel"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_projector.ProjectorConfig",
+     "AutoModel": "modeling_projector.ProjectorModel"
+   },
+   "bias": true,
+   "depth": 2,
+   "hidden_act": "gelu",
+   "llm_hidden_size": 2048,
+   "model_type": "projector",
+   "torch_dtype": "float32",
+   "transformers_version": "4.39.0.dev0",
+   "visual_hidden_size": 1152
+ }
lora_and_projectors/projector/configuration_projector.py ADDED
@@ -0,0 +1,23 @@
+ # Copyright (c) OpenMMLab. All rights reserved.
+ from transformers import PretrainedConfig
+
+
+ class ProjectorConfig(PretrainedConfig):
+     model_type = 'projector'
+     _auto_class = 'AutoConfig'
+
+     def __init__(
+         self,
+         visual_hidden_size=4096,
+         llm_hidden_size=4096,
+         depth=2,
+         hidden_act='gelu',
+         bias=True,
+         **kwargs,
+     ):
+         self.visual_hidden_size = visual_hidden_size
+         self.llm_hidden_size = llm_hidden_size
+         self.depth = depth
+         self.hidden_act = hidden_act
+         self.bias = bias
+         super().__init__(**kwargs)
lora_and_projectors/projector/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:91cb6b6f787c8f847eb1cc070f841809046a21e464b2af08069f624fb03b3aaf
+ size 26231144
lora_and_projectors/projector/modeling_projector.py ADDED
@@ -0,0 +1,51 @@
+ # Copyright (c) OpenMMLab. All rights reserved.
+ import torch
+ import torch.nn as nn
+ from transformers import PreTrainedModel
+ from transformers.activations import ACT2FN
+
+ from .configuration_projector import ProjectorConfig
+
+
+ class ProjectorModel(PreTrainedModel):
+     _auto_class = 'AutoModel'
+     config_class = ProjectorConfig
+     base_model_prefix = 'model'
+     supports_gradient_checkpointing = True
+
+     def __init__(self, config: ProjectorConfig) -> None:
+         super().__init__(config)
+         self.gradient_checkpointing = False
+
+         modules = [
+             nn.Linear(
+                 config.visual_hidden_size,
+                 config.llm_hidden_size,
+                 bias=config.bias)
+         ]
+         for _ in range(1, config.depth):
+             modules.append(ACT2FN[config.hidden_act])
+             modules.append(
+                 nn.Linear(
+                     config.llm_hidden_size,
+                     config.llm_hidden_size,
+                     bias=config.bias))
+         self.model = nn.Sequential(*modules)
+
+     def enable_input_require_grads(self):
+
+         def make_inputs_require_grad(module, input, output):
+             output.requires_grad_(True)
+
+         self.model.register_forward_hook(make_inputs_require_grad)
+
+     def _set_gradient_checkpointing(self, module, value=False):
+         if isinstance(module, ProjectorModel):
+             module.gradient_checkpointing = value
+
+     def forward(self, x):
+         if self.gradient_checkpointing and self.training:
+             layer_outputs = torch.utils.checkpoint.checkpoint(self.model, x)
+         else:
+             layer_outputs = self.model(x)
+         return layer_outputs
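Together with config.json above (visual_hidden_size=1152, llm_hidden_size=2048, depth=2), the projector is a two-layer MLP that maps SigLIP patch features into the LLM embedding space. A minimal loading and shape-check sketch; the dummy input is illustrative (384-px inputs with 14-px patches give 27 × 27 = 729 patch tokens):

```python
# Minimal sketch: load the projector checkpoint and check its input/output shapes.
# The auto_map entries in config.json let AutoModel resolve ProjectorModel via
# trust_remote_code, since the modeling/configuration files sit next to it.
import torch
from transformers import AutoModel

projector = AutoModel.from_pretrained(
    "lora_and_projectors/projector", trust_remote_code=True
)

visual_tokens = torch.randn(1, 729, 1152)  # illustrative SigLIP patch features
llm_tokens = projector(visual_tokens)      # Linear(1152->2048), GELU, Linear(2048->2048)
print(llm_tokens.shape)                    # torch.Size([1, 729, 2048])
```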
lora_and_projectors/visual_encoder_adapter/README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ library_name: peft
3
+ base_model: google/siglip-so400m-patch14-384
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
200
+ ### Framework versions
201
+
202
+ - PEFT 0.9.1.dev0
lora_and_projectors/visual_encoder_adapter/adapter_config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": {
+     "base_model_class": "SiglipVisionModel",
+     "parent_library": "transformers.models.siglip.modeling_siglip"
+   },
+   "base_model_name_or_path": "google/siglip-so400m-patch14-384",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 16,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 64,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "q_proj",
+     "fc1",
+     "out_proj",
+     "k_proj",
+     "v_proj",
+     "fc2"
+   ],
+   "task_type": null,
+   "use_dora": false,
+   "use_rslora": false
+ }
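This second adapter is a rank-64 LoRA over the SigLIP vision tower, covering its attention (q_proj, k_proj, v_proj, out_proj) and MLP (fc1, fc2) layers. A minimal sketch of attaching it and pulling features; the visual_select_layer=-2 convention comes from the mmbench args further down, and the dummy pixel values are illustrative:

```python
# Minimal sketch: attach the visual-encoder LoRA adapter and extract patch features.
import torch
from peft import PeftModel
from transformers import SiglipVisionModel

vision_tower = SiglipVisionModel.from_pretrained(
    "google/siglip-so400m-patch14-384", torch_dtype=torch.float16
)
vision_tower = PeftModel.from_pretrained(
    vision_tower, "lora_and_projectors/visual_encoder_adapter"
)

pixel_values = torch.randn(1, 3, 384, 384, dtype=torch.float16)  # illustrative input
with torch.no_grad():
    outputs = vision_tower(pixel_values, output_hidden_states=True)
features = outputs.hidden_states[-2]  # penultimate layer, as in visual_select_layer=-2
```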
lora_and_projectors/visual_encoder_adapter/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c4202e05cca9fa3c97b88f1ada8fd4aeb6a2f433aa4b517952269225f08608fa
+ size 142556624
lora_and_projectors/xtuner_config.py ADDED
@@ -0,0 +1,222 @@
+ SYSTEM = ''
+ accumulative_counts = 8
+ batch_size = 4
+ betas = (
+     0.9,
+     0.999,
+ )
+ custom_hooks = [
+     dict(
+         tokenizer=dict(
+             padding_side='right',
+             pretrained_model_name_or_path='internlm/internlm2-chat-1_8b',
+             trust_remote_code=True,
+             type='transformers.AutoTokenizer.from_pretrained'),
+         type='xtuner.engine.hooks.DatasetInfoHook'),
+     dict(
+         evaluation_images='https://llava-vl.github.io/static/images/view.jpg',
+         evaluation_inputs=[
+             '请描述一下这张照片',
+             'Please describe this picture',
+         ],
+         every_n_iters=500,
+         image_processor=dict(
+             pretrained_model_name_or_path='google/siglip-so400m-patch14-384',
+             trust_remote_code=True,
+             type='transformers.SiglipImageProcessor.from_pretrained'),
+         prompt_template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat',
+         system='',
+         tokenizer=dict(
+             padding_side='right',
+             pretrained_model_name_or_path='internlm/internlm2-chat-1_8b',
+             trust_remote_code=True,
+             type='transformers.AutoTokenizer.from_pretrained'),
+         type='xtuner.engine.hooks.EvaluateChatHook'),
+ ]
+ data_path = './LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
+ data_root = './'
+ dataloader_num_workers = 4
+ default_hooks = dict(
+     checkpoint=dict(
+         by_epoch=False,
+         interval=500,
+         max_keep_ckpts=2,
+         type='mmengine.hooks.CheckpointHook'),
+     logger=dict(
+         interval=10,
+         log_metric_by_epoch=False,
+         type='mmengine.hooks.LoggerHook'),
+     param_scheduler=dict(type='mmengine.hooks.ParamSchedulerHook'),
+     sampler_seed=dict(type='mmengine.hooks.DistSamplerSeedHook'),
+     timer=dict(type='mmengine.hooks.IterTimerHook'))
+ env_cfg = dict(
+     cudnn_benchmark=False,
+     dist_cfg=dict(backend='nccl'),
+     mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
+ evaluation_freq = 500
+ evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
+ evaluation_inputs = [
+     '请描述一下这张照片',
+     'Please describe this picture',
+ ]
+ image_folder = './llava_images'
+ image_processor = dict(
+     pretrained_model_name_or_path='google/siglip-so400m-patch14-384',
+     trust_remote_code=True,
+     type='transformers.SiglipImageProcessor.from_pretrained')
+ launcher = 'pytorch'
+ llava_dataset = dict(
+     data_path='./LLaVA-Instruct-150K/llava_v1_5_mix665k.json',
+     dataset_map_fn='xtuner.dataset.map_fns.llava_map_fn',
+     image_folder='./llava_images',
+     image_processor=dict(
+         pretrained_model_name_or_path='google/siglip-so400m-patch14-384',
+         trust_remote_code=True,
+         type='transformers.SiglipImageProcessor.from_pretrained'),
+     max_length=1472,
+     pad_image_to_square=True,
+     template_map_fn=dict(
+         template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat',
+         type='xtuner.dataset.map_fns.template_map_fn_factory'),
+     tokenizer=dict(
+         padding_side='right',
+         pretrained_model_name_or_path='internlm/internlm2-chat-1_8b',
+         trust_remote_code=True,
+         type='transformers.AutoTokenizer.from_pretrained'),
+     type='xtuner.dataset.LLaVADataset')
+ llm_name_or_path = 'internlm/internlm2-chat-1_8b'
+ load_from = None
+ log_level = 'INFO'
+ log_processor = dict(by_epoch=False)
+ lr = 0.0002
+ max_epochs = 1
+ max_length = 1472
+ max_norm = 1
+ model = dict(
+     freeze_llm=True,
+     freeze_visual_encoder=True,
+     llm=dict(
+         pretrained_model_name_or_path='internlm/internlm2-chat-1_8b',
+         quantization_config=dict(
+             bnb_4bit_compute_dtype='torch.float16',
+             bnb_4bit_quant_type='nf4',
+             bnb_4bit_use_double_quant=True,
+             llm_int8_has_fp16_weight=False,
+             llm_int8_threshold=6.0,
+             load_in_4bit=True,
+             load_in_8bit=False,
+             type='transformers.BitsAndBytesConfig'),
+         torch_dtype='torch.float16',
+         trust_remote_code=True,
+         type='transformers.AutoModelForCausalLM.from_pretrained'),
+     llm_lora=dict(
+         bias='none',
+         lora_alpha=256,
+         lora_dropout=0.05,
+         r=512,
+         task_type='CAUSAL_LM',
+         type='peft.LoraConfig'),
+     pretrained_pth='./work_dirs/pretrain/iter_8721.pth',
+     type='xtuner.model.LLaVAModel',
+     visual_encoder=dict(
+         pretrained_model_name_or_path='google/siglip-so400m-patch14-384',
+         type='transformers.SiglipVisionModel.from_pretrained'),
+     visual_encoder_lora=dict(
+         bias='none',
+         lora_alpha=16,
+         lora_dropout=0.05,
+         r=64,
+         type='peft.LoraConfig'))
+ optim_type = 'torch.optim.AdamW'
+ optim_wrapper = dict(
+     optimizer=dict(
+         betas=(
+             0.9,
+             0.999,
+         ),
+         lr=0.0002,
+         type='torch.optim.AdamW',
+         weight_decay=0),
+     type='DeepSpeedOptimWrapper')
+ param_scheduler = [
+     dict(
+         begin=0,
+         by_epoch=True,
+         convert_to_iter_based=True,
+         end=0.03,
+         start_factor=1e-05,
+         type='mmengine.optim.LinearLR'),
+     dict(
+         begin=0.03,
+         by_epoch=True,
+         convert_to_iter_based=True,
+         end=1,
+         eta_min=0.0,
+         type='mmengine.optim.CosineAnnealingLR'),
+ ]
+ prefetch = 5
+ pretrained_pth = './work_dirs/pretrain/iter_8721.pth'
+ prompt_template = 'xtuner.utils.PROMPT_TEMPLATE.internlm2_chat'
+ randomness = dict(deterministic=False, seed=None)
+ resume = False
+ runner_type = 'FlexibleRunner'
+ save_steps = 500
+ save_total_limit = 2
+ strategy = dict(
+     config=dict(
+         bf16=dict(enabled=True),
+         fp16=dict(enabled=False, initial_scale_power=16),
+         gradient_accumulation_steps='auto',
+         gradient_clipping='auto',
+         train_micro_batch_size_per_gpu='auto',
+         zero_allow_untested_optimizer=True,
+         zero_force_ds_cpu_optimizer=False,
+         zero_optimization=dict(overlap_comm=True, stage=2)),
+     exclude_frozen_parameters=True,
+     gradient_accumulation_steps=8,
+     gradient_clipping=1,
+     train_micro_batch_size_per_gpu=4,
+     type='xtuner.engine.DeepSpeedStrategy')
+ tokenizer = dict(
+     padding_side='right',
+     pretrained_model_name_or_path='internlm/internlm2-chat-1_8b',
+     trust_remote_code=True,
+     type='transformers.AutoTokenizer.from_pretrained')
+ train_cfg = dict(max_epochs=1, type='xtuner.engine.runner.TrainLoop')
+ train_dataloader = dict(
+     batch_size=4,
+     collate_fn=dict(type='xtuner.dataset.collate_fns.default_collate_fn'),
+     dataset=dict(
+         data_path='./LLaVA-Instruct-150K/llava_v1_5_mix665k.json',
+         dataset_map_fn='xtuner.dataset.map_fns.llava_map_fn',
+         image_folder='./llava_images',
+         image_processor=dict(
+             pretrained_model_name_or_path='google/siglip-so400m-patch14-384',
+             trust_remote_code=True,
+             type='transformers.SiglipImageProcessor.from_pretrained'),
+         max_length=1472,
+         pad_image_to_square=True,
+         template_map_fn=dict(
+             template='xtuner.utils.PROMPT_TEMPLATE.internlm2_chat',
+             type='xtuner.dataset.map_fns.template_map_fn_factory'),
+         tokenizer=dict(
+             padding_side='right',
+             pretrained_model_name_or_path='internlm/internlm2-chat-1_8b',
+             trust_remote_code=True,
+             type='transformers.AutoTokenizer.from_pretrained'),
+         type='xtuner.dataset.LLaVADataset'),
+     num_workers=4,
+     prefetch_factor=5,
+     sampler=dict(
+         length_property='modality_length',
+         per_device_batch_size=32,
+         type='xtuner.dataset.samplers.LengthGroupedSampler'))
+ visual_encoder_name_or_path = 'google/siglip-so400m-patch14-384'
+ visualizer = dict(
+     type='mmengine.visualization.Visualizer',
+     vis_backends=[
+         dict(type='mmengine.visualization.TensorboardVisBackend'),
+     ])
+ warmup_ratio = 0.03
+ weight_decay = 0
+ work_dir = './work_dirs/finetune'
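For orientation: with batch_size=4 and accumulative_counts=8, each GPU processes 32 samples per optimizer step, which matches the sampler's per_device_batch_size=32. The sketch below only illustrates the visual data flow implied by this config (LoRA-adapted SigLIP, then the projector, then LLM-sized tokens); the actual wiring lives in xtuner.model.LLaVAModel and the modified_xtuner sources in this commit, and the function name here is hypothetical:

```python
# Conceptual sketch only, not the training code: how the finetuned pieces line up.
import torch

@torch.no_grad()
def encode_image(pixel_values, vision_tower, projector, visual_select_layer=-2):
    """SigLIP (+ LoRA) patch features -> projector -> LLM-sized visual tokens."""
    out = vision_tower(pixel_values, output_hidden_states=True)
    feats = out.hidden_states[visual_select_layer]  # e.g. (B, 729, 1152), penultimate layer
    return projector(feats)                         # e.g. (B, 729, 2048), InternLM2 hidden size

# These visual tokens are spliced into the InternLM2 input embeddings at the image
# placeholder position before the LoRA-adapted LLM generates the answer.
```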
mmbench_results/20240307_093448/args.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_name_or_path": "internlm/internlm2-chat-1_8b",
+   "data_path": "MMBench_DEV_EN.tsv",
+   "work_dir": "./work_dirs/finetune/bench_results/",
+   "llava": "./work_dirs/finetune/hf/",
+   "visual_encoder": "google/siglip-so400m-patch14-384",
+   "visual_select_layer": -2,
+   "prompt_template": "internlm2_chat",
+   "stop_words": [
+     "<|im_end|>"
+   ],
+   "torch_dtype": "fp16",
+   "bits": null,
+   "bot_name": "BOT",
+   "offload_folder": null,
+   "max_new_tokens": 100,
+   "seed": 0,
+   "launcher": "pytorch"
+ }
mmbench_results/20240307_093841/args.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_name_or_path": "internlm/internlm2-chat-1_8b",
+   "data_path": "MMBench_DEV_EN.tsv",
+   "work_dir": "./work_dirs/finetune/bench_results/",
+   "llava": "./work_dirs/finetune/hf/",
+   "visual_encoder": "google/siglip-so400m-patch14-384",
+   "visual_select_layer": -2,
+   "prompt_template": "internlm2_chat",
+   "stop_words": [
+     "<|im_end|>"
+   ],
+   "torch_dtype": "fp16",
+   "bits": null,
+   "bot_name": "BOT",
+   "offload_folder": null,
+   "max_new_tokens": 100,
+   "seed": 0,
+   "launcher": "pytorch"
+ }
mmbench_results/20240307_094246/args.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_name_or_path": "internlm/internlm2-chat-1_8b",
+   "data_path": "MMBench_DEV_EN.tsv",
+   "work_dir": "./work_dirs/finetune/bench_results/",
+   "llava": "./work_dirs/finetune/hf/",
+   "visual_encoder": "google/siglip-so400m-patch14-384",
+   "visual_select_layer": -2,
+   "prompt_template": "internlm2_chat",
+   "stop_words": [
+     "<|im_end|>"
+   ],
+   "torch_dtype": "fp16",
+   "bits": null,
+   "bot_name": "BOT",
+   "offload_folder": null,
+   "max_new_tokens": 100,
+   "seed": 0,
+   "launcher": "pytorch"
+ }
mmbench_results/20240307_100202/args.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_name_or_path": "internlm/internlm2-chat-1_8b",
+   "data_path": "MMBench_DEV_EN.tsv",
+   "work_dir": "./work_dirs/finetune/bench_results/",
+   "llava": "./work_dirs/finetune/hf/",
+   "visual_encoder": "google/siglip-so400m-patch14-384",
+   "visual_select_layer": -2,
+   "prompt_template": "internlm2_chat",
+   "stop_words": [
+     "<|im_end|>"
+   ],
+   "torch_dtype": "fp16",
+   "bits": null,
+   "bot_name": "BOT",
+   "offload_folder": null,
+   "max_new_tokens": 100,
+   "seed": 0,
+   "launcher": "pytorch"
+ }
mmbench_results/20240307_100202/mmbench_result.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "Average": 0.6348797250859106,
+   "AR": 0.6482412060301508,
+   "CP": 0.7668918918918919,
+   "FP-C": 0.5804195804195804,
+   "FP-S": 0.6450511945392492,
+   "LR": 0.3728813559322034,
+   "RR": 0.5826086956521739
+ }
mmbench_results/20240307_100202/mmbench_result.xlsx ADDED
Binary file (365 kB).
 
mmbench_results/20240307_100541/args.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_name_or_path": "internlm/internlm2-chat-1_8b",
+   "data_path": "MMBench_TEST_EN.tsv",
+   "work_dir": "./work_dirs/finetune/bench_results/",
+   "llava": "./work_dirs/finetune/hf/",
+   "visual_encoder": "google/siglip-so400m-patch14-384",
+   "visual_select_layer": -2,
+   "prompt_template": "internlm2_chat",
+   "stop_words": [
+     "<|im_end|>"
+   ],
+   "torch_dtype": "fp16",
+   "bits": null,
+   "bot_name": "BOT",
+   "offload_folder": null,
+   "max_new_tokens": 100,
+   "seed": 0,
+   "launcher": "pytorch"
+ }
mmbench_results/20240307_100541/mmbench_result.xlsx ADDED
Binary file (546 kB).
 
mmbench_results/20240307_101151/args.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_name_or_path": "internlm/internlm2-chat-1_8b",
+   "data_path": "MMBench_DEV_CN.tsv",
+   "work_dir": "./work_dirs/finetune/bench_results/",
+   "llava": "./work_dirs/finetune/hf/",
+   "visual_encoder": "google/siglip-so400m-patch14-384",
+   "visual_select_layer": -2,
+   "prompt_template": "internlm2_chat",
+   "stop_words": [
+     "<|im_end|>"
+   ],
+   "torch_dtype": "fp16",
+   "bits": null,
+   "bot_name": "BOT",
+   "offload_folder": null,
+   "max_new_tokens": 100,
+   "seed": 0,
+   "launcher": "pytorch"
+ }
mmbench_results/20240307_101151/mmbench_result.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "Average": 0.6288659793814433,
+   "AR": 0.6683417085427136,
+   "CP": 0.7567567567567568,
+   "FP-C": 0.5944055944055944,
+   "FP-S": 0.621160409556314,
+   "LR": 0.3474576271186441,
+   "RR": 0.5826086956521739
+ }
mmbench_results/20240307_101151/mmbench_result.xlsx ADDED
Binary file (428 kB).
 
mmbench_results/20240307_101718/args.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_name_or_path": "internlm/internlm2-chat-1_8b",
+   "data_path": "MMBench_TEST_CN.tsv",
+   "work_dir": "./work_dirs/finetune/bench_results/",
+   "llava": "./work_dirs/finetune/hf/",
+   "visual_encoder": "google/siglip-so400m-patch14-384",
+   "visual_select_layer": -2,
+   "prompt_template": "internlm2_chat",
+   "stop_words": [
+     "<|im_end|>"
+   ],
+   "torch_dtype": "fp16",
+   "bits": null,
+   "bot_name": "BOT",
+   "offload_folder": null,
+   "max_new_tokens": 100,
+   "seed": 0,
+   "launcher": "pytorch"
+ }
mmbench_results/20240307_101718/mmbench_result.xlsx ADDED
Binary file (609 kB).
 
mmbench_results/20240307_102207/args.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_name_or_path": "internlm/internlm2-chat-1_8b",
+   "data_path": "CCBench.tsv",
+   "work_dir": "./work_dirs/finetune/bench_results/",
+   "llava": "./work_dirs/finetune/hf/",
+   "visual_encoder": "google/siglip-so400m-patch14-384",
+   "visual_select_layer": -2,
+   "prompt_template": "internlm2_chat",
+   "stop_words": [
+     "<|im_end|>"
+   ],
+   "torch_dtype": "fp16",
+   "bits": null,
+   "bot_name": "BOT",
+   "offload_folder": null,
+   "max_new_tokens": 100,
+   "seed": 0,
+   "launcher": "pytorch"
+ }
mmbench_results/20240307_102207/mmbench_result.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "Average": 0.3627450980392157,
+   "Calligraphy Painting": 0.43859649122807015,
+   "Cultural Relic": 0.30927835051546393,
+   "Food & Clothes": 0.4434782608695652,
+   "Historical Figure": 0.11428571428571428,
+   "Scenery & Building": 0.28421052631578947,
+   "Sketch Reasoning": 0.6,
+   "Traditional Show": 0.3181818181818182
+ }
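Only the DEV and CCBench runs carry a local mmbench_result.json; the TEST runs above produce just the submission .xlsx. A small sketch (repo-relative paths, purely illustrative) for collecting the recorded scores into one overview:

```python
# Minimal sketch: tabulate the per-run scores stored under mmbench_results/.
import json
from pathlib import Path

rows = []
for result in sorted(Path("mmbench_results").glob("*/mmbench_result.json")):
    args = json.loads((result.parent / "args.json").read_text())
    scores = json.loads(result.read_text())
    rows.append((args["data_path"], scores["Average"]))

for data_path, average in rows:
    print(f"{data_path:>22}  average={average:.4f}")
```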
mmbench_results/20240307_102207/mmbench_result.xlsx ADDED
Binary file (115 kB).
 
modified_transformers/src/transformers/models/siglip/modeling_siglip.py ADDED
@@ -0,0 +1,1299 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 Google AI and The HuggingFace Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """ PyTorch Siglip model."""
16
+
17
+
18
+ import math
19
+ import warnings
20
+ from dataclasses import dataclass
21
+ from typing import Any, Optional, Tuple, Union
22
+
23
+ import numpy as np
24
+ import torch
25
+ import torch.utils.checkpoint
26
+ from torch import nn
27
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
28
+ from torch.nn.init import _calculate_fan_in_and_fan_out
29
+
30
+ from ...activations import ACT2FN
31
+ from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
32
+ from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, ImageClassifierOutput
33
+ from ...modeling_utils import PreTrainedModel
34
+ from ...utils import (
35
+ ModelOutput,
36
+ add_code_sample_docstrings,
37
+ add_start_docstrings,
38
+ add_start_docstrings_to_model_forward,
39
+ logging,
40
+ replace_return_docstrings,
41
+ )
42
+ from .configuration_siglip import SiglipConfig, SiglipTextConfig, SiglipVisionConfig
43
+
44
+
45
+ logger = logging.get_logger(__name__)
46
+
47
+ # General docstring
48
+ _CONFIG_FOR_DOC = "SiglipConfig"
49
+ _CHECKPOINT_FOR_DOC = "google/siglip-base-patch16-224"
50
+
51
+ # Image classification docstring
52
+ _IMAGE_CLASS_CHECKPOINT = "google/siglip-base-patch16-224"
53
+ _IMAGE_CLASS_EXPECTED_OUTPUT = "LABEL_1"
54
+
55
+
56
+ SIGLIP_PRETRAINED_MODEL_ARCHIVE_LIST = [
57
+ "google/siglip-base-patch16-224",
58
+ # See all SigLIP models at https://huggingface.co/models?filter=siglip
59
+ ]
60
+
61
+
62
+ def _trunc_normal_(tensor, mean, std, a, b):
63
+ # Cut & paste from PyTorch official master until it's in a few official releases - RW
64
+ # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
65
+ def norm_cdf(x):
66
+ # Computes standard normal cumulative distribution function
67
+ return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0
68
+
69
+ if (mean < a - 2 * std) or (mean > b + 2 * std):
70
+ warnings.warn(
71
+ "mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
72
+ "The distribution of values may be incorrect.",
73
+ stacklevel=2,
74
+ )
75
+
76
+ # Values are generated by using a truncated uniform distribution and
77
+ # then using the inverse CDF for the normal distribution.
78
+ # Get upper and lower cdf values
79
+ l = norm_cdf((a - mean) / std)
80
+ u = norm_cdf((b - mean) / std)
81
+
82
+ # Uniformly fill tensor with values from [l, u], then translate to
83
+ # [2l-1, 2u-1].
84
+ tensor.uniform_(2 * l - 1, 2 * u - 1)
85
+
86
+ # Use inverse cdf transform for normal distribution to get truncated
87
+ # standard normal
88
+ tensor.erfinv_()
89
+
90
+ # Transform to proper mean, std
91
+ tensor.mul_(std * math.sqrt(2.0))
92
+ tensor.add_(mean)
93
+
94
+ # Clamp to ensure it's in the proper range
95
+ tensor.clamp_(min=a, max=b)
96
+
97
+
98
+ def trunc_normal_tf_(
99
+ tensor: torch.Tensor, mean: float = 0.0, std: float = 1.0, a: float = -2.0, b: float = 2.0
100
+ ) -> torch.Tensor:
101
+ """Fills the input Tensor with values drawn from a truncated
102
+ normal distribution. The values are effectively drawn from the
103
+ normal distribution :math:`\\mathcal{N}(\text{mean}, \text{std}^2)`
104
+ with values outside :math:`[a, b]` redrawn until they are within
105
+ the bounds. The method used for generating the random values works
106
+ best when :math:`a \\leq \text{mean} \\leq b`.
107
+
108
+ NOTE: this 'tf' variant behaves closer to Tensorflow / JAX impl where the
109
+ bounds [a, b] are applied when sampling the normal distribution with mean=0, std=1.0
110
+ and the result is subsquently scaled and shifted by the mean and std args.
111
+
112
+ Args:
113
+ tensor: an n-dimensional `torch.Tensor`
114
+ mean: the mean of the normal distribution
115
+ std: the standard deviation of the normal distribution
116
+ a: the minimum cutoff value
117
+ b: the maximum cutoff value
118
+ """
119
+ with torch.no_grad():
120
+ _trunc_normal_(tensor, 0, 1.0, a, b)
121
+ tensor.mul_(std).add_(mean)
122
+
123
+
124
+ def variance_scaling_(tensor, scale=1.0, mode="fan_in", distribution="normal"):
125
+ fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
126
+ if mode == "fan_in":
127
+ denom = fan_in
128
+ elif mode == "fan_out":
129
+ denom = fan_out
130
+ elif mode == "fan_avg":
131
+ denom = (fan_in + fan_out) / 2
132
+
133
+ variance = scale / denom
134
+
135
+ if distribution == "truncated_normal":
136
+ # constant is stddev of standard normal truncated to (-2, 2)
137
+ trunc_normal_tf_(tensor, std=math.sqrt(variance) / 0.87962566103423978)
138
+ elif distribution == "normal":
139
+ with torch.no_grad():
140
+ tensor.normal_(std=math.sqrt(variance))
141
+ elif distribution == "uniform":
142
+ bound = math.sqrt(3 * variance)
143
+ with torch.no_grad():
144
+ tensor.uniform_(-bound, bound)
145
+ else:
146
+ raise ValueError(f"invalid distribution {distribution}")
147
+
148
+
149
+ def lecun_normal_(tensor):
150
+ variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")
151
+
152
+
153
+ def default_flax_embed_init(tensor):
154
+ variance_scaling_(tensor, mode="fan_in", distribution="normal")
155
+
156
+
157
+ @dataclass
158
+ # Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->Siglip
159
+ class SiglipVisionModelOutput(ModelOutput):
160
+ """
161
+ Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
162
+
163
+ Args:
164
+ image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
165
+ The image embeddings obtained by applying the projection layer to the pooler_output.
166
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
167
+ Sequence of hidden-states at the output of the last layer of the model.
168
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
169
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
170
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
171
+
172
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
173
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
174
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
175
+ sequence_length)`.
176
+
177
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
178
+ heads.
179
+ """
180
+
181
+ image_embeds: Optional[torch.FloatTensor] = None
182
+ last_hidden_state: torch.FloatTensor = None
183
+ hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
184
+ attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
185
+
186
+
187
+ @dataclass
188
+ # Copied from transformers.models.clip.modeling_clip.CLIPTextModelOutput with CLIP->Siglip
189
+ class SiglipTextModelOutput(ModelOutput):
190
+ """
191
+ Base class for text model's outputs that also contains a pooling of the last hidden states.
192
+
193
+ Args:
194
+ text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
195
+ The text embeddings obtained by applying the projection layer to the pooler_output.
196
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
197
+ Sequence of hidden-states at the output of the last layer of the model.
198
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
199
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
200
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
201
+
202
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
203
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
204
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
205
+ sequence_length)`.
206
+
207
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
208
+ heads.
209
+ """
210
+
211
+ text_embeds: Optional[torch.FloatTensor] = None
212
+ last_hidden_state: torch.FloatTensor = None
213
+ hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
214
+ attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
215
+
216
+
217
+ @dataclass
218
+ # Copied from transformers.models.clip.modeling_clip.CLIPOutput with CLIP->Siglip
219
+ class SiglipOutput(ModelOutput):
220
+ """
221
+ Args:
222
+ loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
223
+ Contrastive loss for image-text similarity.
224
+ logits_per_image:(`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
225
+ The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
226
+ similarity scores.
227
+ logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
228
+ The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
229
+ similarity scores.
230
+ text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
231
+ The text embeddings obtained by applying the projection layer to the pooled output of [`SiglipTextModel`].
232
+ image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
233
+ The image embeddings obtained by applying the projection layer to the pooled output of [`SiglipVisionModel`].
234
+ text_model_output(`BaseModelOutputWithPooling`):
235
+ The output of the [`SiglipTextModel`].
236
+ vision_model_output(`BaseModelOutputWithPooling`):
237
+ The output of the [`SiglipVisionModel`].
238
+ """
239
+
240
+ loss: Optional[torch.FloatTensor] = None
241
+ logits_per_image: torch.FloatTensor = None
242
+ logits_per_text: torch.FloatTensor = None
243
+ text_embeds: torch.FloatTensor = None
244
+ image_embeds: torch.FloatTensor = None
245
+ text_model_output: BaseModelOutputWithPooling = None
246
+ vision_model_output: BaseModelOutputWithPooling = None
247
+
248
+ def to_tuple(self) -> Tuple[Any]:
249
+ return tuple(
250
+ self[k] if k not in ["text_model_output", "vision_model_output"] else getattr(self, k).to_tuple()
251
+ for k in self.keys()
252
+ )
253
+
254
+
255
+ class SiglipVisionEmbeddings(nn.Module):
256
+ def __init__(self, config: SiglipVisionConfig):
257
+ super().__init__()
258
+ self.config = config
259
+ self.embed_dim = config.hidden_size
260
+ self.image_size = config.image_size
261
+ self.patch_size = config.patch_size
262
+
263
+ self.patch_embedding = nn.Conv2d(
264
+ in_channels=config.num_channels,
265
+ out_channels=self.embed_dim,
266
+ kernel_size=self.patch_size,
267
+ stride=self.patch_size,
268
+ padding="valid",
269
+ )
270
+
271
+ self.num_patches = (self.image_size // self.patch_size) ** 2
272
+ self.num_positions = self.num_patches
273
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
274
+ self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)
275
+
276
+ def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
277
+ target_dtype = self.patch_embedding.weight.dtype
278
+ patch_embeds = self.patch_embedding(pixel_values.to(target_dtype)) # shape = [*, width, grid, grid]
279
+ embeddings = patch_embeds.flatten(2).transpose(1, 2)
280
+
281
+ embeddings = embeddings + self.position_embedding(self.position_ids)
282
+ return embeddings
283
+
284
+
285
+ # Copied from transformers.models.clip.modeling_clip.CLIPTextEmbeddings with CLIP->Siglip
286
+ class SiglipTextEmbeddings(nn.Module):
287
+ def __init__(self, config: SiglipTextConfig):
288
+ super().__init__()
289
+ embed_dim = config.hidden_size
290
+
291
+ self.token_embedding = nn.Embedding(config.vocab_size, embed_dim)
292
+ self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim)
293
+
294
+ # position_ids (1, len position emb) is contiguous in memory and exported when serialized
295
+ self.register_buffer(
296
+ "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
297
+ )
298
+
299
+ def forward(
300
+ self,
301
+ input_ids: Optional[torch.LongTensor] = None,
302
+ position_ids: Optional[torch.LongTensor] = None,
303
+ inputs_embeds: Optional[torch.FloatTensor] = None,
304
+ ) -> torch.Tensor:
305
+ seq_length = input_ids.shape[-1] if input_ids is not None else inputs_embeds.shape[-2]
306
+
307
+ if position_ids is None:
308
+ position_ids = self.position_ids[:, :seq_length]
309
+
310
+ if inputs_embeds is None:
311
+ inputs_embeds = self.token_embedding(input_ids)
312
+
313
+ position_embeddings = self.position_embedding(position_ids)
314
+ embeddings = inputs_embeds + position_embeddings
315
+
316
+ return embeddings
317
+
318
+
319
+ class SiglipAttention(nn.Module):
320
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
321
+
322
+ # Copied from transformers.models.clip.modeling_clip.CLIPAttention.__init__
323
+ def __init__(self, config):
324
+ super().__init__()
325
+ self.config = config
326
+ self.embed_dim = config.hidden_size
327
+ self.num_heads = config.num_attention_heads
328
+ self.head_dim = self.embed_dim // self.num_heads
329
+ if self.head_dim * self.num_heads != self.embed_dim:
330
+ raise ValueError(
331
+ f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
332
+ f" {self.num_heads})."
333
+ )
334
+ self.scale = self.head_dim**-0.5
335
+ self.dropout = config.attention_dropout
336
+
337
+ self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
338
+ self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
339
+ self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
340
+ self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
341
+
342
+ def forward(
343
+ self,
344
+ hidden_states: torch.Tensor,
345
+ attention_mask: Optional[torch.Tensor] = None,
346
+ output_attentions: Optional[bool] = False,
347
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
348
+ """Input shape: Batch x Time x Channel"""
349
+
350
+ batch_size, q_len, _ = hidden_states.size()
351
+
352
+ query_states = self.q_proj(hidden_states)
353
+ key_states = self.k_proj(hidden_states)
354
+ value_states = self.v_proj(hidden_states)
355
+
356
+ query_states = query_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
357
+ key_states = key_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
358
+ value_states = value_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
359
+
360
+ k_v_seq_len = key_states.shape[-2]
361
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scale
362
+
363
+ if attn_weights.size() != (batch_size, self.num_heads, q_len, k_v_seq_len):
364
+ raise ValueError(
365
+ f"Attention weights should be of size {(batch_size, self.num_heads, q_len, k_v_seq_len)}, but is"
366
+ f" {attn_weights.size()}"
367
+ )
368
+
369
+ if attention_mask is not None:
370
+ if attention_mask.size() != (batch_size, 1, q_len, k_v_seq_len):
371
+ raise ValueError(
372
+ f"Attention mask should be of size {(batch_size, 1, q_len, k_v_seq_len)}, but is {attention_mask.size()}"
373
+ )
374
+ attn_weights = attn_weights + attention_mask
375
+
376
+ # upcast attention to fp32
377
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
378
+ attn_weights = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
379
+ attn_output = torch.matmul(attn_weights, value_states)
380
+
381
+ if attn_output.size() != (batch_size, self.num_heads, q_len, self.head_dim):
382
+ raise ValueError(
383
+ f"`attn_output` should be of size {(batch_size, self.num_heads, q_len, self.head_dim)}, but is"
384
+ f" {attn_output.size()}"
385
+ )
386
+
387
+ attn_output = attn_output.transpose(1, 2).contiguous()
388
+ attn_output = attn_output.reshape(batch_size, q_len, self.embed_dim)
389
+
390
+ attn_output = self.out_proj(attn_output)
391
+
392
+ return attn_output, attn_weights
393
+
394
+
395
+ # Copied from transformers.models.clip.modeling_clip.CLIPMLP with CLIP->Siglip
396
+ class SiglipMLP(nn.Module):
397
+ def __init__(self, config):
398
+ super().__init__()
399
+ self.config = config
400
+ self.activation_fn = ACT2FN[config.hidden_act]
401
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
402
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
403
+
404
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
405
+ hidden_states = self.fc1(hidden_states)
406
+ hidden_states = self.activation_fn(hidden_states)
407
+ hidden_states = self.fc2(hidden_states)
408
+ return hidden_states
409
+
410
+
411
+ # Copied from transformers.models.clip.modeling_clip.CLIPEncoderLayer with CLIP->Siglip
412
+ class SiglipEncoderLayer(nn.Module):
413
+ def __init__(self, config: SiglipConfig):
414
+ super().__init__()
415
+ self.embed_dim = config.hidden_size
416
+ self.self_attn = SiglipAttention(config)
417
+ self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
418
+ self.mlp = SiglipMLP(config)
419
+ self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
420
+
421
+ # Ignore copy
422
+ def forward(
423
+ self,
424
+ hidden_states: torch.Tensor,
425
+ attention_mask: torch.Tensor,
426
+ output_attentions: Optional[bool] = False,
427
+ ) -> Tuple[torch.FloatTensor]:
428
+ """
429
+ Args:
430
+ hidden_states (`torch.FloatTensor`):
431
+ Input to the layer of shape `(batch, seq_len, embed_dim)`.
432
+ attention_mask (`torch.FloatTensor`):
433
+ Attention mask of shape `(batch, 1, q_len, k_v_seq_len)` where padding elements are indicated by very large negative values.
434
+ output_attentions (`bool`, *optional*, defaults to `False`):
435
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
436
+ returned tensors for more detail.
437
+ """
438
+ residual = hidden_states
439
+
440
+ hidden_states = self.layer_norm1(hidden_states)
441
+ hidden_states, attn_weights = self.self_attn(
442
+ hidden_states=hidden_states,
443
+ attention_mask=attention_mask,
444
+ output_attentions=output_attentions,
445
+ )
446
+ hidden_states = residual + hidden_states
447
+
448
+ residual = hidden_states
449
+ hidden_states = self.layer_norm2(hidden_states)
450
+ hidden_states = self.mlp(hidden_states)
451
+ hidden_states = residual + hidden_states
452
+
453
+ outputs = (hidden_states,)
454
+
455
+ if output_attentions:
456
+ outputs += (attn_weights,)
457
+
458
+ return outputs
459
+
460
+
461
+ class SiglipPreTrainedModel(PreTrainedModel):
462
+ """
463
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
464
+ models.
465
+ """
466
+
467
+ config_class = SiglipConfig
468
+ base_model_prefix = "siglip"
469
+ supports_gradient_checkpointing = True
470
+
471
+ def _init_weights(self, module):
472
+ """Initialize the weights"""
473
+ if isinstance(module, SiglipVisionEmbeddings):
474
+ width = (
475
+ self.config.vision_config.hidden_size
476
+ if isinstance(self.config, SiglipConfig)
477
+ else self.config.hidden_size
478
+ )
479
+ nn.init.normal_(module.position_embedding.weight, std=1 / np.sqrt(width))
480
+ elif isinstance(module, nn.Embedding):
481
+ default_flax_embed_init(module.weight)
482
+ elif isinstance(module, SiglipAttention):
483
+ nn.init.xavier_uniform_(module.q_proj.weight)
484
+ nn.init.xavier_uniform_(module.k_proj.weight)
485
+ nn.init.xavier_uniform_(module.v_proj.weight)
486
+ nn.init.xavier_uniform_(module.out_proj.weight)
487
+ nn.init.zeros_(module.q_proj.bias)
488
+ nn.init.zeros_(module.k_proj.bias)
489
+ nn.init.zeros_(module.v_proj.bias)
490
+ nn.init.zeros_(module.out_proj.bias)
491
+ elif isinstance(module, SiglipMLP):
492
+ nn.init.xavier_uniform_(module.fc1.weight)
493
+ nn.init.xavier_uniform_(module.fc2.weight)
494
+ nn.init.normal_(module.fc1.bias, std=1e-6)
495
+ nn.init.normal_(module.fc2.bias, std=1e-6)
496
+ elif isinstance(module, SiglipMultiheadAttentionPoolingHead):
497
+ nn.init.xavier_uniform_(module.probe.data)
498
+ nn.init.xavier_uniform_(module.attention.in_proj_weight.data)
499
+ nn.init.zeros_(module.attention.in_proj_bias.data)
500
+ elif isinstance(module, SiglipModel):
501
+ logit_scale_init = torch.log(torch.tensor(1.0))
502
+ module.logit_scale.data.fill_(logit_scale_init)
503
+ module.logit_bias.data.zero_()
504
+ elif isinstance(module, (nn.Linear, nn.Conv2d)):
505
+ lecun_normal_(module.weight)
506
+ if module.bias is not None:
507
+ nn.init.zeros_(module.bias)
508
+ elif isinstance(module, nn.LayerNorm):
509
+ module.bias.data.zero_()
510
+ module.weight.data.fill_(1.0)
511
+
512
+
513
+ SIGLIP_START_DOCSTRING = r"""
514
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
515
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
516
+ etc.)
517
+
518
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
519
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
520
+ and behavior.
521
+
522
+ Parameters:
523
+ config ([`SiglipConfig`]): Model configuration class with all the parameters of the model.
524
+ Initializing with a config file does not load the weights associated with the model, only the
525
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
526
+ """
527
+
528
+ SIGLIP_TEXT_INPUTS_DOCSTRING = r"""
529
+ Args:
530
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
531
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
532
+ it.
533
+
534
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
535
+ [`PreTrainedTokenizer.__call__`] for details.
536
+
537
+ [What are input IDs?](../glossary#input-ids)
538
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
539
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
540
+
541
+ - 1 for tokens that are **not masked**,
542
+ - 0 for tokens that are **masked**.
543
+
544
+ [What are attention masks?](../glossary#attention-mask)
545
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
546
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
547
+ config.max_position_embeddings - 1]`.
548
+
549
+ [What are position IDs?](../glossary#position-ids)
550
+ output_attentions (`bool`, *optional*):
551
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
552
+ tensors for more detail.
553
+ output_hidden_states (`bool`, *optional*):
554
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
555
+ more detail.
556
+ return_dict (`bool`, *optional*):
557
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
558
+ """
559
+
560
+ SIGLIP_VISION_INPUTS_DOCSTRING = r"""
561
+ Args:
562
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
563
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
564
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
565
+ output_attentions (`bool`, *optional*):
566
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
567
+ tensors for more detail.
568
+ output_hidden_states (`bool`, *optional*):
569
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
570
+ more detail.
571
+ return_dict (`bool`, *optional*):
572
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
573
+ """
574
+
575
+ SIGLIP_INPUTS_DOCSTRING = r"""
576
+ Args:
577
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
578
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
579
+ it.
580
+
581
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
582
+ [`PreTrainedTokenizer.__call__`] for details.
583
+
584
+ [What are input IDs?](../glossary#input-ids)
585
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
586
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
587
+
588
+ - 1 for tokens that are **not masked**,
589
+ - 0 for tokens that are **masked**.
590
+
591
+ [What are attention masks?](../glossary#attention-mask)
592
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
593
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
594
+ config.max_position_embeddings - 1]`.
595
+
596
+ [What are position IDs?](../glossary#position-ids)
597
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
598
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
599
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
600
+ return_loss (`bool`, *optional*):
601
+ Whether or not to return the contrastive loss.
602
+ output_attentions (`bool`, *optional*):
603
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
604
+ tensors for more detail.
605
+ output_hidden_states (`bool`, *optional*):
606
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
607
+ more detail.
608
+ return_dict (`bool`, *optional*):
609
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
610
+ """
611
+
612
+
613
+ # Copied from transformers.models.clip.modeling_clip.CLIPEncoder with CLIP->Siglip
614
+ class SiglipEncoder(nn.Module):
615
+ """
616
+ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
617
+ [`SiglipEncoderLayer`].
618
+
619
+ Args:
620
+ config: SiglipConfig
621
+ """
622
+
623
+ def __init__(self, config: SiglipConfig):
624
+ super().__init__()
625
+ self.config = config
626
+ self.layers = nn.ModuleList([SiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)])
627
+ self.gradient_checkpointing = False
628
+
629
+ # Ignore copy
630
+ def forward(
631
+ self,
632
+ inputs_embeds,
633
+ attention_mask: Optional[torch.Tensor] = None,
634
+ output_attentions: Optional[bool] = None,
635
+ output_hidden_states: Optional[bool] = None,
636
+ return_dict: Optional[bool] = None,
637
+ ) -> Union[Tuple, BaseModelOutput]:
638
+ r"""
639
+ Args:
640
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
641
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
642
+ This is useful if you want more control over how to convert `input_ids` indices into associated vectors
643
+ than the model's internal embedding lookup matrix.
644
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
645
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
646
+
647
+ - 1 for tokens that are **not masked**,
648
+ - 0 for tokens that are **masked**.
649
+
650
+ [What are attention masks?](../glossary#attention-mask)
651
+ output_attentions (`bool`, *optional*):
652
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
653
+ returned tensors for more detail.
654
+ output_hidden_states (`bool`, *optional*):
655
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
656
+ for more detail.
657
+ return_dict (`bool`, *optional*):
658
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
659
+ """
660
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
661
+ output_hidden_states = (
662
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
663
+ )
664
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
665
+
666
+ encoder_states = () if output_hidden_states else None
667
+ all_attentions = () if output_attentions else None
668
+
669
+ hidden_states = inputs_embeds
670
+ for encoder_layer in self.layers:
671
+ if output_hidden_states:
672
+ encoder_states = encoder_states + (hidden_states,)
673
+ if self.gradient_checkpointing and self.training:
674
+ layer_outputs = self._gradient_checkpointing_func(
675
+ encoder_layer.__call__,
676
+ hidden_states,
677
+ attention_mask,
678
+ output_attentions,
679
+ )
680
+ else:
681
+ layer_outputs = encoder_layer(
682
+ hidden_states,
683
+ attention_mask,
684
+ output_attentions=output_attentions,
685
+ )
686
+
687
+ hidden_states = layer_outputs[0]
688
+
689
+ if output_attentions:
690
+ all_attentions = all_attentions + (layer_outputs[1],)
691
+
692
+ if output_hidden_states:
693
+ encoder_states = encoder_states + (hidden_states,)
694
+
695
+ if not return_dict:
696
+ return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
697
+ return BaseModelOutput(
698
+ last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions
699
+ )
700
+
701
+
702
+ class SiglipTextTransformer(nn.Module):
703
+ def __init__(self, config: SiglipTextConfig):
704
+ super().__init__()
705
+ self.config = config
706
+ embed_dim = config.hidden_size
707
+ self.embeddings = SiglipTextEmbeddings(config)
708
+ self.encoder = SiglipEncoder(config)
709
+ self.final_layer_norm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
710
+
711
+ self.head = nn.Linear(embed_dim, embed_dim)
712
+
713
+ @add_start_docstrings_to_model_forward(SIGLIP_TEXT_INPUTS_DOCSTRING)
714
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipTextConfig)
715
+ def forward(
716
+ self,
717
+ input_ids: Optional[torch.Tensor] = None,
718
+ attention_mask: Optional[torch.Tensor] = None,
719
+ position_ids: Optional[torch.Tensor] = None,
720
+ output_attentions: Optional[bool] = None,
721
+ output_hidden_states: Optional[bool] = None,
722
+ return_dict: Optional[bool] = None,
723
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
724
+ r"""
725
+ Returns:
726
+
727
+ """
728
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
729
+ output_hidden_states = (
730
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
731
+ )
732
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
733
+
734
+ if input_ids is None:
735
+ raise ValueError("You have to specify input_ids")
736
+
737
+ input_shape = input_ids.size()
738
+ input_ids = input_ids.view(-1, input_shape[-1])
739
+
740
+ hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
741
+
742
+ # note: SigLIP's text model does not use a causal mask, unlike the original CLIP model.
743
+ # expand attention_mask
744
+ if attention_mask is not None:
745
+ # [batch_size, seq_len] -> [batch_size, 1, tgt_seq_len, src_seq_len]
746
+ attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
747
+
748
+ encoder_outputs = self.encoder(
749
+ inputs_embeds=hidden_states,
750
+ attention_mask=attention_mask,
751
+ output_attentions=output_attentions,
752
+ output_hidden_states=output_hidden_states,
753
+ return_dict=return_dict,
754
+ )
755
+
756
+ last_hidden_state = encoder_outputs[0]
757
+ last_hidden_state = self.final_layer_norm(last_hidden_state)
758
+
759
+ # Assuming "sticky" EOS tokenization, last token is always EOS.
760
+ pooled_output = last_hidden_state[:, -1, :]
761
+ pooled_output = self.head(pooled_output)
762
+
763
+ if not return_dict:
764
+ return (last_hidden_state, pooled_output) + encoder_outputs[1:]
765
+
766
+ return BaseModelOutputWithPooling(
767
+ last_hidden_state=last_hidden_state,
768
+ pooler_output=pooled_output,
769
+ hidden_states=encoder_outputs.hidden_states,
770
+ attentions=encoder_outputs.attentions,
771
+ )
772
+
773
+
774
+ @add_start_docstrings(
775
+ """The text model from SigLIP without any head or projection on top.""",
776
+ SIGLIP_START_DOCSTRING,
777
+ )
778
+ class SiglipTextModel(SiglipPreTrainedModel):
779
+ config_class = SiglipTextConfig
780
+
781
+ _no_split_modules = ["SiglipTextEmbeddings", "SiglipEncoderLayer"]
782
+
783
+ def __init__(self, config: SiglipTextConfig):
784
+ super().__init__(config)
785
+ self.text_model = SiglipTextTransformer(config)
786
+ # Initialize weights and apply final processing
787
+ self.post_init()
788
+
789
+ def get_input_embeddings(self) -> nn.Module:
790
+ return self.text_model.embeddings.token_embedding
791
+
792
+ def set_input_embeddings(self, value):
793
+ self.text_model.embeddings.token_embedding = value
794
+
795
+ @add_start_docstrings_to_model_forward(SIGLIP_TEXT_INPUTS_DOCSTRING)
796
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipTextConfig)
797
+ def forward(
798
+ self,
799
+ input_ids: Optional[torch.Tensor] = None,
800
+ attention_mask: Optional[torch.Tensor] = None,
801
+ position_ids: Optional[torch.Tensor] = None,
802
+ output_attentions: Optional[bool] = None,
803
+ output_hidden_states: Optional[bool] = None,
804
+ return_dict: Optional[bool] = None,
805
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
806
+ r"""
807
+ Returns:
808
+
809
+ Examples:
810
+
811
+ ```python
812
+ >>> from transformers import AutoTokenizer, SiglipTextModel
813
+
814
+ >>> model = SiglipTextModel.from_pretrained("google/siglip-base-patch16-224")
815
+ >>> tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224")
816
+
817
+ >>> # important: make sure to set padding="max_length" as that's how the model was trained
818
+ >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding="max_length", return_tensors="pt")
819
+
820
+ >>> outputs = model(**inputs)
821
+ >>> last_hidden_state = outputs.last_hidden_state
822
+ >>> pooled_output = outputs.pooler_output # pooled (EOS token) states
823
+ ```"""
824
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
825
+
826
+ return self.text_model(
827
+ input_ids=input_ids,
828
+ attention_mask=attention_mask,
829
+ position_ids=position_ids,
830
+ output_attentions=output_attentions,
831
+ output_hidden_states=output_hidden_states,
832
+ return_dict=return_dict,
833
+ )
834
+
835
+
836
+ class SiglipVisionTransformer(nn.Module):
837
+ def __init__(self, config: SiglipVisionConfig):
838
+ super().__init__()
839
+ self.config = config
840
+ embed_dim = config.hidden_size
841
+
842
+ self.embeddings = SiglipVisionEmbeddings(config)
843
+ self.encoder = SiglipEncoder(config)
844
+ self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
845
+ self.head = SiglipMultiheadAttentionPoolingHead(config)
846
+
847
+ @add_start_docstrings_to_model_forward(SIGLIP_VISION_INPUTS_DOCSTRING)
848
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipVisionConfig)
849
+ def forward(
850
+ self,
851
+ pixel_values,
852
+ output_attentions: Optional[bool] = None,
853
+ output_hidden_states: Optional[bool] = None,
854
+ return_dict: Optional[bool] = None,
855
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
856
+ r"""
857
+ Returns:
858
+
859
+ """
860
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
861
+ output_hidden_states = (
862
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
863
+ )
864
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
865
+
866
+ hidden_states = self.embeddings(pixel_values)
867
+
868
+ encoder_outputs = self.encoder(
869
+ inputs_embeds=hidden_states,
870
+ output_attentions=output_attentions,
871
+ output_hidden_states=output_hidden_states,
872
+ return_dict=return_dict,
873
+ )
874
+
875
+ last_hidden_state = encoder_outputs[0]
876
+ last_hidden_state = self.post_layernorm(last_hidden_state)
877
+
878
+ pooled_output = self.head(last_hidden_state)
879
+
880
+ if not return_dict:
881
+ return (last_hidden_state, pooled_output) + encoder_outputs[1:]
882
+
883
+ return BaseModelOutputWithPooling(
884
+ last_hidden_state=last_hidden_state,
885
+ pooler_output=pooled_output,
886
+ hidden_states=encoder_outputs.hidden_states,
887
+ attentions=encoder_outputs.attentions,
888
+ )
889
+
890
+
891
+ class SiglipMultiheadAttentionPoolingHead(nn.Module):
892
+ """Multihead Attention Pooling."""
893
+
894
+ def __init__(self, config: SiglipVisionConfig):
895
+ super().__init__()
896
+
897
+ self.probe = nn.Parameter(torch.randn(1, 1, config.hidden_size))
898
+ self.attention = torch.nn.MultiheadAttention(config.hidden_size, config.num_attention_heads, batch_first=True)
899
+ self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
900
+ self.mlp = SiglipMLP(config)
901
+
902
+ def forward(self, hidden_state):
903
+ batch_size = hidden_state.shape[0]
904
+ probe = self.probe.repeat(batch_size, 1, 1)
905
+
906
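+ # a single learned probe token is used as the attention query, with the
+ # patch tokens as keys/values, pooling the sequence into one vector per image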
+ hidden_state = self.attention(probe, hidden_state, hidden_state)[0]
907
+
908
+ residual = hidden_state
909
+ hidden_state = self.layernorm(hidden_state)
910
+ hidden_state = residual + self.mlp(hidden_state)
911
+
912
+ return hidden_state[:, 0]
913
+
914
+
915
+ @add_start_docstrings(
916
+ """The vision model from SigLIP without any head or projection on top.""",
917
+ SIGLIP_START_DOCSTRING,
918
+ )
919
+ class SiglipVisionModel(SiglipPreTrainedModel):
920
+ config_class = SiglipVisionConfig
921
+ main_input_name = "pixel_values"
922
+
923
+ def __init__(self, config: SiglipVisionConfig):
924
+ super().__init__(config)
925
+
926
+ self.vision_model = SiglipVisionTransformer(config)
927
+
928
+ # Initialize weights and apply final processing
929
+ self.post_init()
930
+
931
+ def get_input_embeddings(self) -> nn.Module:
932
+ return self.vision_model.embeddings.patch_embedding
933
+
934
+ @add_start_docstrings_to_model_forward(SIGLIP_VISION_INPUTS_DOCSTRING)
935
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipVisionConfig)
936
+ def forward(
937
+ self,
938
+ pixel_values,
939
+ output_attentions: Optional[bool] = None,
940
+ output_hidden_states: Optional[bool] = None,
941
+ return_dict: Optional[bool] = None,
942
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
943
+ r"""
944
+ Returns:
945
+
946
+ Examples:
947
+
948
+ ```python
949
+ >>> from PIL import Image
950
+ >>> import requests
951
+ >>> from transformers import AutoProcessor, SiglipVisionModel
952
+
953
+ >>> model = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224")
954
+ >>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
955
+
956
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
957
+ >>> image = Image.open(requests.get(url, stream=True).raw)
958
+
959
+ >>> inputs = processor(images=image, return_tensors="pt")
960
+
961
+ >>> outputs = model(**inputs)
962
+ >>> last_hidden_state = outputs.last_hidden_state
963
+ >>> pooled_output = outputs.pooler_output # pooled features
964
+ ```"""
965
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
966
+
967
+ return self.vision_model(
968
+ pixel_values=pixel_values,
969
+ output_attentions=output_attentions,
970
+ output_hidden_states=output_hidden_states,
971
+ return_dict=return_dict,
972
+ )
973
+
974
+
975
+ @add_start_docstrings(SIGLIP_START_DOCSTRING)
976
+ class SiglipModel(SiglipPreTrainedModel):
977
+ config_class = SiglipConfig
978
+
979
+ def __init__(self, config: SiglipConfig):
980
+ super().__init__(config)
981
+
982
+ if not isinstance(config.text_config, SiglipTextConfig):
983
+ raise ValueError(
984
+ "config.text_config is expected to be of type SiglipTextConfig but is of type"
985
+ f" {type(config.text_config)}."
986
+ )
987
+
988
+ if not isinstance(config.vision_config, SiglipVisionConfig):
989
+ raise ValueError(
990
+ "config.vision_config is expected to be of type SiglipVisionConfig but is of type"
991
+ f" {type(config.vision_config)}."
992
+ )
993
+
994
+ text_config = config.text_config
995
+ vision_config = config.vision_config
996
+
997
+ self.text_model = SiglipTextTransformer(text_config)
998
+ self.vision_model = SiglipVisionTransformer(vision_config)
999
+
1000
+ self.logit_scale = nn.Parameter(torch.randn(1))
1001
+ self.logit_bias = nn.Parameter(torch.randn(1))
1002
+
1003
+ # Initialize weights and apply final processing
1004
+ self.post_init()
1005
+
1006
+ @add_start_docstrings_to_model_forward(SIGLIP_TEXT_INPUTS_DOCSTRING)
1007
+ def get_text_features(
1008
+ self,
1009
+ input_ids: Optional[torch.Tensor] = None,
1010
+ attention_mask: Optional[torch.Tensor] = None,
1011
+ position_ids: Optional[torch.Tensor] = None,
1012
+ output_attentions: Optional[bool] = None,
1013
+ output_hidden_states: Optional[bool] = None,
1014
+ return_dict: Optional[bool] = None,
1015
+ ) -> torch.FloatTensor:
1016
+ r"""
1017
+ Returns:
1018
+ text_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
1019
+ applying the projection layer to the pooled output of [`SiglipTextModel`].
1020
+
1021
+ Examples:
1022
+
1023
+ ```python
1024
+ >>> from transformers import AutoTokenizer, AutoModel
1025
+ >>> import torch
1026
+
1027
+ >>> model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
1028
+ >>> tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224")
1029
+
1030
+ >>> # important: make sure to set padding="max_length" as that's how the model was trained
1031
+ >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding="max_length", return_tensors="pt")
1032
+ >>> with torch.no_grad():
1033
+ ... text_features = model.get_text_features(**inputs)
1034
+ ```"""
1035
+ # Use SigLIP model's config for some fields (if specified) instead of those of vision & text components.
1036
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1037
+ output_hidden_states = (
1038
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1039
+ )
1040
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1041
+
1042
+ text_outputs = self.text_model(
1043
+ input_ids=input_ids,
1044
+ attention_mask=attention_mask,
1045
+ position_ids=position_ids,
1046
+ output_attentions=output_attentions,
1047
+ output_hidden_states=output_hidden_states,
1048
+ return_dict=return_dict,
1049
+ )
1050
+
1051
+ pooled_output = text_outputs[1]
1052
+
1053
+ return pooled_output
1054
+
1055
+ @add_start_docstrings_to_model_forward(SIGLIP_VISION_INPUTS_DOCSTRING)
1056
+ def get_image_features(
1057
+ self,
1058
+ pixel_values: Optional[torch.FloatTensor] = None,
1059
+ output_attentions: Optional[bool] = None,
1060
+ output_hidden_states: Optional[bool] = None,
1061
+ return_dict: Optional[bool] = None,
1062
+ ) -> torch.FloatTensor:
1063
+ r"""
1064
+ Returns:
1065
+ image_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
1066
+ applying the projection layer to the pooled output of [`SiglipVisionModel`].
1067
+
1068
+ Examples:
1069
+
1070
+ ```python
1071
+ >>> from PIL import Image
1072
+ >>> import requests
1073
+ >>> from transformers import AutoProcessor, AutoModel
1074
+ >>> import torch
1075
+
1076
+ >>> model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
1077
+ >>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
1078
+
1079
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1080
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1081
+
1082
+ >>> inputs = processor(images=image, return_tensors="pt")
1083
+
1084
+ >>> with torch.no_grad():
1085
+ ... image_features = model.get_image_features(**inputs)
1086
+ ```"""
1087
+ # Use SiglipModel's config for some fields (if specified) instead of those of vision & text components.
1088
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1089
+ output_hidden_states = (
1090
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1091
+ )
1092
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1093
+
1094
+ vision_outputs = self.vision_model(
1095
+ pixel_values=pixel_values,
1096
+ output_attentions=output_attentions,
1097
+ output_hidden_states=output_hidden_states,
1098
+ return_dict=return_dict,
1099
+ )
1100
+
1101
+ pooled_output = vision_outputs[1]
1102
+
1103
+ return pooled_output
1104
+
1105
+ @add_start_docstrings_to_model_forward(SIGLIP_INPUTS_DOCSTRING)
1106
+ @replace_return_docstrings(output_type=SiglipOutput, config_class=SiglipConfig)
1107
+ def forward(
1108
+ self,
1109
+ input_ids: Optional[torch.LongTensor] = None,
1110
+ pixel_values: Optional[torch.FloatTensor] = None,
1111
+ attention_mask: Optional[torch.Tensor] = None,
1112
+ position_ids: Optional[torch.LongTensor] = None,
1113
+ return_loss: Optional[bool] = None,
1114
+ output_attentions: Optional[bool] = None,
1115
+ output_hidden_states: Optional[bool] = None,
1116
+ return_dict: Optional[bool] = None,
1117
+ ) -> Union[Tuple, SiglipOutput]:
1118
+ r"""
1119
+ Returns:
1120
+
1121
+ Examples:
1122
+
1123
+ ```python
1124
+ >>> from PIL import Image
1125
+ >>> import requests
1126
+ >>> from transformers import AutoProcessor, AutoModel
1127
+ >>> import torch
1128
+
1129
+ >>> model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
1130
+ >>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
1131
+
1132
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1133
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1134
+
1135
+ >>> texts = ["a photo of 2 cats", "a photo of 2 dogs"]
1136
+ >>> # important: we pass `padding=max_length` since the model was trained with this
1137
+ >>> inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
1138
+
1139
+ >>> with torch.no_grad():
1140
+ ... outputs = model(**inputs)
1141
+
1142
+ >>> logits_per_image = outputs.logits_per_image
1143
+ >>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
1144
+ >>> print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
1145
+ 31.9% that image 0 is 'a photo of 2 cats'
1146
+ ```"""
1147
+ # Use SigLIP model's config for some fields (if specified) instead of those of vision & text components.
1148
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1149
+ output_hidden_states = (
1150
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1151
+ )
1152
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1153
+
1154
+ vision_outputs = self.vision_model(
1155
+ pixel_values=pixel_values,
1156
+ output_attentions=output_attentions,
1157
+ output_hidden_states=output_hidden_states,
1158
+ return_dict=return_dict,
1159
+ )
1160
+
1161
+ text_outputs = self.text_model(
1162
+ input_ids=input_ids,
1163
+ attention_mask=attention_mask,
1164
+ position_ids=position_ids,
1165
+ output_attentions=output_attentions,
1166
+ output_hidden_states=output_hidden_states,
1167
+ return_dict=return_dict,
1168
+ )
1169
+
1170
+ image_embeds = vision_outputs[1]
1171
+ text_embeds = text_outputs[1]
1172
+
1173
+ # normalized features
1174
+ image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
1175
+ text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
1176
+
1177
+ # cosine similarity as logits
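+ # (scaled by a learned temperature and shifted by a learned bias; SigLIP
+ # trains these pairwise logits with a sigmoid rather than a softmax loss)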
1178
+ logits_per_text = torch.matmul(text_embeds, image_embeds.t()) * self.logit_scale.exp() + self.logit_bias
1179
+ logits_per_image = logits_per_text.t()
1180
+
1181
+ loss = None
1182
+ if return_loss:
1183
+ raise NotImplementedError("SigLIP loss to be implemented")
1184
+
1185
+ if not return_dict:
1186
+ output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
1187
+ return ((loss,) + output) if loss is not None else output
1188
+
1189
+ return SiglipOutput(
1190
+ loss=loss,
1191
+ logits_per_image=logits_per_image,
1192
+ logits_per_text=logits_per_text,
1193
+ text_embeds=text_embeds,
1194
+ image_embeds=image_embeds,
1195
+ text_model_output=text_outputs,
1196
+ vision_model_output=vision_outputs,
1197
+ )
1198
+
1199
+
1200
+ @add_start_docstrings(
1201
+ """
1202
+ SigLIP vision encoder with an image classification head on top (a linear layer on top of the pooled final hidden states of
1203
+ the patch tokens) e.g. for ImageNet.
1204
+ """,
1205
+ SIGLIP_START_DOCSTRING,
1206
+ )
1207
+ class SiglipForImageClassification(SiglipPreTrainedModel):
1208
+ main_input_name = "pixel_values"
1209
+
1210
+ def __init__(self, config: SiglipConfig) -> None:
1211
+ super().__init__(config)
1212
+
1213
+ self.num_labels = config.num_labels
1214
+ self.vision_model = SiglipVisionTransformer(config.vision_config)
1215
+
1216
+ # Classifier head
1217
+ self.classifier = (
1218
+ nn.Linear(config.vision_config.hidden_size, config.num_labels) if config.num_labels > 0 else nn.Identity()
1219
+ )
1220
+
1221
+ # Initialize weights and apply final processing
1222
+ self.post_init()
1223
+
1224
+ @add_start_docstrings_to_model_forward(SIGLIP_INPUTS_DOCSTRING)
1225
+ @add_code_sample_docstrings(
1226
+ checkpoint=_IMAGE_CLASS_CHECKPOINT,
1227
+ output_type=ImageClassifierOutput,
1228
+ config_class=_CONFIG_FOR_DOC,
1229
+ expected_output=_IMAGE_CLASS_EXPECTED_OUTPUT,
1230
+ )
1231
+ def forward(
1232
+ self,
1233
+ pixel_values: Optional[torch.Tensor] = None,
1234
+ labels: Optional[torch.Tensor] = None,
1235
+ output_attentions: Optional[bool] = None,
1236
+ output_hidden_states: Optional[bool] = None,
1237
+ return_dict: Optional[bool] = None,
1238
+ ) -> Union[tuple, ImageClassifierOutput]:
1239
+ r"""
1240
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1241
+ Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
1242
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1243
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1244
+ """
1245
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1246
+ output_hidden_states = (
1247
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1248
+ )
1249
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1250
+
1251
+ outputs = self.vision_model(
1252
+ pixel_values,
1253
+ output_attentions=output_attentions,
1254
+ output_hidden_states=output_hidden_states,
1255
+ return_dict=return_dict,
1256
+ )
1257
+
1258
+ sequence_output = outputs[0]
1259
+
1260
+ # average pool the patch tokens
1261
+ sequence_output = torch.mean(sequence_output[:, 1:, :], dim=1)
1262
+ # apply classifier
1263
+ logits = self.classifier(sequence_output)
1264
+
1265
+ loss = None
1266
+ if labels is not None:
1267
+ # move labels to correct device to enable model parallelism
1268
+ labels = labels.to(logits.device)
1269
+ if self.config.problem_type is None:
1270
+ if self.num_labels == 1:
1271
+ self.config.problem_type = "regression"
1272
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1273
+ self.config.problem_type = "single_label_classification"
1274
+ else:
1275
+ self.config.problem_type = "multi_label_classification"
1276
+
1277
+ if self.config.problem_type == "regression":
1278
+ loss_fct = MSELoss()
1279
+ if self.num_labels == 1:
1280
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
1281
+ else:
1282
+ loss = loss_fct(logits, labels)
1283
+ elif self.config.problem_type == "single_label_classification":
1284
+ loss_fct = CrossEntropyLoss()
1285
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1286
+ elif self.config.problem_type == "multi_label_classification":
1287
+ loss_fct = BCEWithLogitsLoss()
1288
+ loss = loss_fct(logits, labels)
1289
+
1290
+ if not return_dict:
1291
+ output = (logits,) + outputs[2:]
1292
+ return ((loss,) + output) if loss is not None else output
1293
+
1294
+ return ImageClassifierOutput(
1295
+ loss=loss,
1296
+ logits=logits,
1297
+ hidden_states=outputs.hidden_states,
1298
+ attentions=outputs.attentions,
1299
+ )
modified_xtuner/xtuner/dataset/huggingface.py ADDED
@@ -0,0 +1,316 @@
1
+ # Copyright (c) OpenMMLab. All rights reserved.
2
+ import logging
3
+ import os
4
+ from datetime import timedelta
5
+ from functools import partial
6
+
7
+ import numpy as np
8
+ from datasets import DatasetDict, concatenate_datasets
9
+ from mmengine import print_log
10
+ from mmengine.config import Config, ConfigDict
11
+ from mmengine.utils.misc import get_object_from_string
12
+ from torch import distributed as dist
13
+
14
+ from xtuner.registry import BUILDER, MAP_FUNC
15
+ from .utils import Packer, encode_fn
16
+
17
+
18
+ def get_lengths(example):
19
+ return {'length': len(example['input_ids'])}
20
+
21
+
22
+ def build_origin_dataset(dataset, split):
23
+ if isinstance(dataset, DatasetDict):
24
+ if split is None:
25
+ dataset = concatenate_datasets(dataset.values())
26
+ else:
27
+ dataset = dataset[split]
28
+ elif isinstance(dataset, dict) or isinstance(
29
+ dataset, Config) or isinstance(dataset, ConfigDict):
30
+ dataset = BUILDER.build(dataset)
31
+ if isinstance(dataset, DatasetDict):
32
+ if split is None:
33
+ dataset = concatenate_datasets(dataset.values())
34
+ else:
35
+ dataset = dataset[split]
36
+ return dataset
37
+
38
+
39
+ def map_dataset(dataset, dataset_map_fn, map_num_proc):
40
+ if isinstance(dataset_map_fn, str):
41
+ map_fn_obj = MAP_FUNC.get(dataset_map_fn) or get_object_from_string(
42
+ dataset_map_fn)
43
+ if map_fn_obj is not None:
44
+ dataset_map_fn = map_fn_obj
45
+ else:
46
+ raise TypeError('dataset_map_fn must be a function or a '
47
+ "registered function's string in MAP_FUNC, "
48
+ f"but got a string of '{dataset_map_fn}'")
49
+
50
+ dataset = dataset.map(dataset_map_fn, num_proc=map_num_proc)
51
+ return dataset
52
+
53
+
54
+ def add_template_to_dataset(dataset, template_map_fn, map_num_proc):
55
+ if isinstance(template_map_fn,
56
+ dict) or isinstance(template_map_fn, Config) or isinstance(
57
+ template_map_fn, ConfigDict):
58
+ template_map_fn = BUILDER.build(template_map_fn)
59
+ dataset = dataset.map(template_map_fn, num_proc=map_num_proc)
60
+ # remove invalid data
61
+ dataset = dataset.filter(
62
+ lambda example: len(example['conversation']) > 0,
63
+ num_proc=map_num_proc)
64
+ return dataset
65
+
66
+
67
+ def tokenize_dataset(dataset, tokenizer, max_length, with_image_token,
68
+ input_ids_with_output, remove_unused_columns,
69
+ map_num_proc):
70
+ assert (tokenizer is not None) and (max_length is not None), \
71
+ f'({tokenizer}, {max_length})'
72
+ if isinstance(tokenizer, dict) or isinstance(
73
+ tokenizer, Config) or isinstance(tokenizer, ConfigDict):
74
+ tokenizer = BUILDER.build(tokenizer)
75
+ dataset = dataset.map(
76
+ partial(
77
+ encode_fn,
78
+ tokenizer=tokenizer,
79
+ max_length=max_length,
80
+ with_image_token=with_image_token,
81
+ input_ids_with_output=input_ids_with_output),
82
+ remove_columns=list(dataset.column_names)
83
+ if remove_unused_columns else None,
84
+ num_proc=map_num_proc)
85
+ return dataset
86
+
87
+
88
+ def pack_dataset(dataset, max_length, use_varlen_attn, shuffle_before_pack,
89
+ map_num_proc):
90
+ if shuffle_before_pack:
91
+ dataset = dataset.shuffle()
92
+ dataset = dataset.flatten_indices(num_proc=map_num_proc)
93
+ dataset = dataset.map(
94
+ Packer(max_length, use_varlen_attn=use_varlen_attn),
95
+ batched=True,
96
+ num_proc=map_num_proc)
97
+ return dataset
98
+
99
+
100
+ def process(dataset,
101
+ do_dataset_tokenization=True,
102
+ tokenizer=None,
103
+ max_length=None,
104
+ dataset_map_fn=None,
105
+ template_map_fn=None,
106
+ max_dataset_length=None,
107
+ split='train',
108
+ remove_unused_columns=False,
109
+ rename_maps=[],
110
+ shuffle_before_pack=True,
111
+ pack_to_max_length=True,
112
+ use_varlen_attn=False,
113
+ input_ids_with_output=True,
114
+ with_image_token=False,
115
+ map_num_proc=32):
116
+ """Post-process the dataset loaded from the Hugging Face Hub, or a local
117
+ dataset.
118
+
119
+ Args:
120
+ dataset: The dataset to be post-processed.
121
+ do_dataset_tokenization: Whether the dataset needs to be tokenized
+ in this function. Defaults to True.
123
+ tokenizer: The tokenizer processes some raw text as input and outputs
124
+ an Encoding. If `do_dataset_tokenization` is True, this argument
125
+ should not be None. Default to None.
126
+ max_length: Max length of the sequence. If `do_dataset_tokenization`
127
+ or `pack_to_max_length` is True, this argument should not be None.
128
+ Default to None.
129
+ dataset_map_fn: Map the original dataset format to the one defined
130
+ by xTuner.
131
+ template_map_fn: Add the prompt template to the dataset
132
+ max_dataset_length: If the length of the dataset is too long, we can
133
+ randomly extract `max_dataset_length` from it.
134
+ split: Which split of the data to load.
135
+ If `None`, will return a single concatenated dataset with all
136
+ splits (typically `datasets.Split.TRAIN` and
137
+ `datasets.Split.TEST`).
138
+ If given, will return a single Dataset.
139
+ remove_unused_columns: Whether to remove columns from the dataset
140
+ that are not used during training.
141
+ rename_maps: Rename the column name of the dataset.
142
+ shuffle_before_pack: Whether to shuffle the dataset before
143
+ packing them.
144
+ pack_to_max_length: Whether to pack the dataset to `max_length`.
+ This usually improves GPU utilization and therefore reduces
+ training time.
147
+ use_varlen_attn: If True, attention is computed over the actual
+ length of each original sequence inside a packed sample rather
+ than over the full packed length.
150
+ input_ids_with_output: Whether to put the groundtruth output
151
+ corresponding to the question into the dataset. Typically set
152
+ it to True during training and False during testing.
153
+ with_image_token: Whether to convert DEFAULT_IMAGE_TOKEN to
154
+ IMAGE_TOKEN_INDEX. Typically set it to True during the training
155
+ of VLM.
156
+ map_num_proc: Max number of processes when mapping the dataset.
157
+ """
158
+ if use_varlen_attn:
159
+ assert pack_to_max_length, \
160
+ '`pack_to_max_length` in `process_hf_dataset` should be set to ' \
161
+ 'True if `use_varlen_attn` is True.'
162
+ if pack_to_max_length:
163
+ assert split == 'train' or split is None, \
164
+ ('`split` should be `train` or `None` if `pack_to_max_length` is '
165
+ f'True, but got {split}.')
166
+
167
+ dataset = build_origin_dataset(dataset, split)
168
+
169
+ # sample `max_dataset_length` items from the original dataset to
170
+ # save time consumed by map function
171
+ if max_dataset_length is not None:
172
+ max_dataset_length = min(max_dataset_length, len(dataset))
173
+ indices = np.random.choice(
174
+ len(dataset), max_dataset_length, replace=False)
175
+ dataset = dataset.select(indices)
176
+
177
+ # Extract the useful data for training from the original dataset.
178
+ if dataset_map_fn is not None:
179
+ dataset = map_dataset(dataset, dataset_map_fn, map_num_proc)
180
+
181
+ # Add prompt template, such as <|System|>: xxx <|User|>: xxx <|Bot|>: xxx
182
+ if template_map_fn is not None:
183
+ dataset = add_template_to_dataset(dataset, template_map_fn,
184
+ map_num_proc)
185
+
186
+ for old, new in rename_maps:
187
+ dataset = dataset.rename_column(old, new)
188
+
189
+ # remove unused columns
190
+ if pack_to_max_length and (not remove_unused_columns):
191
+ print_log(
192
+ 'We have to remove unused columns if '
193
+ '`pack_to_max_length` is set to True.',
194
+ logger='current',
195
+ level=logging.WARNING)
196
+ remove_unused_columns = True
197
+
198
+ if do_dataset_tokenization:
199
+ dataset = tokenize_dataset(dataset, tokenizer, max_length,
200
+ with_image_token, input_ids_with_output,
201
+ remove_unused_columns, map_num_proc)
202
+ else:
203
+ assert {'input_ids', 'labels'}.issubset(dataset.column_names)
204
+
205
+ if input_ids_with_output:
206
+ # remove data that does not have the valid labels.
207
+ dataset = dataset.filter(
208
+ lambda example: any(label >= 0 for label in example['labels']),
209
+ num_proc=map_num_proc)
210
+
211
+ # pack to max length
212
+ if pack_to_max_length:
213
+ dataset = pack_dataset(dataset, max_length, use_varlen_attn,
214
+ shuffle_before_pack, map_num_proc)
215
+
216
+ # add 'length'
217
+ dataset = dataset.map(get_lengths, num_proc=map_num_proc)
218
+ setattr(dataset, 'length', dataset['length'])
219
+
220
+ return dataset
221
+
222
+
223
+ def process_hf_dataset(dataset,
224
+ do_dataset_tokenization=True,
225
+ tokenizer=None,
226
+ max_length=None,
227
+ dataset_map_fn=None,
228
+ template_map_fn=None,
229
+ max_dataset_length=None,
230
+ split='train',
231
+ remove_unused_columns=False,
232
+ rename_maps=[],
233
+ shuffle_before_pack=True,
234
+ pack_to_max_length=True,
235
+ use_varlen_attn=False,
236
+ input_ids_with_output=True,
237
+ with_image_token=False,
238
+ map_num_proc=4):
239
+ """Post-process the dataset loaded from the Hugging Face Hub, or a local
240
+ dataset.
241
+
242
+ Args:
243
+ dataset: The dataset to be post-processed.
244
+ do_dataset_tokenization: Whether the dataset needs to be tokenized
+ in this function. Defaults to True.
246
+ tokenizer: The tokenizer processes some raw text as input and outputs
247
+ an Encoding. If `do_dataset_tokenization` is True, this argument
248
+ should not be None. Default to None.
249
+ max_length: Max length of the sequence. If `do_dataset_tokenization`
250
+ or `pack_to_max_length` is True, this argument should not be None.
251
+ Default to None.
252
+ dataset_map_fn: Map the original dataset format to the one defined
253
+ by xTuner.
254
+ template_map_fn: Add the prompt template to the dataset
255
+ max_dataset_length: If the length of the dataset is too long, we can
256
+ randomly extract `max_dataset_length` from it.
257
+ split: Which split of the data to load.
258
+ If `None`, will return a single concatenated dataset with all
259
+ splits (typically `datasets.Split.TRAIN` and
260
+ `datasets.Split.TEST`).
261
+ If given, will return a single Dataset.
262
+ remove_unused_columns: Whether to remove columns from the dataset
263
+ that are not used during training.
264
+ rename_maps: Rename the column name of the dataset.
265
+ shuffle_before_pack: Whether to shuffle the dataset before
266
+ packing them.
267
+ pack_to_max_length: Whether to pack the dataset to `max_length`.
+ This usually improves GPU utilization and therefore reduces
+ training time.
270
+ use_varlen_attn: If True, attention is computed over the actual
+ length of each original sequence inside a packed sample rather
+ than over the full packed length.
273
+ input_ids_with_output: Whether to put the groundtruth output
274
+ corresponding to the question into the dataset. Typically set
275
+ it to True during training and False during testing.
276
+ with_image_token: Whether to convert DEFAULT_IMAGE_TOKEN to
277
+ IMAGE_TOKEN_INDEX. Typically set it to True during the training
278
+ of VLM.
279
+ map_num_proc: Max number of processes when mapping the dataset.
280
+ """
281
+ kwargs = dict(
282
+ dataset=dataset,
283
+ do_dataset_tokenization=do_dataset_tokenization,
284
+ tokenizer=tokenizer,
285
+ max_length=max_length,
286
+ dataset_map_fn=dataset_map_fn,
287
+ template_map_fn=template_map_fn,
288
+ max_dataset_length=max_dataset_length,
289
+ split=split,
290
+ remove_unused_columns=remove_unused_columns,
291
+ rename_maps=rename_maps,
292
+ shuffle_before_pack=shuffle_before_pack,
293
+ pack_to_max_length=pack_to_max_length,
294
+ use_varlen_attn=use_varlen_attn,
295
+ input_ids_with_output=input_ids_with_output,
296
+ with_image_token=with_image_token,
297
+ map_num_proc=map_num_proc)
298
+ if not (dist.is_available() and dist.is_initialized()):
299
+ return process(**kwargs)
300
+
301
+ xtuner_dataset_timeout = timedelta(
302
+ minutes=int(os.getenv('XTUNER_DATASET_TIMEOUT', default=30)))
303
+ print_log(
304
+ f'xtuner_dataset_timeout = {xtuner_dataset_timeout}', logger='current')
305
+ # monitored barrier requires gloo process group to perform host-side sync.
306
+ group_gloo = dist.new_group(backend='gloo', timeout=xtuner_dataset_timeout)
307
+
308
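+ # only rank 0 runs the (potentially expensive) mapping/tokenization; the
+ # result is then broadcast so every rank gets an identical dataset.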
+ if dist.get_rank() == 0:
309
+ dataset = process(**kwargs)
310
+ objects = [dataset]
311
+ else:
312
+ objects = [None]
313
+
314
+ dist.monitored_barrier(group=group_gloo, timeout=xtuner_dataset_timeout)
315
+ dist.broadcast_object_list(objects, src=0)
316
+ return objects[0]
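+
+ # Minimal usage sketch (illustrative only; the dataset name, map_fn and
+ # tokenizer below are placeholders, not part of the original pipeline):
+ #
+ #   from datasets import load_dataset
+ #   from transformers import AutoTokenizer
+ #   from xtuner.dataset.map_fns import alpaca_map_fn
+ #
+ #   tokenizer = AutoTokenizer.from_pretrained(
+ #       'internlm/internlm2-chat-1_8b', trust_remote_code=True)
+ #   train_ds = process_hf_dataset(
+ #       dataset=load_dataset('tatsu-lab/alpaca'),
+ #       tokenizer=tokenizer,
+ #       max_length=2048,
+ #       dataset_map_fn=alpaca_map_fn,
+ #       pack_to_max_length=True)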
modified_xtuner/xtuner/dataset/llava.py ADDED
@@ -0,0 +1,88 @@
1
+ # Copyright (c) OpenMMLab. All rights reserved.
2
+ import json
3
+ import os
4
+
5
+ import torch
6
+ from datasets import Dataset as HFDataset
7
+ from datasets import DatasetDict
8
+ from mmengine.config import Config, ConfigDict
9
+ from PIL import Image
10
+ from torch.utils.data import Dataset
11
+
12
+ from xtuner.registry import BUILDER
13
+ from .huggingface import process_hf_dataset
14
+ from .utils import expand2square
15
+
16
+
17
+ class LLaVADataset(Dataset):
18
+
19
+ def __init__(self,
20
+ data_path,
21
+ image_folder,
22
+ tokenizer,
23
+ image_processor,
24
+ max_dataset_length=None,
25
+ dataset_map_fn=None,
26
+ template_map_fn=None,
27
+ max_length=2048,
28
+ pad_image_to_square=False):
29
+ super().__init__()
30
+
31
+ json_data = json.load(open(data_path))
32
+ for idx in range(len(json_data)):
33
+ if isinstance(json_data[idx]['id'], int):
34
+ json_data[idx]['id'] = str(json_data[idx]['id'])
35
+ json_data = DatasetDict({'train': HFDataset.from_list(json_data)})
36
+ self.text_data = process_hf_dataset(
37
+ dataset=json_data,
38
+ tokenizer=tokenizer,
39
+ max_length=max_length,
40
+ dataset_map_fn=dataset_map_fn,
41
+ template_map_fn=template_map_fn,
42
+ split='train',
43
+ max_dataset_length=max_dataset_length,
44
+ remove_unused_columns=False,
45
+ pack_to_max_length=False,
46
+ with_image_token=True)
47
+
48
+ self.image_folder = image_folder
49
+ if isinstance(image_processor, dict) or isinstance(
50
+ image_processor, Config) or isinstance(image_processor,
51
+ ConfigDict):
52
+ self.image_processor = BUILDER.build(image_processor)
53
+ else:
54
+ self.image_processor = image_processor
55
+ self.pad_image_to_square = pad_image_to_square
56
+
57
+ @property
58
+ def modality_length(self):
59
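+ # lengths used for sampling: text-only samples (no 'image' key) are given a
+ # negative length so downstream samplers can separate the two modalities.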
+ length_list = []
60
+ for data_dict in self.text_data:
61
+ cur_len = len(data_dict['input_ids'])
62
+ if data_dict.get('image', None) is None:
63
+ cur_len = -cur_len
64
+ length_list.append(cur_len)
65
+ return length_list
66
+
67
+ def __len__(self):
68
+ return len(self.text_data)
69
+
70
+ def __getitem__(self, index):
71
+ data_dict = self.text_data[index]
72
+ if data_dict.get('image', None) is not None:
73
+ image_file = data_dict['image']
74
+ image = Image.open(os.path.join(self.image_folder,
75
+ image_file)).convert('RGB')
76
+ if self.pad_image_to_square:
77
+ image = expand2square(
78
+ image,
79
+ tuple(
80
+ int(x * 255) for x in self.image_processor.image_mean))
81
+ image = self.image_processor.preprocess(
82
+ image, return_tensors='pt')['pixel_values'][0]
83
+ data_dict['pixel_values'] = image
84
+ else:
85
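+ # text-only sample: substitute an all-zero dummy image of the processor's
+ # expected size so every item still carries a `pixel_values` tensor.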
+ size = self.image_processor.size
86
+ data_dict['pixel_values'] = torch.zeros(3, size['height'],
87
+ size['width'])
88
+ return data_dict
modified_xtuner/xtuner/tools/chat.py ADDED
@@ -0,0 +1,491 @@
1
+ # Copyright (c) OpenMMLab. All rights reserved.
2
+ import argparse
3
+ import os
4
+ import os.path as osp
5
+ import re
6
+ import sys
7
+
8
+ import torch
9
+ from huggingface_hub import snapshot_download
10
+ from peft import PeftModel
11
+ from transformers import (AutoModel, AutoModelForCausalLM, AutoTokenizer,
12
+ BitsAndBytesConfig, SiglipImageProcessor,
13
+ SiglipVisionModel, GenerationConfig)
14
+ from transformers.generation.streamers import TextStreamer
15
+
16
+ from xtuner.dataset.utils import expand2square, load_image
17
+ from xtuner.model.utils import prepare_inputs_labels_for_multimodal
18
+ from xtuner.tools.utils import get_stop_criteria
19
+ from xtuner.utils import (DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX,
20
+ PROMPT_TEMPLATE, SYSTEM_TEMPLATE)
21
+
22
+ TORCH_DTYPE_MAP = dict(
23
+ fp16=torch.float16, bf16=torch.bfloat16, fp32=torch.float32, auto='auto')
24
+
25
+
26
+ def remove_prefix(state_dict, prefix):
27
+ new_state_dict = {}
28
+ for key, value in state_dict.items():
29
+ if key.startswith(prefix):
30
+ new_key = key[len(prefix):]
31
+ new_state_dict[new_key] = value
32
+ else:
33
+ new_state_dict[key] = value
34
+ return new_state_dict
35
+
36
+
37
+ def parse_args():
38
+ parser = argparse.ArgumentParser(description='Chat with a HF model')
39
+ parser.add_argument(
40
+ 'model_name_or_path', help='Hugging Face model name or path')
41
+ adapter_group = parser.add_mutually_exclusive_group()
42
+ adapter_group.add_argument(
43
+ '--adapter', default=None, help='adapter name or path')
44
+ adapter_group.add_argument(
45
+ '--llava', default=None, help='llava name or path')
46
+ parser.add_argument(
47
+ '--visual-encoder', default=None, help='visual encoder name or path')
48
+ parser.add_argument(
49
+ '--visual-select-layer', default=-2, help='visual select layer')
50
+ parser.add_argument('--image', default=None, help='image')
51
+ parser.add_argument(
52
+ '--torch-dtype',
53
+ default='fp16',
54
+ choices=TORCH_DTYPE_MAP.keys(),
55
+ help='Override the default `torch.dtype` and load the model under '
56
+ 'a specific `dtype`.')
57
+ parser.add_argument(
58
+ '--prompt-template',
59
+ choices=PROMPT_TEMPLATE.keys(),
60
+ default=None,
61
+ help='Specify a prompt template')
62
+ system_group = parser.add_mutually_exclusive_group()
63
+ system_group.add_argument(
64
+ '--system', default=None, help='Specify the system text')
65
+ system_group.add_argument(
66
+ '--system-template',
67
+ choices=SYSTEM_TEMPLATE.keys(),
68
+ default=None,
69
+ help='Specify a system template')
70
+ parser.add_argument(
71
+ '--bits',
72
+ type=int,
73
+ choices=[4, 8, None],
74
+ default=None,
75
+ help='LLM bits')
76
+ parser.add_argument(
77
+ '--bot-name', type=str, default='BOT', help='Name for Bot')
78
+ parser.add_argument(
79
+ '--with-plugins',
80
+ nargs='+',
81
+ choices=['calculate', 'solve', 'search'],
82
+ help='Specify plugins to use')
83
+ parser.add_argument(
84
+ '--no-streamer', action='store_true', help='Whether to disable the text streamer')
85
+ parser.add_argument(
86
+ '--lagent', action='store_true', help='Whether to use lagent')
87
+ parser.add_argument(
88
+ '--stop-words', nargs='+', type=str, default=[], help='Stop words')
89
+ parser.add_argument(
90
+ '--offload-folder',
91
+ default=None,
92
+ help='The folder in which to offload the model weights (or where the '
93
+ 'model weights are already offloaded).')
94
+ parser.add_argument(
95
+ '--max-new-tokens',
96
+ type=int,
97
+ default=2048,
98
+ help='Maximum number of new tokens allowed in generated text')
99
+ parser.add_argument(
100
+ '--temperature',
101
+ type=float,
102
+ default=0.1,
103
+ help='The value used to modulate the next token probabilities.')
104
+ parser.add_argument(
105
+ '--top-k',
106
+ type=int,
107
+ default=40,
108
+ help='The number of highest probability vocabulary tokens to '
109
+ 'keep for top-k-filtering.')
110
+ parser.add_argument(
111
+ '--top-p',
112
+ type=float,
113
+ default=0.75,
114
+ help='If set to float < 1, only the smallest set of most probable '
115
+ 'tokens with probabilities that add up to top_p or higher are '
116
+ 'kept for generation.')
117
+ parser.add_argument(
118
+ '--repetition-penalty',
119
+ type=float,
120
+ default=1.0,
121
+ help='The parameter for repetition penalty. 1.0 means no penalty.')
122
+ parser.add_argument(
123
+ '--seed',
124
+ type=int,
125
+ default=0,
126
+ help='Random seed for reproducible text generation')
127
+ args = parser.parse_args()
128
+ return args
129
+
130
+
131
+ def get_input():
132
+ """Helper function for getting input from users."""
133
+ sentinel = '' # ends when this string is seen
134
+ result = None
135
+ while result is None:
136
+ print(('\ndouble enter to end input (EXIT: exit chat, '
137
+ 'RESET: reset history) >>> '),
138
+ end='')
139
+ try:
140
+ result = '\n'.join(iter(input, sentinel))
141
+ except UnicodeDecodeError:
142
+ print('Invalid characters detected. Please enter again.')
143
+ return result
144
+
145
+
146
+ def main():
147
+ args = parse_args()
148
+ torch.manual_seed(args.seed)
149
+
150
+ # build llm
151
+ quantization_config = None
152
+ load_in_8bit = False
153
+ if args.bits == 4:
154
+ quantization_config = BitsAndBytesConfig(
155
+ load_in_4bit=True,
156
+ load_in_8bit=False,
157
+ llm_int8_threshold=6.0,
158
+ llm_int8_has_fp16_weight=False,
159
+ bnb_4bit_compute_dtype=torch.float16,
160
+ bnb_4bit_use_double_quant=True,
161
+ bnb_4bit_quant_type='nf4')
162
+ elif args.bits == 8:
163
+ load_in_8bit = True
164
+ model_kwargs = {
165
+ 'quantization_config': quantization_config,
166
+ 'load_in_8bit': load_in_8bit,
167
+ 'device_map': 'auto',
168
+ 'offload_folder': args.offload_folder,
169
+ 'trust_remote_code': True,
170
+ 'torch_dtype': TORCH_DTYPE_MAP[args.torch_dtype]
171
+ }
172
+ if args.lagent:
173
+ from lagent.actions import ActionExecutor, GoogleSearch
174
+ from lagent.agents import (CALL_PROTOCOL_CN, FORCE_STOP_PROMPT_CN,
175
+ ReAct, ReActProtocol)
176
+ from lagent.llms import HFTransformerCasualLM
177
+
178
+ try:
179
+ SERPER_API_KEY = os.environ['SERPER_API_KEY']
180
+ except Exception:
181
+ print('Please obtain the `SERPER_API_KEY` from https://serper.dev '
182
+ 'and set it using `export SERPER_API_KEY=xxx`.')
183
+ sys.exit(1)
184
+
185
+ model_kwargs.pop('trust_remote_code')
186
+ llm = HFTransformerCasualLM(
187
+ args.model_name_or_path, model_kwargs=model_kwargs)
188
+ if args.adapter is not None:
189
+ print(f'Loading adapter from {args.adapter}...')
190
+ llm.model = PeftModel.from_pretrained(
191
+ llm.model,
192
+ args.adapter,
193
+ offload_folder=args.offload_folder,
194
+ trust_remote_code=True)
195
+ search_tool = GoogleSearch(api_key=SERPER_API_KEY)
196
+ chatbot = ReAct(
197
+ llm=llm,
198
+ action_executor=ActionExecutor(actions=[search_tool]),
199
+ protocol=ReActProtocol(
200
+ call_protocol=CALL_PROTOCOL_CN,
201
+ force_stop=FORCE_STOP_PROMPT_CN))
202
+ while True:
203
+ text = get_input()
204
+ while text.strip() == 'RESET':
205
+ print('Log: History responses have been removed!')
206
+ chatbot._session_history = []
207
+ inputs = ''
208
+ text = get_input()
209
+ if text.strip() == 'EXIT':
210
+ print('Log: Exit!')
211
+ exit(0)
212
+ response = chatbot.chat(text)
213
+ print(response.response)
214
+ else:
215
+ if args.with_plugins is None:
216
+ inner_thoughts_open = False
217
+ calculate_open = False
218
+ solve_open = False
219
+ search_open = False
220
+ else:
221
+ assert args.prompt_template == args.system_template == 'moss_sft'
222
+ from plugins import plugins_api
223
+ inner_thoughts_open = True
224
+ calculate_open = 'calculate' in args.with_plugins
225
+ solve_open = 'solve' in args.with_plugins
226
+ search_open = 'search' in args.with_plugins
227
+ # pre-import for api and model preparation
228
+ if calculate_open:
229
+ from plugins import calculate # noqa: F401
230
+ if solve_open:
231
+ from plugins import solve # noqa: F401
232
+ if search_open:
233
+ from plugins import search # noqa: F401
234
+ # build llm
235
+ llm = AutoModelForCausalLM.from_pretrained(args.model_name_or_path,
236
+ **model_kwargs)
237
+ tokenizer = AutoTokenizer.from_pretrained(
238
+ args.model_name_or_path,
239
+ trust_remote_code=True,
240
+ encode_special_tokens=True)
241
+ print(f'Load LLM from {args.model_name_or_path}')
242
+ if args.adapter is not None:
243
+ llm = PeftModel.from_pretrained(
244
+ llm,
245
+ args.adapter,
246
+ offload_folder=args.offload_folder,
247
+ trust_remote_code=True)
248
+ print(f'Load adapter from {args.adapter}')
249
+ if args.llava is not None:
250
+ llava_path = snapshot_download(
251
+ repo_id=args.llava) if not osp.isdir(
252
+ args.llava) else args.llava
253
+
254
+ # build visual_encoder
255
+ if 'visual_encoder' in os.listdir(llava_path):
256
+ assert args.visual_encoder is None, (
257
+ "Please don't specify the `--visual-encoder` since passed "
258
+ '`--llava` contains a visual encoder!')
259
+ visual_encoder_path = osp.join(llava_path, 'visual_encoder')
260
+ else:
261
+ assert args.visual_encoder is not None, (
262
+ 'Please specify the `--visual-encoder`!')
263
+ visual_encoder_path = args.visual_encoder
264
+ visual_encoder = SiglipVisionModel.from_pretrained(
265
+ visual_encoder_path,
266
+ torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype])
267
+ image_processor = SiglipImageProcessor.from_pretrained(
268
+ visual_encoder_path)
269
+ print(f'Load visual_encoder from {visual_encoder_path}')
270
+
271
+ # load adapter
272
+ if 'llm_adapter' in os.listdir(llava_path):
273
+ adapter_path = osp.join(llava_path, 'llm_adapter')
274
+ llm = PeftModel.from_pretrained(
275
+ llm,
276
+ adapter_path,
277
+ offload_folder=args.offload_folder,
278
+ trust_remote_code=True)
279
+ print(f'Load LLM adapter from {args.llava}')
280
+ if 'visual_encoder_adapter' in os.listdir(llava_path):
281
+ adapter_path = osp.join(llava_path, 'visual_encoder_adapter')
282
+ visual_encoder = PeftModel.from_pretrained(
283
+ visual_encoder,
284
+ adapter_path,
285
+ offload_folder=args.offload_folder)
286
+ print(f'Load visual_encoder adapter from {args.llava}')
287
+
288
+ # build projector
289
+ projector_path = osp.join(llava_path, 'projector')
290
+ projector = AutoModel.from_pretrained(
291
+ projector_path,
292
+ torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype],
293
+ trust_remote_code=True)
294
+ print(f'Load projector from {args.llava}')
295
+
296
+ projector.cuda()
297
+ projector.eval()
298
+ visual_encoder.cuda()
299
+ visual_encoder.eval()
300
+
301
+ llm.eval()
302
+
303
+ if args.image is not None:
304
+ image = load_image(args.image)
305
+ image = expand2square(
306
+ image, tuple(int(x * 255) for x in image_processor.image_mean))
307
+ image = image_processor.preprocess(
308
+ image, return_tensors='pt')['pixel_values'][0]
309
+ image = image.cuda().unsqueeze(0)
310
+ visual_outputs = visual_encoder(image, output_hidden_states=True)
311
+ pixel_values = projector(
312
+ visual_outputs.hidden_states[args.visual_select_layer][:, 1:])
313
+
314
+ stop_words = args.stop_words
315
+ sep = ''
316
+ if args.prompt_template:
317
+ template = PROMPT_TEMPLATE[args.prompt_template]
318
+ stop_words += template.get('STOP_WORDS', [])
319
+ sep = template.get('SEP', '')
320
+ stop_criteria = get_stop_criteria(
321
+ tokenizer=tokenizer, stop_words=stop_words)
322
+
323
+ if args.no_streamer:
324
+ streamer = None
325
+ else:
326
+ streamer = TextStreamer(tokenizer, skip_prompt=True)
327
+
328
+ gen_config = GenerationConfig(
329
+ max_new_tokens=args.max_new_tokens,
330
+ do_sample=args.temperature > 0,
331
+ temperature=args.temperature,
332
+ top_p=args.top_p,
333
+ top_k=args.top_k,
334
+ repetition_penalty=args.repetition_penalty,
335
+ eos_token_id=tokenizer.eos_token_id,
336
+ pad_token_id=tokenizer.pad_token_id
337
+ if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
338
+ )
339
+
340
+ n_turn = 0
341
+ inputs = ''
342
+ while True:
343
+ text = get_input()
344
+ while text.strip() == 'RESET':
345
+ print('Log: History responses have been removed!')
346
+ n_turn = 0
347
+ inputs = ''
348
+ text = get_input()
349
+ if text.strip() == 'EXIT':
350
+ print('Log: Exit!')
351
+ exit(0)
352
+
353
+ if args.image is not None and n_turn == 0:
354
+ text = DEFAULT_IMAGE_TOKEN + '\n' + text
355
+
356
+ if args.prompt_template:
357
+ prompt_text = ''
358
+ template = PROMPT_TEMPLATE[args.prompt_template]
359
+ if 'SYSTEM' in template and n_turn == 0:
360
+ system_text = None
361
+ if args.system_template is not None:
362
+ system_text = SYSTEM_TEMPLATE[
363
+ args.system_template].format(
364
+ round=n_turn + 1, bot_name=args.bot_name)
365
+ elif args.system is not None:
366
+ system_text = args.system
367
+ if system_text is not None:
368
+ prompt_text += template['SYSTEM'].format(
369
+ system=system_text,
370
+ round=n_turn + 1,
371
+ bot_name=args.bot_name)
372
+ prompt_text += template['INSTRUCTION'].format(
373
+ input=text, round=n_turn + 1, bot_name=args.bot_name)
374
+ if args.prompt_template == args.system_template == 'moss_sft':
375
+ if not inner_thoughts_open:
376
+ prompt_text = prompt_text.replace('- Inner thoughts: enabled.',
377
+ '- Inner thoughts: disabled.')
378
+ if not calculate_open:
379
+ prompt_text = prompt_text.replace(('- Calculator: enabled. API: '
380
+ 'Calculate(expression)'),
381
+ '- Calculator: disabled.')
382
+ if not solve_open:
383
+ prompt_text = prompt_text.replace(
384
+ '- Equation solver: enabled. API: Solve(equation)',
385
+ '- Equation solver: disabled.')
386
+ if not search_open:
387
+ prompt_text = prompt_text.replace(
388
+ '- Web search: enabled. API: Search(query)',
389
+ '- Web search: disabled.')
390
+ else:
391
+ prompt_text = text
392
+ inputs += prompt_text
393
+ if args.image is None:
394
+ if n_turn == 0:
395
+ ids = tokenizer.encode(inputs, return_tensors='pt')
396
+ else:
397
+ ids = tokenizer.encode(
398
+ inputs, return_tensors='pt', add_special_tokens=False)
399
+
400
+ if args.with_plugins is not None:
401
+ generate_output = llm.generate(
402
+ inputs=ids.cuda(),
403
+ generation_config=gen_config,
404
+ streamer=streamer,
405
+ stopping_criteria=stop_criteria).cpu()
406
+ generate_output_text = tokenizer.decode(
407
+ generate_output[0][len(ids[0]):])
408
+ if streamer is None:
409
+ end = '' if generate_output_text[-1] == '\n' else '\n'
410
+ print(generate_output_text, end=end)
411
+ pattern = r'<\|Commands\|>:(.*?)<eoc>'
412
+ command_text = ', '.join(
413
+ re.findall(pattern, generate_output_text))
414
+ extent_text = plugins_api(
415
+ command_text,
416
+ calculate_open=calculate_open,
417
+ solve_open=solve_open,
418
+ search_open=search_open)
419
+ end = '' if extent_text[-1] == '\n' else '\n'
420
+ print(extent_text, end=end)
421
+ extent_text_ids = tokenizer.encode(
422
+ extent_text,
423
+ return_tensors='pt',
424
+ add_special_tokens=False)
425
+ new_ids = torch.cat((generate_output, extent_text_ids),
426
+ dim=1)
427
+
428
+ generate_output = llm.generate(
429
+ inputs=new_ids.cuda(),
430
+ generation_config=gen_config,
431
+ streamer=streamer,
432
+ stopping_criteria=stop_criteria)
433
+ if streamer is None:
434
+ output_text = tokenizer.decode(
435
+ generate_output[0][len(new_ids[0]):])
436
+ end = '' if output_text[-1] == '\n' else '\n'
437
+ print(output_text, end=end)
438
+ else:
439
+ generate_output = llm.generate(
440
+ inputs=ids.cuda(),
441
+ generation_config=gen_config,
442
+ streamer=streamer,
443
+ stopping_criteria=stop_criteria)
444
+ if streamer is None:
445
+ output_text = tokenizer.decode(
446
+ generate_output[0][len(ids[0]):])
447
+ end = '' if output_text[-1] == '\n' else '\n'
448
+ print(output_text, end=end)
449
+ inputs = tokenizer.decode(generate_output[0])
450
+ else:
451
+ chunk_encode = []
452
+ for idx, chunk in enumerate(inputs.split(DEFAULT_IMAGE_TOKEN)):
453
+ if idx == 0 and n_turn == 0:
454
+ cur_encode = tokenizer.encode(chunk)
455
+ else:
456
+ cur_encode = tokenizer.encode(
457
+ chunk, add_special_tokens=False)
458
+ chunk_encode.append(cur_encode)
459
+ assert len(chunk_encode) == 2
460
+ ids = []
461
+ for idx, cur_chunk_encode in enumerate(chunk_encode):
462
+ ids.extend(cur_chunk_encode)
463
+ if idx != len(chunk_encode) - 1:
464
+ ids.append(IMAGE_TOKEN_INDEX)
465
+ ids = torch.tensor(ids).cuda().unsqueeze(0)
466
+ mm_inputs = prepare_inputs_labels_for_multimodal(
467
+ llm=llm, input_ids=ids, pixel_values=pixel_values)
468
+
469
+ generate_output = llm.generate(
470
+ **mm_inputs,
471
+ generation_config=gen_config,
472
+ streamer=streamer,
473
+ bos_token_id=tokenizer.bos_token_id,
474
+ stopping_criteria=stop_criteria)
475
+ if streamer is None:
476
+ output_text = tokenizer.decode(generate_output[0])
477
+ end = '' if output_text[-1] == '\n' else '\n'
478
+ print(output_text, end=end)
479
+ inputs += tokenizer.decode(generate_output[0])
480
+ n_turn += 1
481
+ inputs += sep
482
+ if len(generate_output[0]) >= args.max_new_tokens:
483
+ print(
484
+ 'Remove the memory of history responses, since '
485
+ f'it exceeds the length limitation {args.max_new_tokens}.')
486
+ n_turn = 0
487
+ inputs = ''
488
+
489
+
490
+ if __name__ == '__main__':
491
+ main()
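
The multimodal path in `chat.py` above encodes the image with SigLIP, takes the hidden state selected by `--visual-select-layer` (default `-2`), projects it into the LLM embedding space, and splices those embeddings into the prompt at `IMAGE_TOKEN_INDEX`. The following is a minimal, hypothetical sketch of that flow using the same xtuner helpers imported above; the base-LLM name, the visual-encoder path, `demo.jpg`, and the prompt text are placeholder assumptions (the adapter and projector paths refer to `lora_and_projectors/` in this repo), not part of this commit.

```python
# Sketch only: image -> SigLIP -> projector -> LLM-space embeddings -> generate.
import torch
from peft import PeftModel
from transformers import (AutoModel, AutoModelForCausalLM, AutoTokenizer,
                          SiglipImageProcessor, SiglipVisionModel)
from xtuner.dataset.utils import expand2square, load_image
from xtuner.model.utils import prepare_inputs_labels_for_multimodal
from xtuner.utils import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX

llm_name = 'internlm/internlm2-chat-1_8b'    # assumed base LLM
vit_path = 'path/to/siglip-visual-encoder'   # placeholder path

llm = AutoModelForCausalLM.from_pretrained(
    llm_name, trust_remote_code=True, torch_dtype=torch.float16, device_map='auto')
llm = PeftModel.from_pretrained(llm, 'lora_and_projectors/llm_adapter')
tokenizer = AutoTokenizer.from_pretrained(llm_name, trust_remote_code=True)

visual_encoder = SiglipVisionModel.from_pretrained(
    vit_path, torch_dtype=torch.float16).cuda().eval()
visual_encoder = PeftModel.from_pretrained(
    visual_encoder, 'lora_and_projectors/visual_encoder_adapter')
image_processor = SiglipImageProcessor.from_pretrained(vit_path)
projector = AutoModel.from_pretrained(
    'lora_and_projectors/projector', trust_remote_code=True,
    torch_dtype=torch.float16).cuda().eval()

# Image -> SigLIP hidden state (layer -2) -> projector.
image = expand2square(load_image('demo.jpg'),
                      tuple(int(x * 255) for x in image_processor.image_mean))
pixels = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
pixels = pixels.to(torch.float16).cuda()
hidden = visual_encoder(pixels, output_hidden_states=True).hidden_states[-2]
pixel_values = projector(hidden[:, 1:])  # chat.py drops the first token

# Tokenize around the <image> placeholder and mark it with IMAGE_TOKEN_INDEX.
prompt = DEFAULT_IMAGE_TOKEN + '\nDescribe the image.'
pre, post = prompt.split(DEFAULT_IMAGE_TOKEN)
ids = (tokenizer.encode(pre) + [IMAGE_TOKEN_INDEX]
       + tokenizer.encode(post, add_special_tokens=False))
ids = torch.tensor(ids).cuda().unsqueeze(0)
mm_inputs = prepare_inputs_labels_for_multimodal(
    llm=llm, input_ids=ids, pixel_values=pixel_values)
out = llm.generate(**mm_inputs, max_new_tokens=64,
                   bos_token_id=tokenizer.bos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```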
modified_xtuner/xtuner/tools/mmbench.py ADDED
@@ -0,0 +1,510 @@
1
+ # Copyright (c) OpenMMLab. All rights reserved.
2
+ import argparse
3
+ import json
4
+ import math
5
+ import os
6
+ import os.path as osp
7
+ import re
8
+ import string
9
+ import time
10
+
11
+ import numpy as np
12
+ import pandas as pd
13
+ import torch
14
+ import tqdm
15
+ from huggingface_hub import snapshot_download
16
+ from mmengine import mkdir_or_exist
17
+ from mmengine.dist import (collect_results, get_dist_info, get_rank, init_dist,
18
+ master_only)
19
+ from mmengine.utils.dl_utils import set_multi_processing
20
+ from peft import PeftModel
21
+ from rich.console import Console
22
+ from rich.table import Table
23
+ from torch.utils.data import Dataset
24
+ from transformers import (AutoModel, AutoModelForCausalLM, AutoTokenizer,
25
+ BitsAndBytesConfig, SiglipImageProcessor,
26
+ SiglipVisionModel, GenerationConfig)
27
+
28
+ from xtuner.dataset.utils import decode_base64_to_image, expand2square
29
+ from xtuner.model.utils import LoadWoInit, prepare_inputs_labels_for_multimodal
30
+ from xtuner.tools.utils import get_stop_criteria, is_cn_string
31
+ from xtuner.utils import (DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX,
32
+ PROMPT_TEMPLATE)
33
+
34
+ TORCH_DTYPE_MAP = dict(
35
+ fp16=torch.float16, bf16=torch.bfloat16, fp32=torch.float32, auto='auto')
36
+
37
+
38
+ def parse_args():
39
+ parser = argparse.ArgumentParser(description='MMBench')
40
+ parser.add_argument(
41
+ 'model_name_or_path', help='Hugging Face model name or path')
42
+ parser.add_argument('--data-path', default=None, help='path to the MMBench TSV data file')
43
+ parser.add_argument('--work-dir', help='the dir to save results')
44
+ parser.add_argument('--llava', default=None, help='llava name or path')
45
+ parser.add_argument(
46
+ '--visual-encoder', default=None, help='visual encoder name or path')
47
+ parser.add_argument(
48
+ '--visual-select-layer', default=-2, type=int, help='visual select layer')
49
+ parser.add_argument(
50
+ '--prompt-template',
51
+ choices=PROMPT_TEMPLATE.keys(),
52
+ default=None,
53
+ help='Specify a prompt template')
54
+ parser.add_argument(
55
+ '--stop-words', nargs='+', type=str, default=[], help='Stop words')
56
+ parser.add_argument(
57
+ '--torch-dtype',
58
+ default='fp16',
59
+ choices=TORCH_DTYPE_MAP.keys(),
60
+ help='Override the default `torch.dtype` and load the model under '
61
+ 'a specific `dtype`.')
62
+ parser.add_argument(
63
+ '--bits',
64
+ type=int,
65
+ choices=[4, 8, None],
66
+ default=None,
67
+ help='LLM bits')
68
+ parser.add_argument(
69
+ '--bot-name', type=str, default='BOT', help='Name for Bot')
70
+ parser.add_argument(
71
+ '--offload-folder',
72
+ default=None,
73
+ help='The folder in which to offload the model weights (or where the '
74
+ 'model weights are already offloaded).')
75
+ parser.add_argument(
76
+ '--max-new-tokens',
77
+ type=int,
78
+ default=100,
79
+ help='Maximum number of new tokens allowed in generated text')
80
+ parser.add_argument(
81
+ '--seed',
82
+ type=int,
83
+ default=0,
84
+ help='Random seed for reproducible text generation')
85
+ parser.add_argument(
86
+ '--launcher',
87
+ choices=['none', 'pytorch', 'slurm', 'mpi'],
88
+ default='none',
89
+ help='job launcher')
90
+ args = parser.parse_args()
91
+ return args
92
+
93
+
94
+ @master_only
95
+ def master_print(msg):
96
+ print(msg)
97
+
98
+
99
+ class MMBenchDataset(Dataset):
100
+ ABBRS = {
101
+ 'coarse_perception': 'CP',
102
+ 'finegrained_perception (instance-level)': 'FP-S',
103
+ 'finegrained_perception (cross-instance)': 'FP-C',
104
+ 'logic_reasoning': 'LR',
105
+ 'relation_reasoning': 'RR',
106
+ 'attribute_reasoning': 'AR',
107
+ 'sketch_reasoning': 'Sketch Reasoning',
108
+ 'scenery_building': 'Scenery & Building',
109
+ 'food_clothes': 'Food & Clothes',
110
+ 'historical_figure': 'Historical Figure',
111
+ 'traditional_show': 'Traditional Show',
112
+ 'calligraphy_painting': 'Calligraphy Painting',
113
+ 'cultural_relic': 'Cultural Relic'
114
+ }
115
+
116
+ def __init__(self, data_file):
117
+ self.data_file = data_file
118
+ self.df = pd.read_csv(data_file, sep='\t')
119
+ self.split = 'dev' if 'answer' in self.df.iloc[0].keys() else 'test'
120
+ self.has_l2_category = 'l2-category' in self.df.columns.to_list()
121
+
122
+ def get_image(self, image):
123
+ while len(image) < 16:
124
+ image = self.df[self.df['index'] == int(image)]['image'].values
125
+ assert len(image) == 1
126
+ image = image[0]
127
+ image = decode_base64_to_image(image)
128
+ return image
129
+
130
+ def __len__(self):
131
+ return len(self.df)
132
+
133
+ def __getitem__(self, idx):
134
+ index = self.df.iloc[idx]['index']
135
+ image = self.df.iloc[idx]['image']
136
+ image = self.get_image(image)
137
+ question = self.df.iloc[idx]['question']
138
+ answer = self.df.iloc[idx]['answer'] if 'answer' in self.df.iloc[
139
+ 0].keys() else None
140
+ category = self.df.iloc[idx]['category']
141
+
142
+ options = {
143
+ cand: self.load_from_df(idx, cand)
144
+ for cand in string.ascii_uppercase
145
+ if self.load_from_df(idx, cand) is not None
146
+ }
147
+ options_prompt = ''
148
+ for key, item in options.items():
149
+ options_prompt += f'{key}. {item}\n'
150
+
151
+ hint = self.load_from_df(idx, 'hint')
152
+ data = {
153
+ 'img': image,
154
+ 'question': question,
155
+ 'answer': answer,
156
+ 'options': options_prompt,
157
+ 'category': category,
158
+ 'options_dict': options,
159
+ 'index': index,
160
+ 'context': hint,
161
+ }
162
+ if self.has_l2_category:
163
+ data.update({'l2-category': self.df.iloc[idx]['l2-category']})
164
+ return data
165
+
166
+ def load_from_df(self, idx, key):
167
+ if key in self.df.iloc[idx] and not pd.isna(self.df.iloc[idx][key]):
168
+ return self.df.iloc[idx][key]
169
+ else:
170
+ return None
171
+
172
+ @master_only
173
+ def eval_result(self, result_df, show=True):
174
+
175
+ def calc_acc(df, group='category'):
176
+ assert group in ['overall', 'category', 'l2-category']
177
+ if group == 'overall':
178
+ res = {'Average': np.mean(df['hit'])}
179
+ else:
180
+ res = {}
181
+ abilities = list(set(df[group]))
182
+ abilities.sort()
183
+ for ab in abilities:
184
+ sub_df = df[df[group] == ab]
185
+ ab = self.ABBRS[ab] if ab in self.ABBRS else ab
186
+ res[ab] = np.mean(sub_df['hit'])
187
+ return res
188
+
189
+ def eval_sub_data(sub_data, answer_map):
190
+ lt = len(sub_data)
191
+ for i in range(lt):
192
+ item = sub_data.iloc[i]
193
+ match = re.search(r'([A-D]+)', item['prediction'])
194
+ pred = match.group(1) if match else ''
195
+ gt = answer_map[item['index']]
196
+ if gt != pred:
197
+ return 0
198
+ return 1
199
+
200
+ def show_result(ret_json):
201
+ show_dict = ret_json.copy()
202
+ table = Table(title=f' MMBench ({self.data_file}) ')
203
+ console = Console()
204
+ table.add_column('Category', justify='left')
205
+ table.add_column('Accuracy (%)', justify='right')
206
+ average = show_dict.pop('Average') * 100
207
+ table.add_row('Average', f'{average:.1f}')
208
+ table.add_section()
209
+ for cat_name, cat_acc in show_dict.items():
210
+ table.add_row(cat_name, f'{cat_acc * 100:.1f}')
211
+ with console.capture() as capture:
212
+ console.print(table, end='')
213
+ print('\n' + capture.get())
214
+ print('Note: Please be cautious if you use the results in papers, '
215
+ "since we don't use ChatGPT as a helper for choice "
216
+ 'extraction')
217
+
218
+ data = result_df.sort_values(by='index')
219
+ data['prediction'] = [str(x) for x in data['prediction']]
220
+ for k in data.keys():
221
+ data[k.lower() if k not in 'ABCD' else k] = data.pop(k)
222
+
223
+ data_main = data[data['index'] < int(1e6)]
224
+ cate_map = {
225
+ i: c
226
+ for i, c in zip(self.df['index'], self.df['category'])
227
+ }
228
+ if self.has_l2_category:
229
+ l2_cate_map = {
230
+ i: c
231
+ for i, c in zip(self.df['index'], self.df['l2-category'])
232
+ }
233
+ answer_map = {
234
+ i: c
235
+ for i, c in zip(self.df['index'], self.df['answer'])
236
+ }
237
+
238
+ lt = len(data_main)
239
+ hit, tot = 0, 0
240
+ result = {}
241
+ for i in range(lt):
242
+ item_main = data_main.iloc[i]
243
+ idx = item_main['index']
244
+ assert idx not in result
245
+ sub_data = data[data['index'] % int(1e6) == idx]
246
+ ret = eval_sub_data(sub_data, answer_map)
247
+ result[idx] = ret
248
+ hit += ret
249
+ tot += 1
250
+
251
+ indices = data_main['index']
252
+ data_main = data_main.copy()
253
+ data_main['hit'] = [result[i] for i in indices]
254
+ main_idx = data_main['index']
255
+ data_main['category'] = [cate_map[i] for i in main_idx]
256
+
257
+ ret_json = calc_acc(data_main, 'overall')
258
+
259
+ if self.has_l2_category:
260
+ data_main['l2-category'] = [l2_cate_map[i] for i in main_idx]
261
+ l2 = calc_acc(data_main, 'l2-category')
262
+ ret_json.update(l2)
263
+ else:
264
+ leaf = calc_acc(data_main, 'category')
265
+ ret_json.update(leaf)
266
+ if show:
267
+ show_result(ret_json)
268
+ return ret_json
269
+
270
+
271
+ def main():
272
+ args = parse_args()
273
+
274
+ torch.manual_seed(args.seed)
275
+
276
+ if args.launcher != 'none':
277
+ set_multi_processing(distributed=True)
278
+ init_dist(args.launcher)
279
+
280
+ rank, world_size = get_dist_info()
281
+ torch.cuda.set_device(rank)
282
+ else:
283
+ rank = 0
284
+ world_size = 1
285
+
286
+ # build llm
287
+ quantization_config = None
288
+ load_in_8bit = False
289
+ if args.bits == 4:
290
+ quantization_config = BitsAndBytesConfig(
291
+ load_in_4bit=True,
292
+ load_in_8bit=False,
293
+ llm_int8_threshold=6.0,
294
+ llm_int8_has_fp16_weight=False,
295
+ bnb_4bit_compute_dtype=torch.float16,
296
+ bnb_4bit_use_double_quant=True,
297
+ bnb_4bit_quant_type='nf4')
298
+ elif args.bits == 8:
299
+ load_in_8bit = True
300
+ model_kwargs = {
301
+ 'quantization_config': quantization_config,
302
+ 'load_in_8bit': load_in_8bit,
303
+ 'device_map': rank if world_size > 1 else 'auto',
304
+ 'offload_folder': args.offload_folder,
305
+ 'trust_remote_code': True,
306
+ 'torch_dtype': TORCH_DTYPE_MAP[args.torch_dtype]
307
+ }
308
+
309
+ # build llm
310
+ with LoadWoInit():
311
+ llm = AutoModelForCausalLM.from_pretrained(args.model_name_or_path,
312
+ **model_kwargs)
313
+ tokenizer = AutoTokenizer.from_pretrained(
314
+ args.model_name_or_path,
315
+ trust_remote_code=True,
316
+ encode_special_tokens=True)
317
+ master_print(f'Load LLM from {args.model_name_or_path}')
318
+
319
+ llava_path = snapshot_download(
320
+ repo_id=args.llava) if not osp.isdir(args.llava) else args.llava
321
+
322
+ # build visual_encoder
323
+ if 'visual_encoder' in os.listdir(llava_path):
324
+ assert args.visual_encoder is None, (
325
+ "Please don't specify the `--visual-encoder` since passed "
326
+ '`--llava` contains a visual encoder!')
327
+ visual_encoder_path = osp.join(llava_path, 'visual_encoder')
328
+ else:
329
+ assert args.visual_encoder is not None, (
330
+ 'Please specify the `--visual-encoder`!')
331
+ visual_encoder_path = args.visual_encoder
332
+ with LoadWoInit():
333
+ visual_encoder = SiglipVisionModel.from_pretrained(
334
+ visual_encoder_path, torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype])
335
+ image_processor = SiglipImageProcessor.from_pretrained(
336
+ visual_encoder_path)
337
+ master_print(f'Load visual_encoder from {visual_encoder_path}')
338
+
339
+ # load adapter
340
+ if 'llm_adapter' in os.listdir(llava_path):
341
+ adapter_path = osp.join(llava_path, 'llm_adapter')
342
+
343
+ with LoadWoInit():
344
+ llm = PeftModel.from_pretrained(
345
+ llm, adapter_path, offload_folder=args.offload_folder)
346
+
347
+ master_print(f'Load LLM adapter from {args.llava}')
348
+
349
+ if 'visual_encoder_adapter' in os.listdir(llava_path):
350
+ adapter_path = osp.join(llava_path, 'visual_encoder_adapter')
351
+ visual_encoder = PeftModel.from_pretrained(
352
+ visual_encoder, adapter_path, offload_folder=args.offload_folder)
353
+ master_print(f'Load visual_encoder adapter from {args.llava}')
354
+
355
+ # build projector
356
+ projector_path = osp.join(llava_path, 'projector')
357
+ with LoadWoInit():
358
+ projector = AutoModel.from_pretrained(
359
+ projector_path, torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype])
360
+ master_print(f'Load projector from {args.llava}')
361
+
362
+ projector.cuda()
363
+ projector.eval()
364
+
365
+ visual_encoder.cuda()
366
+ visual_encoder.eval()
367
+
368
+ llm.eval()
369
+
370
+ stop_words = args.stop_words
371
+ if args.prompt_template:
372
+ template = PROMPT_TEMPLATE[args.prompt_template]
373
+ stop_words += template.get('STOP_WORDS', [])
374
+ stop_criteria = get_stop_criteria(
375
+ tokenizer=tokenizer, stop_words=stop_words)
376
+
377
+ gen_config = GenerationConfig(
378
+ max_new_tokens=args.max_new_tokens,
379
+ do_sample=False,
380
+ eos_token_id=tokenizer.eos_token_id,
381
+ pad_token_id=tokenizer.pad_token_id
382
+ if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
383
+ )
384
+
385
+ # work_dir
386
+ if args.work_dir is not None:
387
+ # update configs according to CLI args if args.work_dir is not None
388
+ save_dir = args.work_dir
389
+ else:
390
+ # use config filename as default work_dir
391
+ save_dir = osp.join('./work_dirs',
392
+ osp.splitext(osp.basename(args.data_path))[0])
393
+ timestamp = time.strftime('%Y%m%d_%H%M%S', time.localtime(time.time()))
394
+ save_dir = osp.join(save_dir, timestamp)
395
+
396
+ if rank == 0:
397
+ mkdir_or_exist(osp.abspath(save_dir))
398
+ print('=======================================================')
399
+ print(f'Dataset path: {osp.abspath(args.data_path)}\n'
400
+ f'Results will be saved to {osp.abspath(save_dir)}')
401
+ print('=======================================================')
402
+
403
+ args_path = osp.join(save_dir, 'args.json')
404
+ with open(args_path, 'w') as f:
405
+ json.dump(args.__dict__, f, indent=2)
406
+
407
+ results_xlsx_path = osp.join(save_dir, 'mmbench_result.xlsx')
408
+ results_json_path = osp.join(save_dir, 'mmbench_result.json')
409
+
410
+ dataset = MMBenchDataset(args.data_path)
411
+
412
+ results = []
413
+ n_samples = len(dataset)
414
+ per_rank_samples = math.ceil(n_samples / world_size)
415
+
416
+ per_rank_ids = range(per_rank_samples * rank,
417
+ min(n_samples, per_rank_samples * (rank + 1)))
418
+ for i in tqdm.tqdm(per_rank_ids, desc=f'Rank {rank}'):
419
+ data_sample = dataset[i]
420
+ if data_sample['context'] is not None:
421
+ text = data_sample['context'] + '\n' + data_sample[
422
+ 'question'] + '\n' + data_sample['options']
423
+ else:
424
+ text = data_sample['question'] + '\n' + data_sample['options']
425
+
426
+ text = DEFAULT_IMAGE_TOKEN + '\n' + text
427
+
428
+ if is_cn_string(text):
429
+ text = text + '请直接回答选项字母。'
430
+ else:
431
+ text = text + ("Answer with the option's letter from the "
432
+ 'given choices directly.')
433
+
434
+ if args.prompt_template:
435
+ prompt_text = ''
436
+ template = PROMPT_TEMPLATE[args.prompt_template]
437
+ prompt_text += template['INSTRUCTION'].format(
438
+ input=text, round=1, bot_name=args.bot_name)
439
+ else:
440
+ prompt_text = text
441
+ inputs = prompt_text
442
+
443
+ image = data_sample['img'].convert('RGB')
444
+ image = expand2square(
445
+ image, tuple(int(x * 255) for x in image_processor.image_mean))
446
+ image = image_processor.preprocess(
447
+ image, return_tensors='pt')['pixel_values'][0]
448
+ image = image.cuda().unsqueeze(0)
449
+ visual_outputs = visual_encoder(image, output_hidden_states=True)
450
+ pixel_values = projector(
451
+ visual_outputs.hidden_states[args.visual_select_layer][:, 1:])
452
+
453
+ chunk_encode = []
454
+ for idx, chunk in enumerate(inputs.split(DEFAULT_IMAGE_TOKEN)):
455
+ if idx == 0:
456
+ cur_encode = tokenizer.encode(chunk)
457
+ else:
458
+ cur_encode = tokenizer.encode(chunk, add_special_tokens=False)
459
+ chunk_encode.append(cur_encode)
460
+ assert len(chunk_encode) == 2
461
+ ids = []
462
+ for idx, cur_chunk_encode in enumerate(chunk_encode):
463
+ ids.extend(cur_chunk_encode)
464
+ if idx != len(chunk_encode) - 1:
465
+ ids.append(IMAGE_TOKEN_INDEX)
466
+ ids = torch.tensor(ids).cuda().unsqueeze(0)
467
+ mm_inputs = prepare_inputs_labels_for_multimodal(
468
+ llm=llm, input_ids=ids, pixel_values=pixel_values)
469
+
470
+ generate_output = llm.generate(
471
+ **mm_inputs,
472
+ generation_config=gen_config,
473
+ streamer=None,
474
+ bos_token_id=tokenizer.bos_token_id,
475
+ stopping_criteria=stop_criteria)
476
+
477
+ predict = tokenizer.decode(
478
+ generate_output[0], skip_special_tokens=True).strip()
479
+ cur_result = {}
480
+ cur_result['question'] = data_sample.get('question')
481
+ cur_result.update(data_sample.get('options_dict'))
482
+ cur_result['prediction'] = predict
483
+ if data_sample.get('category') is not None:
484
+ cur_result['category'] = data_sample.get('category')
485
+ if data_sample.get('l2-category') is not None:
486
+ cur_result['l2-category'] = data_sample.get('l2-category')
487
+ cur_result['index'] = data_sample.get('index')
488
+ cur_result['split'] = data_sample.get('split')
489
+ cur_result['answer'] = data_sample.get('answer')
490
+ results.append(cur_result)
491
+
492
+ results = collect_results(results, n_samples)
493
+
494
+ if get_rank() == 0:
495
+
496
+ results_df = pd.DataFrame(results)
497
+ with pd.ExcelWriter(results_xlsx_path, engine='openpyxl') as writer:
498
+ results_df.to_excel(writer, index=False)
499
+
500
+ if dataset.split == 'dev':
501
+ results_dict = dataset.eval_result(results_df, show=True)
502
+ with open(results_json_path, 'w') as f:
503
+ json.dump(results_dict, f, indent=2)
504
+ else:
505
+ print('All done!')
506
+
507
+
508
+ if __name__ == '__main__':
509
+
510
+ main()
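
`mmbench.py` above evaluates each rank on a contiguous shard of `ceil(n_samples / world_size)` samples, gathers predictions with `collect_results`, and, on the dev split, counts an item as a hit only if every circular-shift copy of it (index offset by multiples of 1e6) is answered correctly. The snippet below is a minimal illustration of just the sharding arithmetic; `shard_indices` is a hypothetical helper name, and `rank`/`world_size` would normally come from the launcher.

```python
# Minimal sketch of the per-rank sharding used by mmbench.py (illustrative only).
import math


def shard_indices(n_samples: int, rank: int, world_size: int) -> range:
    """Contiguous shard for one rank; the last rank may receive fewer items."""
    per_rank = math.ceil(n_samples / world_size)
    return range(per_rank * rank, min(n_samples, per_rank * (rank + 1)))


# Example: 10 samples across 3 ranks -> [0..3], [4..7], [8..9]
print([list(shard_indices(10, r, 3)) for r in range(3)])
```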