Update README.md
README.md CHANGED
@@ -9,33 +9,12 @@ datasets:
 - wanng/wukong100m
 ---
 
-# Model Card for InternVL-Chat-Chinese-V1.2
+# Model Card for InternVL-Chat-Chinese-V1.2-Plus
 
-## What is InternVL?
-
 \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\]
 
-InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.
-
-## InternVL-Chat-V1.2 Blog
-
-> Date: 2024/02/12<br>
-> Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang, Erfei Cui, Zhangwei Gao, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai
-
-We are excited to introduce InternVL-Chat-V1.2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model. Below is the pipeline.
-
 <img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
 
-From the experimental results, **we've observed that a stronger language model (34B) can better leverage the powerful capabilities of our vision foundation model ([InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)).**
-
-For better training reproducibility, we follow the minimalist design and data efficiency similar to LLaVA-NeXT. To reduce training costs, we provide a pre-trained MLP projector and only employ around 1 million visual instruction tuning samples for SFT. Our model has a total of 40 billion parameters and can be trained within 1.5 days using 32 A100 GPUs. The code, data, and model will be made publicly available.
-
-### Data Preparation
-
-Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1.2, utilizing approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
-
-For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
-
 ### Performance
 
 \* Proprietary Model
@@ -53,19 +32,6 @@ For more details about data preparation, please see [here](https://github.com/Op
 | InternVL-Chat-V1.2-Plus | 448x448 | 50.3 | 45.6 | 59.9 | 83.8 | 82.0 | 58.7 | 1624/551 | 98.1\* | 88.7 | 71.3\* | 76.4 | - | 66.9 |
 
 - MMBench results are collected from the [leaderboard](https://mmbench.opencompass.org.cn/leaderboard).
-- In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.
-
-### Training (SFT)
-
-We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
-
-For more details about training, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#start-training).
-
-The hyperparameters used for finetuning are listed in the following table.
-
-| Hyperparameter     | Trainable Param  | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
-| ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
-| InternVL-Chat-V1.2 | 40B (full model) | 512               | 1e-5          | 1      | 2048       | 0.05         |
 
 
 ## Model Details
@@ -83,7 +49,7 @@ The hyperparameters used for finetuning are listed in the following table.
 - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
 - SFT Stage
   - Learnable Component: ViT + MLP + LLM
-  - Data:
+  - Data: 12 million SFT samples.
 
 
 ## Model Usage
@@ -101,7 +67,7 @@ from PIL import Image
 from transformers import AutoModel, CLIPImageProcessor
 from transformers import AutoTokenizer
 
-path = "OpenGVLab/InternVL-Chat-Chinese-V1-2"
+path = "OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus"
 # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
 model = AutoModel.from_pretrained(
     path,
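
The Model Details context above notes that a pixel shuffle reduces the 1024 visual tokens produced by InternViT-6B at 448x448 resolution to 256 tokens before they are passed to the language model. The snippet below is only a sketch of that token-folding idea, not the repository's implementation; it assumes a 32x32 token grid (448x448 input with 14x14 patches) and an InternViT-6B hidden size of 3200, and merges each 2x2 neighborhood of tokens into one wider token.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Fold each (1/scale x 1/scale) block of visual tokens into one wider token."""
    b, n, c = x.shape                      # (batch, tokens, channels), tokens form a square grid
    h = w = int(n ** 0.5)
    x = x.view(b, h, w, c)
    x = x.view(b, h, int(w * scale), int(c / scale))                          # merge along width
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(b, int(w * scale), int(h * scale), int(c / (scale * scale)))   # merge along height
    x = x.permute(0, 2, 1, 3).contiguous()
    return x.view(b, int(n * scale * scale), int(c / (scale * scale)))

tokens = torch.randn(1, 1024, 3200)        # 32x32 grid of tokens; hidden size 3200 assumed
reduced = pixel_shuffle_tokens(tokens)     # -> (1, 256, 12800): 16x16 grid, 4x wider channels
print(reduced.shape)
```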
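
The Model Usage hunk above shows only the lines around the changed checkpoint path. A minimal end-to-end sketch in the same style is given below; the `chat()` helper and its exact arguments come from the model's `trust_remote_code` implementation and may differ, and the image path and generation settings here are placeholder assumptions.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus"

# If you have an 80G A100 GPU, you can put the entire model on a single GPU;
# otherwise device_map="auto" shards the 40B model across the available GPUs.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
).eval()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# The model expects 448x448 inputs (1024 patches -> 256 tokens after pixel shuffle).
image = Image.open("./examples/image1.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)

question = "Please describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```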