Update README.md
README.md CHANGED
@@ -9,33 +9,12 @@ datasets:
 - wanng/wukong100m
 ---
 
-# Model Card for InternVL-Chat-Chinese-V1.2
+# Model Card for InternVL-Chat-Chinese-V1.2-Plus
 
-## What is InternVL?
-
 \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\]
 
-InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.
-
-## InternVL-Chat-V1.2 Blog
-
-> Date: 2024/02/12<br>
-> Developed by: Zhe Chen, Weiyun Wang, Wenhai Wang, Erfei Cui, Zhangwei Gao, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai
-
-We are excited to introduce InternVL-Chat-V1.2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model. Below is the pipeline.
-
 <img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
 
-From the experimental results, **we've observed that a stronger language model (34B) can better leverage the powerful capabilities of our vision foundation model ([InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)).**
-
-For better training reproducibility, we follow the minimalist design and data efficiency similar to LLaVA-NeXT. To reduce training costs, we provide a pre-trained MLP projector and only employ around 1 million visual instruction tuning samples for SFT. Our model has a total of 40 billion parameters and can be trained within 1.5 days using 32 A100 GPUs. The code, data, and model will be made publicly available.
-
-### Data Preparation
-
-Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1.2, utilizing approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
-
-For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
-
 ### Performance
 
 \* Proprietary Model
@@ -53,19 +32,6 @@ For more details about data preparation, please see [here](https://github.com/Op
 | InternVL-Chat-V1.2-Plus | 448x448 | 50.3 | 45.6 | 59.9 | 83.8 | 82.0 | 58.7 | 1624/551 | 98.1\* | 88.7 | 71.3\* | 76.4 | - | 66.9 |
 
 - MMBench results are collected from the [leaderboard](https://mmbench.opencompass.org.cn/leaderboard).
-- In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.
-
-### Training (SFT)
-
-We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
-
-For more details about training, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#start-training).
-
-The hyperparameters used for finetuning are listed in the following table.
-
-| Hyperparameter     | Trainable Param  | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
-| ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
-| InternVL-Chat-V1.2 | 40B (full model) | 512               | 1e-5          | 1      | 2048       | 0.05         |
 
 
 ## Model Details
@@ -83,7 +49,7 @@ The hyperparameters used for finetuning are listed in the following table.
 - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
 - SFT Stage
   - Learnable Component: ViT + MLP + LLM
-  - Data:
+  - Data: 12 million SFT samples.
 
 
 ## Model Usage
@@ -101,7 +67,7 @@ from PIL import Image
 from transformers import AutoModel, CLIPImageProcessor
 from transformers import AutoTokenizer
 
-path = "OpenGVLab/InternVL-Chat-Chinese-V1-2"
+path = "OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus"
 # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
 model = AutoModel.from_pretrained(
     path,
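
The Model Details context above notes that a pixel shuffle reduces the 1024 visual tokens produced by InternViT-6B at 448x448 resolution to 256 tokens before they are passed to the language model. The snippet below is only a sketch of that token-folding idea, not the repository's implementation; it assumes a 32x32 token grid (448x448 input with 14x14 patches) and an InternViT-6B hidden size of 3200, and merges each 2x2 neighborhood of tokens into one wider token.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Fold each (1/scale x 1/scale) block of visual tokens into one wider token."""
    b, n, c = x.shape                      # (batch, tokens, channels), tokens form a square grid
    h = w = int(n ** 0.5)
    x = x.view(b, h, w, c)
    x = x.view(b, h, int(w * scale), int(c / scale))                          # merge along width
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(b, int(w * scale), int(h * scale), int(c / (scale * scale)))   # merge along height
    x = x.permute(0, 2, 1, 3).contiguous()
    return x.view(b, int(n * scale * scale), int(c / (scale * scale)))

tokens = torch.randn(1, 1024, 3200)        # 32x32 grid of tokens; hidden size 3200 assumed
reduced = pixel_shuffle_tokens(tokens)     # -> (1, 256, 12800): 16x16 grid, 4x wider channels
print(reduced.shape)
```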
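
The Model Usage hunk above shows only the lines around the changed checkpoint path. A minimal end-to-end sketch in the same style is given below; the `chat()` helper and its exact arguments come from the model's `trust_remote_code` implementation and may differ, and the image path and generation settings here are placeholder assumptions.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus"

# If you have an 80G A100 GPU, you can put the entire model on a single GPU;
# otherwise device_map="auto" shards the 40B model across the available GPUs.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
).eval()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# The model expects 448x448 inputs (1024 patches -> 256 tokens after pixel shuffle).
image = Image.open("./examples/image1.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)

question = "Please describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```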