czczup committed on
Commit
62d21d2
1 Parent(s): 4635e95

Update README.md

Files changed (1):
  1. README.md +27 -26
README.md CHANGED
@@ -10,13 +10,19 @@ datasets:
  pipeline_tag: visual-question-answering
  ---

- # Model Card for InternVL-Chat-Chinese-V1.2

- ## What is InternVL?

- \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\]

- InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.

  ## InternVL-Chat-V1.2 Blog

@@ -42,20 +48,20 @@ For more details about data preparation, please see [here](https://github.com/Op
  \* Proprietary Model

  | name | image size | MMMU<br>(val) | MMMU<br>(test) | MathVista<br>(testmini) | MMB<br>(test) | MMB−CN<br>(test) | MMVP | MME | ScienceQA<br>(image) | POPE | TextVQA<br>(val) | SEEDv1<br>(image) | VizWiz<br>(test) | GQA<br>(test) |
- | ------------------ | ---------- | ------------- | -------------- | ----------------------- | ------------- | ---------------- | ---- | -------- | -------------------- | ---- | ------- | ----------------- | ---------------- | ------------- |
- | GPT−4V\* | unknown | 56.8 | 55.7 | 49.9 | 77.0 | 74.4 | 38.7 | 1409/517 | - | - | 78.0 | 71.6 | - | - |
- | Gemini Ultra\* | unknown | 59.4 | - | 53.0 | - | - | - | - | - | - | 82.3 | - | - | - |
- | Gemini Pro\* | unknown | 47.9 | - | 45.2 | 73.6 | 74.3 | 40.7 | 1497/437 | - | - | 74.6 | 70.7 | - | - |
- | Qwen−VL−Plus\* | unknown | 45.2 | 40.8 | 43.3 | 67.0 | 70.7 | - | 1681/502 | - | - | 78.9 | 65.7 | - | - |
- | Qwen−VL−Max\* | unknown | 51.4 | 46.8 | 51.0 | 77.6 | 75.7 | - | - | - | - | 79.5 | - | - | - |
- | | | | | | | | | | | | | | | |
- | LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
- | InternVL−Chat−V1.2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1672/509 | 83.3 | 88.0 | 69.7 | 75.6 | 60.0 | 64.0 |
-
- - MMBench results are collected from the [leaderboard](https://mmbench.opencompass.org.cn/leaderboard).
  - In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.

- ### Training (SFT)

  We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.

@@ -69,32 +75,28 @@ The hyperparameters used for finetuning are listed in the following table.


  ## Model Details
- - **Model Type:** vision large language model, multimodal chatbot
  - **Model Stats:**
    - Architecture: [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)
    - Params: 40B
-   - Image size: 448 x 448
-   - Number of visual tokens: 256

  - **Training Strategy:**
    - Pretraining Stage
      - Learnable Component: MLP
      - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
      - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
-   - SFT Stage
      - Learnable Component: ViT + MLP + LLM
-     - Data: A simplified, fully open-source dataset, containing approximately 1 million samples.


  ## Model Usage

- We provide a minimum code example to run InternVL-Chat using only the `transformers` library.

  You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.

- Note: If you meet this error `ImportError: This modeling file requires the following packages that were not found in your environment: fastchat`, please run `pip install fschat`.
-
-
  ```python
  import torch
  from PIL import Image
@@ -145,7 +147,6 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
  print(question, response)
  ```

-
  ## Citation

  If you find this project useful in your research, please consider citing:
 
@@ -10,13 +10,19 @@ datasets:
  pipeline_tag: visual-question-answering
  ---

+ # Model Card for InternVL-Chat-V1.2

+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/Sx8dq7ReqSLOgvA_oTmXL.webp" alt="Image Description" width="300" height="300">

+ \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]
+
+ | Model | Date | Download | Note |
+ | ----------------------- | ---------- | ------------------------------------------------------------------------------ | ----------------------------------- |
+ | InternVL-Chat-V1.5 | 2024.04.18 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5) | supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on various benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥new) |
+ | InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | more SFT data and stronger performance |
+ | InternVL-Chat-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | scales the LLM up to 34B |
+ | InternVL-Chat-V1.1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | supports Chinese and stronger OCR |

  ## InternVL-Chat-V1.2 Blog

 
@@ -42,20 +48,20 @@ For more details about data preparation, please see [here](https://github.com/Op
  \* Proprietary Model

  | name | image size | MMMU<br>(val) | MMMU<br>(test) | MathVista<br>(testmini) | MMB<br>(test) | MMB−CN<br>(test) | MMVP | MME | ScienceQA<br>(image) | POPE | TextVQA<br>(val) | SEEDv1<br>(image) | VizWiz<br>(test) | GQA<br>(test) |
+ | ------------------ | ---------- | ------------- | -------------- | ----------------------- | ------------- | ---------------- | ---- | -------- | -------------------- | ---- | ---------------- | ----------------- | ---------------- | ------------- |
+ | GPT−4V\* | unknown | 56.8 | 55.7 | 49.9 | 77.0 | 74.4 | 38.7 | 1409/517 | - | - | 78.0 | 71.6 | - | - |
+ | Gemini Ultra\* | unknown | 59.4 | - | 53.0 | - | - | - | - | - | - | 82.3 | - | - | - |
+ | Gemini Pro\* | unknown | 47.9 | - | 45.2 | 73.6 | 74.3 | 40.7 | 1497/437 | - | - | 74.6 | 70.7 | - | - |
+ | Qwen−VL−Plus\* | unknown | 45.2 | 40.8 | 43.3 | 67.0 | 70.7 | - | 1681/502 | - | - | 78.9 | 65.7 | - | - |
+ | Qwen−VL−Max\* | unknown | 51.4 | 46.8 | 51.0 | 77.6 | 75.7 | - | - | - | - | 79.5 | - | - | - |
+ | | | | | | | | | | | | | | | |
+ | LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
+ | InternVL−Chat−V1.2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |
+

  - In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.
+ - Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA result has been corrected to 72.5.

+ ### Training (Supervised Finetuning)

  We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.

 
@@ -69,32 +75,28 @@ The hyperparameters used for finetuning are listed in the following table.


  ## Model Details
+ - **Model Type:** multimodal large language model (MLLM)
  - **Model Stats:**
    - Architecture: [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)
+   - Image size: 448 x 448 (256 tokens)
    - Params: 40B

  - **Training Strategy:**
    - Pretraining Stage
      - Learnable Component: MLP
      - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
      - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens (see the sketch below).
+   - Supervised Finetuning Stage
      - Learnable Component: ViT + MLP + LLM
+     - Data: A simplified, fully open-source dataset containing approximately 1.2 million samples.
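
The pixel-shuffle token reduction mentioned in the note above can be illustrated with a minimal PyTorch-style sketch. This is only an illustration, not the repository's implementation; the tensor shapes and the hidden size of 3200 are assumed for InternViT-6B. The idea is to fold each 2x2 neighborhood of visual tokens into the channel dimension, turning 1024 tokens into 256.

```python
import torch


def pixel_shuffle_tokens(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Fold each 2x2 neighborhood of visual tokens into the channel dimension.

    x: (batch, num_tokens, channels), where num_tokens forms a square grid.
    Returns a tensor of shape (batch, num_tokens * scale**2, channels / scale**2).
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)                        # e.g. 1024 tokens -> 32 x 32 grid
    x = x.reshape(b, h, w, c)
    # merge pairs of adjacent columns into channels: (b, h, w/2, 2c)
    x = x.reshape(b, h, int(w * scale), int(c / scale))
    x = x.permute(0, 2, 1, 3)
    # merge pairs of adjacent rows into channels: (b, w/2, h/2, 4c)
    x = x.reshape(b, int(w * scale), int(h * scale), int(c / (scale * scale)))
    x = x.permute(0, 2, 1, 3)
    return x.reshape(b, int(n * scale * scale), int(c / (scale * scale)))


# Hypothetical shapes: 1024 ViT tokens with hidden size 3200 -> 256 tokens with hidden size 12800.
tokens = torch.randn(1, 1024, 3200)
print(pixel_shuffle_tokens(tokens).shape)  # torch.Size([1, 256, 12800])
```

With a downscale factor of 0.5, the token count shrinks by 4x while the per-token channel dimension grows by 4x, so no information is discarded before the MLP projector.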

  ## Model Usage

+ We provide example code to run InternVL-Chat-V1.2 using `transformers`.

  You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.

  ```python
  import torch
  from PIL import Image
 
@@ -145,7 +147,6 @@ response, history = model.chat(tokenizer, pixel_values, question, generation_con
  print(question, response)
  ```

  ## Citation

  If you find this project useful in your research, please consider citing: