BellaBei committed on
Commit 8a8473d
1 Parent(s): 24cefa3

Update Readme

Files changed (1)
  1. README.md +20 -15
README.md CHANGED
@@ -58,6 +58,7 @@ license_link: LICENSE
  - [Showcases](#showcases)
  - [How to use Yi-VL?](#how-to-use-yi-vl)
  - [Quick start](#quick-start)
+ - [Hardware requirement](#hardware-requirement)
  - [Misc.](#misc)
  - [Citation](#citation)
  - [Acknowledgements and attributions](#acknowledgements-and-attributions)
@@ -74,7 +75,7 @@ license_link: LICENSE
 
  - Yi-VL demonstrates exceptional performance, **ranking first** among all existing open-source models in the latest benchmarks including [MMMU](https://mmmu-benchmark.github.io/#leaderboard) in English and [CMMMU](https://mmmu-benchmark.github.io/#leaderboard) in Chinese (based on data available up to January 2024).
 
- - Yi-34B-VL is the **first** open-source 34B vision language model worldwide.
+ - Yi-VL-34B is the **first** open-source 34B vision language model worldwide.
 
  ## Models
 
@@ -95,7 +96,7 @@ Yi-VL offers the following features:
 
  - Strong image comprehension: Yi-VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.
 
- - Fine-grained image resolution: Yi-VL supports image understanding at a higher resolution of 448x448.
+ - Fine-grained image resolution: Yi-VL supports image understanding at a higher resolution of 448×448.
 
  ## Architecture
 
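A quick note on why the 448×448 change in this hunk matters: the number of visual tokens a ViT produces grows with the square of the input resolution. A minimal arithmetic sketch, assuming a CLIP-style 14×14 patch size (the patch size is an assumption, not stated in this diff):

```python
# Patch-count arithmetic for a ViT vision tower.
# Assumption: 14x14 pixel patches, as in CLIP-style encoders; Yi-VL's exact
# patch size is not stated in this diff, so the numbers are illustrative.

def num_patches(resolution: int, patch_size: int = 14) -> int:
    """Visual tokens produced for a square image before any pooling."""
    side = resolution // patch_size
    return side * side

for res in (224, 448):
    print(f"{res}x{res}: {num_patches(res)} patches")
# 224x224: 256 patches
# 448x448: 1024 patches, i.e. 4x more visual tokens to capture fine detail
```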
 
@@ -105,9 +106,9 @@ Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, whi
 
  - Projection Module: it's designed to align image features with text feature space, consisting of a two-layer Multilayer Perceptron (MLP) with layer normalizations.
 
- - Large Language Model (LLM): it's initialized with [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) or [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat), demonstrating exceptional proficiency in understanding and generating both English and Chinese. To enhance the performance of Yi-VL models in bilingual multimodal understanding and generation, a rich dataset of bilingual image-text pairs is leveraged.
+ - Large Language Model (LLM): it's initialized with [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) or [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat), demonstrating exceptional proficiency in understanding and generating both English and Chinese.
 
- ![Yi-VL architecture]()
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/EGVHSWG4kAcX01xDaoeXS.png)
 
  ## Training
 
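For readers of this hunk, the projection module described above (a two-layer MLP with layer normalizations that maps ViT features into the LLM's embedding space) can be pictured with a minimal PyTorch sketch. The dimensions and the GELU activation are illustrative assumptions, not taken from the Yi-VL source.

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Illustrative two-layer MLP projector with layer norms.

    Maps vision-tower features (vision_dim) to the LLM hidden size (llm_dim).
    The sizes and activation are assumptions for this sketch; the actual
    Yi-VL projector may differ.
    """

    def __init__(self, vision_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.LayerNorm(llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
            nn.LayerNorm(llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)

# Example: project 1024 patch embeddings into the LLM embedding space.
print(ProjectionMLP()(torch.randn(1, 1024, 1280)).shape)  # (1, 1024, 4096)
```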
 
@@ -115,24 +116,22 @@ Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, whi
 
  Yi-VL is trained to align visual information well to the semantic space of Yi LLM, which undergoes a comprehensive three-stage training process:
 
- - Stage 1: The parameters of ViT and the projection module are trained using an image resolution of 224×224. The LLM weights are frozen. The training leverages an image caption dataset comprising 100 million image-text pairs. The primary objective is to enhance the ViT's knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.
+ - Stage 1: The parameters of ViT and the projection module are trained using an image resolution of 224×224. The LLM weights are frozen. The training leverages an image caption dataset comprising 100 million image-text pairs from [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/). The primary objective is to enhance the ViT's knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.
 
- - Stage 2: The image resolution of ViT is scaled up to 448×448, and the parameters of ViT and the projection module are trained. It aims to further boost the model's capability for discerning intricate visual details. The dataset used in this stage includes about 25 million image-text pairs.
+ - Stage 2: The image resolution of ViT is scaled up to 448×448, and the parameters of ViT and the projection module are trained. It aims to further boost the model's capability for discerning intricate visual details. The dataset used in this stage includes about 25 million image-text pairs, such as [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/), [CLLaVA](https://huggingface.co/datasets/LinkSoul/Chinese-LLaVA-Vision-Instructions), [LLaVAR](https://llavar.github.io/), [Flickr](https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset), [VQAv2](https://paperswithcode.com/dataset/visual-question-answering-v2-0), [RefCOCO](https://github.com/lichengunc/refer/tree/master), [Visual7w](http://ai.stanford.edu/~yukez/visual7w/) and so on.
 
- - Stage 3: The parameters of the entire model (that is, ViT, projection module, and LLM) are trained. The primary goal is to enhance the model's proficiency in multimodal chat interactions, thereby endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. To this end, the training dataset encompasses a diverse range of sources, totalling approximately 1 million image-text pairs, including the data of image caption, VQA, grounding and so on. To ensure data balancing, we impose a cap on the maximum data contribution from any single source, restricting it to no more than 50,000 pairs.
+ - Stage 3: The parameters of the entire model (that is, ViT, projection module, and LLM) are trained. The primary goal is to enhance the model's proficiency in multimodal chat interactions, thereby endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. To this end, the training dataset encompasses a diverse range of sources, totalling approximately 1 million image-text pairs, including [GQA](https://cs.stanford.edu/people/dorarad/gqa/download.html), [VizWiz VQA](https://vizwiz.org/tasks-and-datasets/vqa/), [TextCaps](https://opendatalab.com/OpenDataLab/TextCaps), [OCR-VQA](https://ocr-vqa.github.io/), [Visual Genome](https://homes.cs.washington.edu/~ranjay/visualgenome/api.html), [LAION GPT4V](https://huggingface.co/datasets/laion/gpt4v-dataset) and so on. To ensure data balancing, we impose a cap on the maximum data contribution from any single source, restricting it to no more than 50,000 pairs.
 
  Below are the parameters configured for each stage.
 
- Stage | Global batch size | Learning rate | Gradient clip | NO. of epochs
+ Stage | Global batch size | Learning rate | Gradient clip | Epochs
 |---|---|---|---|---
 Stage 1, 2 |4096|1e-4|0.5|1
 Stage 3|256|2e-5|1.0|2
 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/EGVHSWG4kAcX01xDaoeXS.png)
-
  ### Training resource consumption
 
- - The training consumes 128 NVIDIA A100 GPUs.
+ - The training consumes 128 NVIDIA A800 (80G) GPUs.
 
  - The total training time amounted to approximately 10 days for Yi-VL-34B and 3 days for Yi-VL-6B.
 
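To make the stage descriptions and the hyperparameter table in this hunk easier to scan, here is a small sketch that transcribes them into a config structure and works out the GPU-hours implied by the resource-consumption bullets. The dataclass, the trainable-module split, and the assumption that Stage 3 keeps the 448×448 resolution are illustrative, not the actual Yi-VL training code.

```python
# Stage settings transcribed from the table and bullets above; the structure
# itself is illustrative, not the real Yi-VL training configuration format.
from dataclasses import dataclass

@dataclass
class StageConfig:
    image_resolution: int     # input resolution fed to the ViT
    global_batch_size: int
    learning_rate: float
    gradient_clip: float
    epochs: int
    trainable: tuple          # modules unfrozen in this stage

STAGES = {
    "stage1": StageConfig(224, 4096, 1e-4, 0.5, 1, ("vit", "projector")),
    "stage2": StageConfig(448, 4096, 1e-4, 0.5, 1, ("vit", "projector")),
    # Stage 3 resolution is assumed to stay at 448x448; the diff does not say.
    "stage3": StageConfig(448, 256, 2e-5, 1.0, 2, ("vit", "projector", "llm")),
}

# GPU-hour arithmetic from the resource-consumption bullets (128 GPUs).
gpus = 128
print("Yi-VL-34B:", gpus * 10 * 24, "GPU-hours")  # 30,720
print("Yi-VL-6B: ", gpus * 3 * 24, "GPU-hours")   # 9,216
```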
 
@@ -166,16 +165,14 @@ Yi-VL outperforms all existing open-source models in [MMMU](https://mmmu-benchma
 
  - MMMU
 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/6YuSakMCg3D2AozixdoZ0.png)
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/kCmXuwLbLvequ93kjh3mg.png)
 
  - CMMMU
 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/kCmXuwLbLvequ93kjh3mg.png)
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/6YuSakMCg3D2AozixdoZ0.png)
 
  ## Showcases
 
- Yi-VL can describe images accurately and in detail with few hallucinations.
-
  Below are some representative examples of detailed description and visual question answering, showcasing the capabilities of Yi-VL.
 
  - English
@@ -206,6 +203,14 @@ Notes:
 
  - You need to set the parameter `mm_vision_tower` in `config.json` to the local ViT path.
 
+ ## Hardware requirement
+
+ For model inference, the recommended GPU examples are:
+
+ - Yi-VL-6B: RTX 3090, RTX 4090, A10, A30
+
+ - Yi-VL-34B: 4 × RTX 4090, A800 (80 GB)
+
  # Misc.
 
  ## Citation
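For the `mm_vision_tower` note in this hunk, a minimal sketch of what the `config.json` edit could look like, assuming a standard local checkout of the model; the paths below are placeholders, not real locations.

```python
import json
from pathlib import Path

# Placeholder paths; substitute your own local checkout locations.
model_dir = Path("./Yi-VL-6B")
local_vit_path = "/path/to/local/vit"  # directory holding the ViT weights

config_path = model_dir / "config.json"
config = json.loads(config_path.read_text())

# Point the vision tower at the local ViT path instead of a remote identifier.
config["mm_vision_tower"] = local_vit_path
config_path.write_text(json.dumps(config, indent=2))
print("mm_vision_tower ->", config["mm_vision_tower"])
```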
 