Zhang Hui committed on
Commit 83bb89e1
Parent(s): 3fd04ab
update readme
README.md
CHANGED
@@ -9,7 +9,7 @@ license: apache-2.0
 
 CatVision is an open-source multimodal large-scale model that closely emulates the capabilities of GPT4V/Qwen-VL-Plus. Built on Qwen-72b-Chat, it excels at handling inputs that combine images and text, and it is designed to follow output-format instructions effectively, benefiting from the strengths of Qwen-72b (a usage sketch follows the diff below).
 
-
+An open-source multimodal large model that closely emulates the capabilities of the GPT4V/Qwen-VL series. Built on Qwen-72b-Chat, CatVision can handle interleaved image/text inputs, and it is designed to follow output-format instructions effectively, benefiting from the strengths of Qwen-72b.
 
 - Our training approach consisted of two stages, inspired by LLaVA-1.5. In the first stage, we trained the visual encoder + perceptual resampler; in the second stage, we trained the large language model + perceptual resampler on instruction data. To work within our limited computational resources (32xA100-80G), we used LoRA in both stages (a minimal LoRA sketch follows the diff below).
 
@@ -68,7 +68,7 @@ Our model achieved favorable results on many leaderboards.
 | Gemini Pro | 47.9 | ---- |
 | Yi-VL-34B | 45.9 | 41.6 |
 | Qwen-VL-PLUS | 45.2 | 40.8 |
-
+| **CatVision** | 45.9 | 40.1 |
 | Macro-VL | 41.2 | 40.4 |
 | InfiMM-Zephyr-7B | 39.4 | 35.5 |
 | Yi-VL-6B | 39.1 | 37.8 |
@@ -83,7 +83,7 @@ Our model achieved favorable results on many leaderboards.
 |--------------------------------|:---------:|:------------:|
 | GPT-4V(ision) (Playground) | 42.5 | 43.7 |
 | Qwen-VL-PLUS* | 39.5 | 36.8 |
-
+| **CatVision** | 39.6 | ---- |
 | Yi-VL-34B | 36.2 | 36.5 |
 | Yi-VL-6B | 35.8 | 35.0 |
 | Qwen-VL-7B-Chat | 30.7 | 31.3 |
@@ -104,7 +104,7 @@ Our model achieved favorable results on many leaderboards.
 | Qwen-VL-PLUS(BASE) | 83.3 | 83.2 | 82.7 | 81.5 | 77.6 |
 | GPT4v | 77.0 | 75.1 | 74.4 | 75.0 | 46.5 |
 | Qwen-VL-PLUS | 67.0 | 66.2 | 70.7 | 69.6 | 55.1 |
-
+| **CatVision** | 70.9 | 71.8 | 70.2 | 71.6 | 49.8 |
 | Qwen-VL-Chat | 61.8 | 60.6 | 56.3 | 56.7 | 41.2 |
 
 - **[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)**
@@ -113,7 +113,7 @@ Our model achieved favorable results on many leaderboards.
 |---------------|:----------:|:---------:|
 | GPT4v | 1409.43 | 517.14 |
 | Qwen-VL-PLUS | 1681.25 | 502.14 |
-
+| **CatVision** | 1560.90 | 366.43 |
 | Qwen-VL-Chat | 1487.57 | 360.71 |
 
 - **OpenCompass**
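
The README above does not include inference code, so here is a minimal, hypothetical usage sketch of the capability it describes (interleaved image/text input plus output-format instructions). It assumes CatVision exposes a Qwen-VL-style remote-code interface (`tokenizer.from_list_format` / `model.chat`), which is plausible for a Qwen-72b-Chat derivative but is not confirmed by this commit; the checkpoint path is a placeholder.

```python
# Hypothetical usage sketch. The checkpoint path is a placeholder, and the
# Qwen-VL-style helpers (from_list_format, chat) are assumed, not documented here.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./CatVision"  # placeholder: local path or Hub id of the released weights

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", trust_remote_code=True
).eval()

# Interleaved image + text input, with an explicit output-format instruction.
query = tokenizer.from_list_format([
    {"image": "https://example.com/sample.jpg"},
    {"text": "Describe the image, then answer as JSON with keys 'objects' and 'scene'."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```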
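
The training note mentions two LoRA stages on top of Qwen-72b-Chat but gives no configuration. Below is a minimal sketch of how the second stage (LoRA on the language model) might be set up with the `peft` library; the rank, target module names, and all other hyperparameters are assumptions, and the perceptual resampler is omitted because it is not part of the public Qwen checkpoint.

```python
# Minimal, assumed second-stage setup: LoRA adapters on the Qwen-72B-Chat backbone.
# Hyperparameters and target module names are guesses, not taken from CatVision.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat",          # language backbone named in the README
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=64,                                 # assumed rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj"],  # Qwen-1 attention projections (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```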