Zhang Hui
committed on
Commit ff8d687
1 Parent(s): 83bb89e
update readme
README.md
CHANGED
@@ -4,12 +4,15 @@ license: apache-2.0
 
 # CatVision
 
-
 ## Introduction
 
 A multimodal large-scale model, open-source by design, that closely emulates the functionality of GPT4V/Qwen-VL-Plus. Built on Qwen-72b-Chat, CatVision handles inputs that interleave images and text and, benefiting from the strengths of Qwen-72b, is designed to follow output-format instructions effectively.
+
+An open-source multimodal large model that closely mirrors the capabilities of the GPT4V/Qwen-VL-PLUS series. Built on Qwen-72b-Chat, it can process interleaved image-and-text inputs and, drawing on the strengths of Qwen-72b, aims to follow output-format instructions effectively.
 
-
+Our model performs close to the closed-source Qwen-VL-PLUS on many datasets and significantly surpasses the open-source Qwen-VL-7B-Chat.
+
+On many datasets, our model approaches the closed-source Qwen-VL-PLUS and substantially outperforms the open-source Qwen-VL-7B-Chat.
 
 Our training approach consisted of two stages, inspired by LLaVA-1.5. In the first stage we trained the visual encoder and the perceptual resampler; in the second stage we trained the large language model and the perceptual resampler on instruction-following data. To work within limited computational resources (32x A100-80G), we used LoRA in both stages.
 
@@ -68,7 +71,7 @@ Our model achieved favorable results on many leaderboards.
 | Gemini Pro | 47.9 | ---- |
 | Yi-VL-34B | 45.9 | 41.6 |
 | Qwen-VL-PLUS | 45.2 | 40.8 |
-| **CatVision** |
+| **CatVision** | 45.9 | 40.1 |
 | Macro-VL | 41.2 | 40.4 |
 | InfiMM-Zephyr-7B | 39.4 | 35.5 |
 | Yi-VL-6B | 39.1 | 37.8 |
@@ -79,11 +82,11 @@ Our model achieved favorable results on many leaderboards.
 
 - **[CMMMU](https://github.com/CMMMU-Benchmark/CMMMU/blob/main/README.md)**
 
-| Model | Val (900) | Test (11K)
+| Model | Val (900) | Test (11K) |
 |--------------------------------|:---------:|:------------:|
-| GPT-4V(ision) (Playground) | 42.5 | 43.7
+| GPT-4V(ision) (Playground) | 42.5 | 43.7 |
 | Qwen-VL-PLUS* | 39.5 | 36.8 |
-| **CatVision** |
+| **CatVision** | 39.6 | ---- |
 | Yi-VL-34B | 36.2 | 36.5 |
 | Yi-VL-6B | 35.8 | 35.0 |
 | Qwen-VL-7B-Chat | 30.7 | 31.3 |
@@ -104,7 +107,7 @@ Our model achieved favorable results on many leaderboards.
 | Qwen-VL-PLUS(BASE) | 83.3 | 83.2 | 82.7 | 81.5 | 77.6 |
 | GPT4v | 77.0 | 75.1 | 74.4 | 75.0 | 46.5 |
 | Qwen-VL-PLUS | 67.0 | 66.2 | 70.7 | 69.6 | 55.1 |
-| **CatVision** |
+| **CatVision** | 70.9 | 71.8 | 70.2 | 71.6 | 49.8 |
 | Qwen-VL-Chat | 61.8 | 60.6 | 56.3 | 56.7 | 41.2 |
 
 - **[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)**
@@ -113,7 +116,7 @@ Our model achieved favorable results on many leaderboards.
 |---------------|:----------:|:---------:|
 | GPT4v | 1409.43 | 517.14 |
 | Qwen-VL-PLUS | 1681.25 | 502.14 |
-| **CatVision** |
+| **CatVision** | 1560.90 | 366.43 |
 | Qwen-VL-Chat | 1487.57 | 360.71 |
 
 - **OpenCompass**
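The two-stage LoRA recipe the README describes (freeze the pretrained weights, train only low-rank adapters) can be illustrated with a minimal NumPy sketch of a LoRA-adapted linear layer. This is not CatVision's actual training code; the class name, shapes, and hyperparameters (`r`, `alpha`) are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A,
    scaled by alpha / r, as in LoRA fine-tuning."""

    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
        self.A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, r))               # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus low-rank adapter path; only A and B would be updated.
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale

layer = LoRALinear(d_in=32, d_out=16)
x = np.ones((2, 32))
# With B zero-initialised, the adapter contributes nothing before training,
# so the output equals the frozen base projection.
assert np.allclose(layer(x), x @ layer.W.T)
```

Because only `A` and `B` (rank `r`) receive gradients, the trainable parameter count stays small, which is why the approach fits the 32x A100-80G budget mentioned above.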