Zhang Hui committed on
Commit ff8d687
1 Parent(s): 83bb89e

update readme

Files changed (1):
  1. README.md +12 -9
README.md CHANGED
@@ -4,12 +4,15 @@ license: apache-2.0
 
 # CatVision
 
-
 ## Introduction
 
-A multimodal large-scale model, characterized by its open-source nature, closely emulates the functionalities of the GPT4V/Qwen-VL-Plus model. Built upon the foundation of Qwen-72b-Chat, CatVision in handling inputs that combine both images and text. This model is designed to effectively follow instructions for output formats, benefiting from the strengths of Qwen72b.
-
-An open-source multimodal large model that closely emulates the functionality of the GPT4V/Qwen-VL family of models. Built on Qwen-72b-Chat, CatVision can handle interleaved image/text inputs. The model is designed to follow output-format instructions effectively, benefiting from the strengths of Qwen72b.
+A multimodal large-scale model, characterized by its open-source nature, that closely emulates the functionality of the GPT4V/Qwen-VL-Plus models. Built upon the foundation of Qwen-72b-Chat, CatVision excels at handling inputs that combine both images and text. The model is designed to effectively follow output-format instructions, benefiting from the strengths of Qwen72b.
+
+An open-source multimodal large model that closely emulates the functionality of the GPT4V/Qwen-VL-PLUS family of models. Built on Qwen-72b-Chat, it can handle interleaved image-text inputs. Benefiting from the strengths of Qwen72b, the model is designed to follow output-format instructions effectively.
+
+Our model performs close to the closed-source Qwen-VL-PLUS on many datasets and significantly surpasses the performance of the open-source model Qwen-VL-7B-Chat.
+
+On many datasets, our model approaches the closed-source Qwen-VL-PLUS and substantially outperforms the open-source Qwen-VL-7B-Chat.
 
 - Our training approach consisted of two stages, inspired by LLaVA-1.5. In the initial stage, we trained the visual encoder + perceptual resampler, and in the second stage, we focused on training the large language model + perceptual resampler with instruction data. To work within limited computational resources (32xA100-80G), we used LoRA for training in both stages.
 
@@ -68,7 +71,7 @@ Our model achieved favorable results on many leaderboards.
 | Gemini Pro | 47.9 | ---- |
 | Yi-VL-34B | 45.9 | 41.6 |
 | Qwen-VL-PLUS | 45.2 | 40.8 |
-| **CatVision** | 45.9 | 40.1 |
+| **CatVision** | 45.9 | 40.1 |
 | Macro-VL | 41.2 | 40.4 |
 | InfiMM-Zephyr-7B | 39.4 | 35.5 |
 | Yi-VL-6B | 39.1 | 37.8 |

@@ -79,11 +82,11 @@ Our model achieved favorable results on many leaderboards.
 
 - **[CMMMU](https://github.com/CMMMU-Benchmark/CMMMU/blob/main/README.md)**
 
-| Model | Val (900) | Test (11K) |
+| Model | Val (900) | Test (11K) |
 |--------------------------------|:---------:|:------------:|
-| GPT-4V(ision) (Playground) | 42.5 | 43.7 |
+| GPT-4V(ision) (Playground) | 42.5 | 43.7 |
 | Qwen-VL-PLUS* | 39.5 | 36.8 |
-| **CatVision** | 39.6 | ---- |
+| **CatVision** | 39.6 | ---- |
 | Yi-VL-34B | 36.2 | 36.5 |
 | Yi-VL-6B | 35.8 | 35.0 |
 | Qwen-VL-7B-Chat | 30.7 | 31.3 |

@@ -104,7 +107,7 @@ Our model achieved favorable results on many leaderboards.
 | Qwen-VL-PLUS(BASE) | 83.3 | 83.2 | 82.7 | 81.5 | 77.6 |
 | GPT4v | 77.0 | 75.1 | 74.4 | 75.0 | 46.5 |
 | Qwen-VL-PLUS | 67.0 | 66.2 | 70.7 | 69.6 | 55.1 |
-| **CatVision** | 70.9 | 71.8 | 70.2 | 71.6 | 49.8 |
+| **CatVision** | 70.9 | 71.8 | 70.2 | 71.6 | 49.8 |
 | Qwen-VL-Chat | 61.8 | 60.6 | 56.3 | 56.7 | 41.2 |
 
 - **[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)**

@@ -113,7 +116,7 @@ Our model achieved favorable results on many leaderboards.
 |---------------|:----------:|:---------:|
 | GPT4v | 1409.43 | 517.14 |
 | Qwen-VL-PLUS | 1681.25 | 502.14 |
-| **CatVision** | 1560.90 | 366.43 |
+| **CatVision** | 1560.90 | 366.43 |
 | Qwen-VL-Chat | 1487.57 | 360.71 |
 
 - **OpenCompass**
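
The README's training note says LoRA was applied in both stages to fit within limited compute. As a rough illustration of the low-rank update that LoRA trains (this is a generic numpy sketch under invented shapes and hyperparameters, not CatVision's actual training code):

```python
import numpy as np

# LoRA idea: keep the base weight W (d_out x d_in) frozen and train only
# two small factors A (r x d_in) and B (d_out x r), adding their product
# scaled by alpha/r. Trainable parameters drop from d_in*d_out to
# r*(d_in + d_out). All sizes below are illustrative.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 32, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)

# With B zero-initialized, the adapted layer is exactly the frozen base layer,
# so training starts from the pretrained model's behavior.
assert np.allclose(lora_forward(x), W @ x)

print("trainable:", r * (d_in + d_out), "vs full:", d_in * d_out)
```

The zero-initialized `B` is the standard LoRA trick that makes the adapter a no-op at the start of training; the rank `r` and scale `alpha` chosen here are arbitrary.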