Zhang Hui committed on
Commit 83bb89e1
Parent(s): 3fd04ab
update readme
README.md
CHANGED
@@ -9,7 +9,7 @@ license: apache-2.0
 
 CatVision is an open-source multimodal large-scale model that closely emulates the capabilities of GPT4V/Qwen-VL-Plus. Built on Qwen-72b-Chat, it excels at handling inputs that combine images and text, and it is designed to follow output-format instructions effectively, benefiting from the strengths of Qwen-72b (a usage sketch follows the diff below).
 
-
+An open-source multimodal large model that closely emulates the capabilities of the GPT4V/Qwen-VL series. Built on Qwen-72b-Chat, CatVision can handle interleaved image/text inputs, and it is designed to follow output-format instructions effectively, benefiting from the strengths of Qwen-72b.
 
 - Our training approach consisted of two stages, inspired by LLaVA-1.5. In the first stage, we trained the visual encoder + perceptual resampler; in the second stage, we trained the large language model + perceptual resampler on instruction data. To work within our limited computational resources (32xA100-80G), we used LoRA in both stages (a minimal LoRA sketch follows the diff below).
 
@@ -68,7 +68,7 @@ Our model achieved favorable results on many leaderboards.
 | Gemini Pro | 47.9 | ---- |
 | Yi-VL-34B | 45.9 | 41.6 |
 | Qwen-VL-PLUS | 45.2 | 40.8 |
-
+| **CatVision** | 45.9 | 40.1 |
 | Macro-VL | 41.2 | 40.4 |
 | InfiMM-Zephyr-7B | 39.4 | 35.5 |
 | Yi-VL-6B | 39.1 | 37.8 |
@@ -83,7 +83,7 @@ Our model achieved favorable results on many leaderboards.
 |--------------------------------|:---------:|:------------:|
 | GPT-4V(ision) (Playground) | 42.5 | 43.7 |
 | Qwen-VL-PLUS* | 39.5 | 36.8 |
-
+| **CatVision** | 39.6 | ---- |
 | Yi-VL-34B | 36.2 | 36.5 |
 | Yi-VL-6B | 35.8 | 35.0 |
 | Qwen-VL-7B-Chat | 30.7 | 31.3 |
@@ -104,7 +104,7 @@ Our model achieved favorable results on many leaderboards.
 | Qwen-VL-PLUS(BASE) | 83.3 | 83.2 | 82.7 | 81.5 | 77.6 |
 | GPT4v | 77.0 | 75.1 | 74.4 | 75.0 | 46.5 |
 | Qwen-VL-PLUS | 67.0 | 66.2 | 70.7 | 69.6 | 55.1 |
-
+| **CatVision** | 70.9 | 71.8 | 70.2 | 71.6 | 49.8 |
 | Qwen-VL-Chat | 61.8 | 60.6 | 56.3 | 56.7 | 41.2 |
 
 - **[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)**
@@ -113,7 +113,7 @@ Our model achieved favorable results on many leaderboards.
 |---------------|:----------:|:---------:|
 | GPT4v | 1409.43 | 517.14 |
 | Qwen-VL-PLUS | 1681.25 | 502.14 |
-
+| **CatVision** | 1560.90 | 366.43 |
 | Qwen-VL-Chat | 1487.57 | 360.71 |
 
 - **OpenCompass**
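
The README above does not include inference code, so here is a minimal, hypothetical usage sketch of the capability it describes (interleaved image/text input plus output-format instructions). It assumes CatVision exposes a Qwen-VL-style remote-code interface (`tokenizer.from_list_format` / `model.chat`), which is plausible for a Qwen-72b-Chat derivative but is not confirmed by this commit; the checkpoint path is a placeholder.

```python
# Hypothetical usage sketch. The checkpoint path is a placeholder, and the
# Qwen-VL-style helpers (from_list_format, chat) are assumed, not documented here.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./CatVision"  # placeholder: local path or Hub id of the released weights

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", trust_remote_code=True
).eval()

# Interleaved image + text input, with an explicit output-format instruction.
query = tokenizer.from_list_format([
    {"image": "https://example.com/sample.jpg"},
    {"text": "Describe the image, then answer as JSON with keys 'objects' and 'scene'."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```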
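
The training note mentions two LoRA stages on top of Qwen-72b-Chat but gives no configuration. Below is a minimal sketch of how the second stage (LoRA on the language model) might be set up with the `peft` library; the rank, target module names, and all other hyperparameters are assumptions, and the perceptual resampler is omitted because it is not part of the public Qwen checkpoint.

```python
# Minimal, assumed second-stage setup: LoRA adapters on the Qwen-72B-Chat backbone.
# Hyperparameters and target module names are guesses, not taken from CatVision.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat",          # language backbone named in the README
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=64,                                 # assumed rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj"],  # Qwen-1 attention projections (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```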