JamesHujy committed
Commit 4f1e38f • 1 Parent(s): fca7c9c

Create README.md

Files changed (1): README.md (+156 lines)
---
language:
- en
- zh
---
<div align="center">

**VisCPM**

**Chinese-English bilingual multimodal large model series based on the CPM (Chinese Pretrained Models) base model**

<p align="center">
  <a href="https://github.com/OpenBMB/VisCPM">GitHub</a> •
  <a href="https://huggingface.co/openbmb/VisCPM-Paint">VisCPM-Paint</a>
</p>

</div>

`VisCPM` is a family of open-source large multimodal models that support multimodal conversation (the `VisCPM-Chat` model) and text-to-image generation (the `VisCPM-Paint` model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is trained on top of the 10B-parameter large language model [CPM-Bee](https://github.com/OpenBMB/CPM-Bee), fusing a visual encoder (Q-Former) and a visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the strong bilingual capability of CPM-Bee, `VisCPM` can be pre-trained on English multimodal data alone and still generalize well to Chinese, achieving promising Chinese multimodal capabilities. A minimal usage sketch follows the feature list below.

- **👍 Open-source usage**: VisCPM is free for personal and research use. By open-sourcing the VisCPM model family, we hope to promote the development of the open-source community of large multimodal models and related research.
- **🌟 Image and text generation coverage**: VisCPM models provide relatively comprehensive support for multimodal capabilities, covering both multimodal conversation (image-to-text generation) and text-to-image generation.
- **💫 Excellent bilingual performance**: Thanks to the excellent bilingual capability of the base language model CPM-Bee, VisCPM achieves outstanding results in both bilingual multimodal conversation and text-to-image generation.

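The snippet below is a minimal inference sketch for `VisCPM-Chat`. The class name `VisCPMChat`, its constructor arguments, and the return values of `chat` follow the examples in the [GitHub repository](https://github.com/OpenBMB/VisCPM) and should be treated as assumptions here; please check that repository for the exact, up-to-date API.

```python
# Minimal VisCPM-Chat usage sketch (interface names follow the OpenBMB/VisCPM
# GitHub examples; verify against the repository before relying on them).
from PIL import Image
from VisCPM import VisCPMChat  # provided by the OpenBMB/VisCPM repository

model_path = '/path/to/checkpoint'  # local VisCPM-Chat checkpoint
viscpm_chat = VisCPMChat(model_path, image_safety_checker=True)

image = Image.open('figures/vlu_case1.png').convert('RGB')
question = 'What is unusual about this image?'

# chat() returns the answer together with the accumulated dialogue context
answer, context, _ = viscpm_chat.chat(image, question)
print(answer)
```
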
## VisCPM-Chat
`VisCPM-Chat` supports bilingual multimodal conversations involving images in both Chinese and English. The model uses `Q-Former` as the visual encoder and `CPM-Bee` (10B) as the base LLM, connecting the visual and language models, and is optimized with the language modeling objective. Training consists of two stages: pretraining and instruction tuning.

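As a rough illustration of this design, the schematic below shows one common way such a fusion can be wired: Q-Former query embeddings are projected into the LLM embedding space and prepended to the text tokens as a visual prefix. The module names are placeholders, not the project's actual code.

```python
# Schematic of Q-Former -> LLM fusion (placeholder modules, not VisCPM's implementation).
import torch

def multimodal_forward(qformer, proj, llm, pixel_values, input_ids):
    vision_queries = qformer(pixel_values)               # (B, num_query, d_vis)
    vision_embeds = proj(vision_queries)                 # project to the LLM hidden size
    text_embeds = llm.get_input_embeddings()(input_ids)  # (B, seq_len, d_llm)
    # The visual prefix is concatenated in front of the text sequence; training
    # then uses the usual next-token language-modeling loss on the text part.
    inputs_embeds = torch.cat([vision_embeds, text_embeds], dim=1)
    return llm(inputs_embeds=inputs_embeds)
```
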
* Pretraining: `VisCPM-Chat` is pretrained on approximately 100M high-quality English image-text pairs, with data sources including CC3M, CC12M, COCO, Visual Genome, LAION, etc. In this stage, the language model parameters remain frozen and only the `Q-Former` parameters are updated, enabling efficient alignment of vision and language representations (see the parameter-freezing sketch after this list).

* Instruction Tuning: We use the [LLaVA-150K](https://llava-vl.github.io/) dataset of English multimodal instruction-following data, mixed with its Chinese translation, to fine-tune the model and align its multimodal capabilities with user intents. In this stage, all model parameters are updated to improve the data efficiency of instruction tuning. Interestingly, we observe that even when fine-tuning with English instruction data only, the model comprehends Chinese questions well but can only respond in English. This indicates that the model generalizes well across languages and modalities. By incorporating a small amount of translated Chinese data during instruction tuning, we can align the model's response language with the user's question language.

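The sketch below illustrates the two-stage parameter schedule described above: freeze the LLM and train only the Q-Former during pretraining, then unfreeze everything for instruction tuning. It is an assumed PyTorch-style illustration, not the project's training code.

```python
# Assumed sketch of the two-stage parameter schedule (not VisCPM's actual trainer).
def set_trainable(qformer, llm, stage: int) -> None:
    if stage == 1:
        # Pretraining: language model frozen, only the Q-Former is updated.
        for p in llm.parameters():
            p.requires_grad = False
        for p in qformer.parameters():
            p.requires_grad = True
    elif stage == 2:
        # Instruction tuning: all parameters are updated.
        for module in (qformer, llm):
            for p in module.parameters():
                p.requires_grad = True
    else:
        raise ValueError(f"unknown stage: {stage}")
```
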
We evaluate the model on the standard [LLaVA English test set](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and a [Chinese test set](data/translated_LLaVA_qa90) translated from it. The benchmark examines the model's performance in conversation, detailed description, and complex reasoning, using GPT-4 for scoring. `VisCPM-Chat` achieves the best average performance in Chinese multimodal capabilities, excelling in conversation and complex reasoning, while also demonstrating good English multimodal capabilities. We provide two versions of the model: `VisCPM-Chat-balance`, which balances English and Chinese ability, and `VisCPM-Chat-zhplus`, which places a stronger emphasis on Chinese proficiency. Both models use the same data during instruction tuning; `VisCPM-Chat-zhplus` additionally incorporates 20M cleaned native Chinese image-text pairs and 120M image-text pairs translated into Chinese during pretraining.

<table>
  <tr>
    <td align="center" rowspan="2" colspan="2">Model</td>
    <td align="center" rowspan="2">LLM Backbone</td>
    <td align="center" colspan="4">English</td>
    <td align="center" colspan="4">Chinese</td>
  </tr>
  <tr>
    <td align="center">Conversation</td>
    <td align="center">Detailed Description</td>
    <td align="center">Complex Reasoning</td>
    <td align="center">Avg</td>
    <td align="center">Conversation</td>
    <td align="center">Detailed Description</td>
    <td align="center">Complex Reasoning</td>
    <td align="center">Avg</td>
  </tr>
  <tr>
    <td align="center" rowspan="3">English Model</td>
    <td align="center">MiniGPT4</td>
    <td align="center">Vicuna-13B</td>
    <td align="center">65.0</td>
    <td align="center">67.3</td>
    <td align="center">76.6</td>
    <td align="center">69.7</td>
    <td align="center">-</td>
    <td align="center">-</td>
    <td align="center">-</td>
    <td align="center">-</td>
  </tr>
  <tr>
    <td align="center">InstructBLIP</td>
    <td align="center">Vicuna-13B</td>
    <td align="center">81.9</td>
    <td align="center">68.0</td>
    <td align="center">91.2</td>
    <td align="center">80.5</td>
    <td align="center">-</td>
    <td align="center">-</td>
    <td align="center">-</td>
    <td align="center">-</td>
  </tr>
  <tr>
    <td align="center">LLaVA</td>
    <td align="center">Vicuna-13B</td>
    <td align="center"><b>89.5</b></td>
    <td align="center"><b>70.4</b></td>
    <td align="center"><b>96.2</b></td>
    <td align="center"><b>85.6</b></td>
    <td align="center">-</td>
    <td align="center">-</td>
    <td align="center">-</td>
    <td align="center">-</td>
  </tr>
  <tr>
    <td align="center" rowspan="5">En-Zh Bilingual Model</td>
    <td align="center">mPLUG-Owl</td>
    <td align="center">LLaMA-7B</td>
    <td align="center">64.6</td>
    <td align="center">47.7</td>
    <td align="center">80.1</td>
    <td align="center">64.2</td>
    <td align="center">76.3</td>
    <td align="center">61.2</td>
    <td align="center">77.8</td>
    <td align="center">72.0</td>
  </tr>
  <tr>
    <td align="center">VisualGLM</td>
    <td align="center">ChatGLM-6B</td>
    <td align="center">62.4</td>
    <td align="center">63.0</td>
    <td align="center">80.6</td>
    <td align="center">68.7</td>
    <td align="center">76.6</td>
    <td align="center"><b>87.8</b></td>
    <td align="center">83.6</td>
    <td align="center">82.7</td>
  </tr>
  <tr>
    <td align="center">Ziya-Visual</td>
    <td align="center">Ziya-LLaMA-13B-v1</td>
    <td align="center">82.7</td>
    <td align="center">69.9</td>
    <td align="center">92.1</td>
    <td align="center">81.7</td>
    <td align="center">85.0</td>
    <td align="center">74.7</td>
    <td align="center">82.4</td>
    <td align="center">80.8</td>
  </tr>
  <tr>
    <td align="center">VisCPM-Chat-balance</td>
    <td align="center">CPMBee-10B</td>
    <td align="center">83.3</td>
    <td align="center">68.9</td>
    <td align="center">90.5</td>
    <td align="center">81.1</td>
    <td align="center"><b>92.7</b></td>
    <td align="center">76.1</td>
    <td align="center">89.2</td>
    <td align="center">86.3</td>
  </tr>
  <tr>
    <td align="center">VisCPM-Chat-zhplus</td>
    <td align="center">CPMBee-10B</td>
    <td align="center">80.1</td>
    <td align="center">65.7</td>
    <td align="center">92.5</td>
    <td align="center">79.6</td>
    <td align="center">90.3</td>
    <td align="center">81.4</td>
    <td align="center"><b>92.1</b></td>
    <td align="center"><b>88.2</b></td>
  </tr>
</table>

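The scores above come from GPT-4-based judging on the LLaVA-style benchmark. As an illustration only, the sketch below shows the general shape of such a relative-scoring computation (a judge model rates the reference answer and the candidate answer, and the reported number is the candidate-to-reference ratio); the exact prompts and protocol behind this table are defined by the LLaVA evaluation scripts, not by this sketch.

```python
# Illustrative relative-scoring helper (assumed LLaVA-style protocol, not the
# exact evaluation code behind the table above).
def relative_score(reference_rating: float, candidate_rating: float) -> float:
    """Return the candidate's score as a percentage of the reference rating."""
    return 100.0 * candidate_rating / reference_rating

# Example: if a judge rates the reference answer 8.0 and the model's answer 7.2,
# the reported relative score is 90.0.
print(relative_score(8.0, 7.2))  # 90.0
```
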
## 📝 License

VisCPM is governed by the [GML License](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E9%9D%9E%E5%95%86%E4%B8%9A%E5%8C%96.md), which permits individual and research use. If you intend to use the model for commercial purposes, please contact cpm@modelbest.cn to negotiate commercial licensing.

The CPM-Bee base model, governed by the [General Model License (GML)](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md), permits commercial use. If you intend to use the model for commercial purposes, please contact cpm@modelbest.cn to obtain the certificate of authorization.