README.md · openbmb/VisCPM-Paint at main

metadata

language:
  - en
  - zh

VisCPM

Chinese-English bilingual multi-modal large model series based on CPM (Chinese Pretrained Models) basic model

Github • VisCPM-Chat

VisCPM is a family of open-source large multimodal models, which support multimodal conversational capabilities (VisCPM-Chat model) and text-to-image generation capabilities (VisCPM-Paint model) in both Chinese and English, achieving state-of-the-art peformance among Chinese open-source multimodal models. VisCPM is trained based on the large language model CPM-Bee with 10B parameters, fusing visual encoder (Q-Former) and visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the good bilingual capability of CPM-Bee, VisCPM can be pre-trained with English multimodal data only and well generalize to achieve promising Chinese multimodal capabilities.

👐 Open-source Usage: VisCPM is free to be used for personal and research purposes. By open-sourcing the VisCPM model family, we hope to promote the development of the open-source community of large multimodal models and related research.
🌟 Image and text generation coverage: VisCPM models provide relatively comprehensive support for image and text multimodal capabilities, covering both multimodal conversation (image-to-text generation) capabilities and text-to-image generation capabilities.
💫 Excellent bilingual performance: Thanks to the excellent bilingual capability of the base language model CPM-Bee, VisCPM achieves outstanding results in both bilingual multimodal conversation and text-to-image generation.

VisCPM-Paint

VisCPM-Paint supports bilingual text-to-image generation. The model uses CPM-Bee as the text encoder, UNet as the image decoder, and fuses vision and language models using the objective of diffusion model. During the training process, the parameters of the language model remain fixed. The visual decoder is initialized with the parameters of Stable Diffusion 2.1, and it is fused with the language model by gradually unfreezing key bridging parameters. The model is trained on the LAION 2B English text-image pair dataset.

Similar to VisCPM-Chat, we found that due to the bilingual capability of CPM-Bee, VisCPM-Paint can achieve good Chinese text-to-image generation by training only on English text-image pairs, surpassing the performance of Chinese open-source models. By incorporating an additional 20M cleaned native Chinese text-image pairs and 120M translated text-image pairs in Chinese, the model's Chinese text-to-image generation ability can be further improved. We sample 30,000 images from the standard image generation test set MSCOCO and calculated commonly used evaluation metrics FID (Fréchet Inception Distance) to assess the quality of generated images. Similarly, we provide two versions of the model, namely VisCPM-Paint-balance and VisCPM-Paint-zhplus. The former has a balanced ability in both English and Chinese, while the latter emphasizes Chinese proficiency. VisCPM-Paint-balance is trained only using English text-image pairs, while VisCPM-Paint-zhplus incorporates an additional 20M native Chinese text-image pairs and 120M translated text-image pairs in Chinese based on VisCPM-Paint-balance.

How to Use

#!/usr/bin/env python
# encoding: utf-8
from diffusers import DiffusionPipeline
from transformers import AutoModel
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained('openbmb/VisCPM-Paint', trust_remote_code=True)
text_encoder = AutoModel.from_pretrained('openbmb/VisCPM-Paint', trust_remote_code=True)
print('load pipeline')
pipeline = DiffusionPipeline.from_pretrained('openbmb/VisCPM-Paint', custom_pipeline="openbmb/VisCPM-Paint", text_encoder=text_encoder, tokenizer=tokenizer)

pipeline = pipeline.to('cuda')

prompt = "a photo of an astronaut riding a horse on mars"
image = pipeline(prompt).images[0]

image.save("astronaut_rides_horse.png")

📝 License

VisCPM is governed by the GML License, and permits individual and research usages. If you intend to utilize the model for commercial purposes, please reach out to cpm@modelbest.cn to negotiate commercial licensing.

The CPM-Bee base, governed by the General Model License (GML), permits commercial usage. If you intend to utilize the model for commercial purposes, please reach out to cpm@modelbest.cn to obtain the certificate of authorization.