
介绍 Introduction

不同风格、不同prompt的生成效果展示 Generation results across different styles and prompts

文生图模型如谷歌的Imagen、OpenAI的DALL-E 3和Stability AI的Stable Diffusion引领了AIGC和数字艺术创作的新浪潮。然而,基于SD v1.5的中文文生图模型,如Taiyi-Diffusion-v0.1和Alt-Diffusion的效果仍然一般。中国的许多AI绘画平台仅支持英文,或依赖中译英的翻译工具。目前的开源文生图模型主要支持英文,双语支持有限。我们的工作,Taiyi-Diffusion-XL(Taiyi-XL),在这些发展的基础上,专注于保留英文理解能力的同时增强中文文生图生成能力,更好地支持双语文生图。

The surge of text-to-image models such as Google's Imagen, OpenAI's DALL-E 3, and Stability AI's Stable Diffusion has driven a new wave of AIGC and digital art creation. However, Chinese text-to-image models built on SD v1.5, such as Taiyi-Diffusion-v0.1 and Alt-Diffusion, remain only moderately effective. Many AI art platforms in China support only English or rely on Chinese-to-English translation tools, and current open-source text-to-image models predominantly support English, with limited bilingual capability. Our work, Taiyi-Diffusion-XL (Taiyi-XL), builds on these developments and focuses on enhancing Chinese text-to-image generation while retaining English proficiency, providing better support for bilingual text-to-image generation.

模型训练 Model Training

Taiyi-Diffusion-XL训练过程 The Taiyi-Diffusion-XL training process

Taiyi-Diffusion-XL文生图模型训练主要包括了3个阶段。首先,我们制作了一个高质量的图文对数据集,每张图片都配有详细的描述性文本。为了克服网络爬取数据的局限性,我们使用先进的视觉-语言大模型生成准确描述图片的caption。这种方法丰富了我们的数据集,确保了相关性和细节。然后,我们从预训练的英文CLIP模型开始,为了更好地支持中文和长文本我们扩展了模型的词表和位置编码,通过大规模双语数据集扩展其双语能力。训练涉及对比损失函数和内存高效的方法。最后,我们基于Stable-Diffusion-XL,替换了第二阶段获得的text encoder,在第一阶段获得的数据集上进行扩散模型的多分辨率、多宽高比训练。

The training of the Taiyi-Diffusion-XL text-to-image model comprises three main stages. First, we built a high-quality dataset of image-text pairs in which each image is accompanied by a detailed descriptive caption. To overcome the limitations of web-crawled data, we used advanced vision-language large models to generate accurate captions that precisely describe the images, enriching the dataset and ensuring relevance and detail. Second, starting from a pre-trained English CLIP model, we expanded its vocabulary and position embeddings to better support Chinese and longer texts, and extended its bilingual capability by training on a large-scale bilingual dataset with a contrastive loss and a memory-efficient approach. Finally, based on Stable-Diffusion-XL, we replaced its text encoder with the bilingual encoder obtained in the second stage and trained the diffusion model on the dataset from the first stage at multiple resolutions and aspect ratios. This process yields a model that supports bilingual text-to-image generation across diverse linguistic and visual requirements.
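The result of these stages is an SDXL-style checkpoint whose text encoder accepts both Chinese and English prompts. As a rough illustration rather than an official usage guide, the sketch below assumes the released weights follow the standard diffusers SDXL layout and can be loaded with `StableDiffusionXLPipeline`; the prompts and sampling settings are placeholders.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Assumption: the checkpoint is packaged in the standard diffusers SDXL layout.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B",
    torch_dtype=torch.float16,
).to("cuda")

# The same pipeline should accept either language; these prompts are placeholders.
prompt_zh = "一只小猫咪在雪地里玩耍,摄影风格,细节丰富"
prompt_en = "a kitten playing in the snow, photographic style, highly detailed"

for name, prompt in [("zh", prompt_zh), ("en", prompt_en)]:
    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
    image.save(f"taiyi_xl_{name}.png")
```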

模型评估 Model Evaluation

机器评估 Machine Evaluation

我们的机器评估包括了对不同模型的全面比较。评估指标包括CLIP相似度(CLIP Sim)、IS和FID,为每个模型在图像质量、多样性和与文本描述的对齐方面提供了全面的评估。在英文数据集(COCO)中,Taiyi-XL在所有指标上表现优异,获得了最好的CLIP Sim、IS和FID得分。这表明Taiyi-XL在生成与英文文本提示紧密对齐的图像方面非常有效,同时保持了高图像质量和多样性。同样,在中文数据集(COCO-CN)中,Taiyi-XL也超越了其他模型,展现了其强大的双语能力。

Our machine evaluation involved a comprehensive comparison of various models. The evaluation metrics included CLIP Similarity (CLIP Sim), Inception Score (IS), and Fréchet Inception Distance (FID), providing a robust assessment of each model's image quality, diversity, and alignment with textual descriptions. On the English dataset (COCO), Taiyi-XL performed best across all metrics, achieving the highest CLIP Sim and IS and the lowest FID. This indicates Taiyi-XL's effectiveness in generating images closely aligned with English text prompts while maintaining high image quality and diversity. Similarly, on the Chinese dataset (COCO-CN), Taiyi-XL outperformed the other models, showcasing its robust bilingual capabilities.

Table: Comparison of different models based on CLIP Sim, IS, and FID across English (COCO) and Chinese (COCO-CN) datasets

| Model | CLIP Sim($\uparrow$) | FID($\downarrow$) | IS($\uparrow$) |
|---|---|---|---|
| **English Dataset (COCO)** | | | |
| Alt-Diffusion | 0.220 | 27.600 | 31.577 |
| SD-v1.5 | 0.225 | 25.342 | 32.876 |
| SD-XL | 0.231 | 23.887 | 33.793 |
| Taiyi-XL | **0.254** | **22.543** | **35.465** |
| **Chinese Dataset (COCO-CN)** | | | |
| Taiyi-v0.1 | 0.197 | 69.226 | 21.060 |
| Alt-Diffusion | 0.220 | 68.488 | 22.126 |
| Pai-Diffusion | 0.196 | 72.572 | 19.145 |
| Taiyi-XL | **0.225** | **67.675** | **22.965** |

The best results are marked in bold.
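For reference, CLIP Sim is simply the cosine similarity between the CLIP embeddings of a generated image and its prompt. Below is a minimal sketch of this metric using the Hugging Face transformers CLIP implementation; the checkpoint name and example file are stand-ins (this card does not specify the exact evaluation encoder), and scoring Chinese prompts would require a bilingual CLIP instead.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in evaluation encoder; the exact CLIP checkpoint behind the reported
# numbers is not specified in this card.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_sim(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))

# Example: score one generated image against the prompt that produced it.
score = clip_sim(Image.open("taiyi_xl_en.png"), "a kitten playing in the snow")
print(f"CLIP Sim: {score:.3f}")
```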

人类偏好评估 Human Preference Evaluation

如下图所示,我们比较了不同模型在中英文文生图生成方面的表现。XL版本的模型,如SD-XL和Taiyi-XL,相比SD-v1.5和Alt-Diffusion等1.5版本的模型有显著提升。DALL-E 3以其生动的色彩和prompt-following能力而著称。Taiyi-XL模型偏向生成摄影风格的图片,与Midjourney较为类似,并且在双语(中英文)文生图生成方面表现更出色。

As shown in the figures below, a comparison of different models in Chinese and English text-to-image generation performance is presented. The XL version models, such as SD-XL and Taiyi-XL, show significant improvements over the 1.5 version models like SD-v1.5 and Alt-Diffusion. DALL-E 3 is renowned for its vibrant colors and its ability to closely follow text prompts, setting a high standard. Our Taiyi-XL model, with its photographic style, closely matches the performance of Midjourney and excels in bilingual (Chinese and English) text-to-image generation.

尽管Taiyi-XL可能还未能与商业模型相媲美,但它比当前的双语开源模型优越不少。我们认为我们的模型与商业模型的差距主要归因于训练数据在数量、质量和多样性上的差异。我们的模型仅使用学术数据集和符合版权要求的图文数据进行训练,未使用Midjourney和DALL-E 3等模型生成的数据。正如大家所知,版权问题仍然是文生图和AIGC模型面临的最大问题。此外,由于数据限制,对于中国人像或中国元素,我们也希望开源社区在更多数据上进一步微调。

Although Taiyi-XL may not yet rival commercial models, it stands out among current bilingual open-source models. The gap with commercial models is mainly due to differences in the quantity, quality, and diversity of training data: our model is trained exclusively on academic datasets and copyright-compliant image-text data, without images generated by models such as Midjourney or DALL-E 3. As is well known, copyright remains the biggest challenge for text-to-image and AI-generated content (AIGC) models. Owing to these data limitations, we also hope the open-source community will further fine-tune the model on additional data for Chinese portraits and Chinese elements.

不同模型中文文生图生成性能比较 Comparison of different models on Chinese text-to-image generation

不同模型在英文文生图生成性能比较 Comparison of different models on English text-to-image generation

我们还评估了使用潜在一致性模型(LCM)加速图像生成过程的影响。测试显示,随着推理步数减少,图像质量会下降。生成过程扩展到8步基本可以确保生成图像的质量;限制为1步时,生成的图像主要只展示基本轮廓,缺乏更细致的细节。这一发现表明,LCM可以有效加速生成过程,但需要在步数和所需图像质量之间找到平衡。

We also evaluated the impact of using Latent Consistency Models (LCM) to accelerate the image generation process. The tests showed that as the number of inference steps decreases, the image quality declines. Extending the generation process to 8 steps generally ensures the quality of the generated images; when limited to a single step, the images mainly display basic outlines and lack finer details. This finding suggests that while LCM can effectively speed up the generation process, a balance must be struck between the number of steps and the desired image quality.
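As a rough sketch of how such an LCM-accelerated run could look with diffusers (the card does not publish the exact setup, so the choice of LCM-LoRA and its compatibility with Taiyi-XL are assumptions, and the prompt is a placeholder):

```python
import torch
from diffusers import LCMScheduler, StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B", torch_dtype=torch.float16
).to("cuda")

# Swap in the LCM scheduler and a generic SDXL LCM-LoRA.
# Assumption: this community LoRA works with Taiyi-XL's modified text encoder.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

prompt = "故宫雪景,摄影风格"  # placeholder: "snowy Forbidden City, photographic style"

# Per the observation above, around 8 steps keeps quality reasonable,
# while a single step yields only rough outlines.
for steps in (1, 4, 8):
    image = pipe(prompt, num_inference_steps=steps, guidance_scale=1.5).images[0]
    image.save(f"taiyi_xl_lcm_{steps}steps.png")
```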

Taiyi-XL使用LCM - 中英文 Taiyi-XL with LCM, Chinese and English

引用 Citation

如果您在您的工作中使用了我们的模型,可以引用我们的论文:

If you use this resource in your work, please cite our paper:

@misc{wu2024taiyidiffusionxl,
      title={Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support}, 
      author={Xiaojun Wu and Dixiang Zhang and Ruyi Gan and Junyu Lu and Ziwei Wu and Renliang Sun and Jiaxing Zhang and Pingjian Zhang and Yan Song},
      year={2024},
      eprint={2401.14688},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@article{fengshenbang,
  author    = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}