wuxiaojun committed
Commit 89e2ce6
1 parent: 5da2d8e

init readme

Files changed (2)
  1. README.md +2 -0
  2. imgs/overview_00.png +3 -0
README.md CHANGED
@@ -22,6 +22,8 @@ The surge in text-to-image models like Google's Imagen, OpenAI's DALL-E 3, and S
 
 # 模型训练 Model Training
 
+![Taiyi-Diffusion-XL训练过程](imgs/overview_00.png)
+
 Taiyi-Diffusion-XL文生图模型训练主要包括了3个阶段。首先,我们制作了一个高质量的图文对数据集,每张图片都配有详细的描述性文本。为了克服网络爬取数据的局限性,我们使用先进的视觉-语言大模型生成准确描述图片的caption。这种方法丰富了我们的数据集,确保了相关性和细节。然后,我们从预训练的英文CLIP模型开始,为了更好地支持中文和长文本我们扩展了模型的词表和位置编码,通过大规模双语数据集扩展其双语能力。训练涉及对比损失函数和内存高效的方法。最后,我们基于Stable-Diffusion-XL,替换了第二阶段获得的text encoder,在第一阶段获得的数据集上进行扩散模型的多分辨率、多宽高比训练。
 
 The training of the Taiyi-Diffusion-XL text-to-image model comprises three main stages. First, we built a high-quality dataset of image-text pairs in which every image is accompanied by a detailed descriptive caption; to overcome the limitations of web-crawled data, we used advanced vision-language large models to generate accurate captions, enriching the dataset and ensuring relevance and detail. Second, starting from a pre-trained English CLIP model, we expanded its vocabulary and position encoding to better support Chinese and longer texts, and extended its bilingual capability by training on a large-scale bilingual dataset with a contrastive loss and a memory-efficient approach. Finally, starting from Stable-Diffusion-XL, we replaced its text encoder with the bilingual encoder obtained in the second stage and trained the diffusion model on the first-stage dataset at multiple resolutions and aspect ratios. This training process ensures that Taiyi-Diffusion-XL effectively supports bilingual text-to-image generation, catering to diverse linguistic and visual requirements.
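The second stage described above (expanding a pre-trained English CLIP text encoder's vocabulary and position encoding) can be sketched concretely. The following is a minimal, illustrative sketch assuming a Hugging Face CLIP checkpoint and the `transformers` library; the checkpoint name, the added tokens, and the new context length of 248 are placeholders, not the values actually used for Taiyi-Diffusion-XL:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Start from a pre-trained English CLIP text encoder (illustrative checkpoint).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1) Expand the vocabulary with Chinese tokens (placeholder list) and allocate
#    embedding rows for them; the new rows are then learned with a contrastive
#    loss on bilingual image-text pairs.
tokenizer.add_tokens(["太乙", "水墨画", "山水"])
text_encoder.resize_token_embeddings(len(tokenizer))

# 2) Expand the position embeddings beyond CLIP's 77-token limit so the encoder
#    can read longer captions; here the pretrained 77-slot table is linearly
#    interpolated up to an assumed 248 slots.
new_max = 248                                          # assumed context length
emb = text_encoder.text_model.embeddings
old_table = emb.position_embedding.weight.data         # (77, hidden)
new_table = torch.nn.functional.interpolate(
    old_table.T.unsqueeze(0), size=new_max, mode="linear", align_corners=False
).squeeze(0).T                                         # (new_max, hidden)
emb.position_embedding = torch.nn.Embedding.from_pretrained(new_table, freeze=False)
emb.position_ids = torch.arange(new_max).unsqueeze(0)  # overwrite the buffer
text_encoder.config.max_position_embeddings = new_max
tokenizer.model_max_length = new_max
```

After this architectural surgery, the new token rows and position slots would be trained contrastively on bilingual image-text data as described above; the sketch only sets up the modules.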
imgs/overview_00.png ADDED

Git LFS Details

  • SHA256: d90e0290816440838054e8632decf7e07247072f76bb61b8f97bf3eaa9d6c7f4
  • Pointer size: 132 Bytes
  • Size of remote file: 4.98 MB
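For orientation, this is how an SDXL-family checkpoint such as this one is commonly loaded with `diffusers`. The repository id below is a placeholder (the model card has the published one), and the sketch assumes the checkpoint is distributed in standard diffusers format:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Placeholder repo id -- substitute the id from the model card.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Diffusion-XL",          # assumed, not verified
    torch_dtype=torch.float16,
).to("cuda")

# Bilingual prompting: the model is trained on Chinese and English captions,
# and multi-aspect-ratio training allows non-square output sizes.
image = pipe(
    prompt="一幅壮丽的山水画,晨雾缭绕",      # "a majestic landscape painting, morning mist"
    height=1024,
    width=768,
    num_inference_steps=30,
).images[0]
image.save("sample.png")
```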