init readme

- README.md (+2 -0)
- imgs/overview_00.png (+3 -0)

README.md (CHANGED)
@@ -22,6 +22,8 @@ The surge in text-to-image models like Google's Imagen, OpenAI's DALL-E 3, and S

# 模型训练 Model Training

+![Taiyi-Diffusion-XL训练过程](imgs/overview_00.png)
+
Taiyi-Diffusion-XL文生图模型训练主要包括3个阶段。首先,我们制作了一个高质量的图文对数据集,每张图片都配有详细的描述性文本。为了克服网络爬取数据的局限性,我们使用先进的视觉-语言大模型生成准确描述图片的caption。这种方法丰富了我们的数据集,确保了相关性和细节。然后,我们从预训练的英文CLIP模型开始,为了更好地支持中文和长文本,我们扩展了模型的词表和位置编码,并在大规模双语数据集上训练其双语能力;训练采用对比损失函数和内存高效的方法。最后,我们基于Stable-Diffusion-XL,将其text encoder替换为第二阶段获得的双语text encoder,并在第一阶段获得的数据集上进行扩散模型的多分辨率、多宽高比训练。

The training of the Taiyi-Diffusion-XL text-to-image model encompasses three main stages. First, we created a high-quality dataset of image-text pairs, with each image accompanied by a detailed descriptive text. To overcome the limitations of web-crawled data, we employed advanced vision-language large models to generate accurate captions that precisely describe the images. This approach enriched our dataset, ensuring relevance and detail. Next, we started from a pre-trained English CLIP model and expanded its vocabulary and position encoding to better support Chinese and longer texts; this expansion was trained on a large-scale bilingual dataset, using a contrastive loss function and a memory-efficient approach. Finally, starting from Stable-Diffusion-XL, we replaced its text encoder with the bilingual encoder obtained in the second stage and conducted multi-resolution, multi-aspect-ratio training of the diffusion model on the dataset prepared in the first stage. This comprehensive training process ensures that Taiyi-Diffusion-XL effectively supports bilingual text-to-image generation, catering to diverse linguistic and visual requirements.
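
To make the first stage concrete, here is a minimal captioning sketch. The text above does not name the vision-language model used, so BLIP-2 stands in as an illustrative choice; the checkpoint id and image path are placeholders, not the project's actual setup.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative captioner only; substitute whichever VLM the project used.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
ids = model.generate(**inputs, max_new_tokens=60)
caption = processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
print(caption)
```

Each generated caption is then paired with its image to form the stage-1 training set.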
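The second stage hinges on growing the English CLIP text encoder's vocabulary and position table. Below is a minimal sketch of that surgery with `transformers`, assuming a stock CLIP checkpoint, a placeholder token list, and an arbitrary 256-token target length; the real procedure trains the new rows on the bilingual corpus rather than leaving this naive initialization.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1) Add Chinese tokens (placeholders) and grow the token-embedding table.
tokenizer.add_tokens(["太乙", "扩散", "绘画"])
text_encoder.resize_token_embeddings(len(tokenizer))

# 2) Stretch the position-embedding table past CLIP's 77-token limit so the
#    encoder can attend over longer prompts; new rows get a naive init here.
emb = text_encoder.text_model.embeddings
old_pos = emb.position_embedding
new_len = 256  # assumed target length
new_pos = torch.nn.Embedding(new_len, old_pos.embedding_dim)
with torch.no_grad():
    new_pos.weight[: old_pos.num_embeddings] = old_pos.weight
    new_pos.weight[old_pos.num_embeddings :] = old_pos.weight[-1:]
emb.position_embedding = new_pos
emb.register_buffer(
    "position_ids", torch.arange(new_len).unsqueeze(0), persistent=False
)
text_encoder.config.max_position_embeddings = new_len
tokenizer.model_max_length = new_len
```

The extended encoder and tokenizer are then trained with a contrastive (CLIP-style) loss on the bilingual image-text data; in the third stage they replace the corresponding encoder inside Stable-Diffusion-XL before diffusion training.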
imgs/overview_00.png
ADDED (stored with Git LFS)
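
Once the third stage finishes, the result is an SDXL-style pipeline whose prompts can be Chinese or English. A hedged usage sketch with `diffusers` follows; the repository id below is an assumption, so substitute the officially released Taiyi-Diffusion-XL checkpoint.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Assumed checkpoint id; replace with the official release if it differs.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B", torch_dtype=torch.float16
).to("cuda")

# A Chinese prompt, handled by the bilingual text encoder.
image = pipe("一幅水墨画:远山、孤舟、夕阳", num_inference_steps=30).images[0]
image.save("sample.png")
```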