wuxiaojun committed
Commit 89e2ce6
1 parent: 5da2d8e

init readme

Files changed (2)
  1. README.md +2 -0
  2. imgs/overview_00.png +3 -0
README.md CHANGED
@@ -22,6 +22,8 @@ The surge in text-to-image models like Google's Imagen, OpenAI's DALL-E 3, and S
 
 # 模型训练 Model Training
 
+![Taiyi-Diffusion-XL训练过程](imgs/overview_00.png)
+
 Taiyi-Diffusion-XL文生图模型训练主要包括了3个阶段。首先,我们制作了一个高质量的图文对数据集,每张图片都配有详细的描述性文本。为了克服网络爬取数据的局限性,我们使用先进的视觉-语言大模型生成准确描述图片的caption。这种方法丰富了我们的数据集,确保了相关性和细节。然后,我们从预训练的英文CLIP模型开始,为了更好地支持中文和长文本我们扩展了模型的词表和位置编码,通过大规模双语数据集扩展其双语能力。训练涉及对比损失函数和内存高效的方法。最后,我们基于Stable-Diffusion-XL,替换了第二阶段获得的text encoder,在第一阶段获得的数据集上进行扩散模型的多分辨率、多宽高比训练。
 
 The training of the Taiyi-Diffusion-XL text-to-image model comprises three main stages. First, we built a high-quality dataset of image-text pairs in which every image is accompanied by a detailed descriptive caption; to overcome the limitations of web-crawled data, we used advanced vision-language large models to generate accurate captions, enriching the dataset and ensuring relevance and detail. Second, starting from a pre-trained English CLIP model, we expanded its vocabulary and position encoding to better support Chinese and longer texts, and extended its bilingual capability by training on a large-scale bilingual dataset with a contrastive loss and a memory-efficient approach. Finally, starting from Stable-Diffusion-XL, we replaced its text encoder with the bilingual encoder obtained in the second stage and trained the diffusion model on the first-stage dataset at multiple resolutions and aspect ratios. This training process ensures that Taiyi-Diffusion-XL effectively supports bilingual text-to-image generation, catering to diverse linguistic and visual requirements.
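The second stage described above (expanding a pre-trained English CLIP text encoder's vocabulary and position encoding) can be sketched concretely. The following is a minimal, illustrative sketch assuming a Hugging Face CLIP checkpoint and the `transformers` library; the checkpoint name, the added tokens, and the new context length of 248 are placeholders, not the values actually used for Taiyi-Diffusion-XL:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Start from a pre-trained English CLIP text encoder (illustrative checkpoint).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1) Expand the vocabulary with Chinese tokens (placeholder list) and allocate
#    embedding rows for them; the new rows are then learned with a contrastive
#    loss on bilingual image-text pairs.
tokenizer.add_tokens(["太乙", "水墨画", "山水"])
text_encoder.resize_token_embeddings(len(tokenizer))

# 2) Expand the position embeddings beyond CLIP's 77-token limit so the encoder
#    can read longer captions; here the pretrained 77-slot table is linearly
#    interpolated up to an assumed 248 slots.
new_max = 248                                          # assumed context length
emb = text_encoder.text_model.embeddings
old_table = emb.position_embedding.weight.data         # (77, hidden)
new_table = torch.nn.functional.interpolate(
    old_table.T.unsqueeze(0), size=new_max, mode="linear", align_corners=False
).squeeze(0).T                                         # (new_max, hidden)
emb.position_embedding = torch.nn.Embedding.from_pretrained(new_table, freeze=False)
emb.position_ids = torch.arange(new_max).unsqueeze(0)  # overwrite the buffer
text_encoder.config.max_position_embeddings = new_max
tokenizer.model_max_length = new_max
```

After this architectural surgery, the new token rows and position slots would be trained contrastively on bilingual image-text data as described above; the sketch only sets up the modules.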
imgs/overview_00.png ADDED

Git LFS Details

  • SHA256: d90e0290816440838054e8632decf7e07247072f76bb61b8f97bf3eaa9d6c7f4
  • Pointer size: 132 Bytes
  • Size of remote file: 4.98 MB
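For orientation, this is how an SDXL-family checkpoint such as this one is commonly loaded with `diffusers`. The repository id below is a placeholder (the model card has the published one), and the sketch assumes the checkpoint is distributed in standard diffusers format:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Placeholder repo id -- substitute the id from the model card.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Diffusion-XL",          # assumed, not verified
    torch_dtype=torch.float16,
).to("cuda")

# Bilingual prompting: the model is trained on Chinese and English captions,
# and multi-aspect-ratio training allows non-square output sizes.
image = pipe(
    prompt="一幅壮丽的山水画,晨雾缭绕",      # "a majestic landscape painting, morning mist"
    height=1024,
    width=768,
    num_inference_steps=30,
).images[0]
image.save("sample.png")
```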