shunxing1234 committed
Commit 5c9b4ab • Parent: 3328559
Upload 10 files
- .gitattributes +3 -0
- README.md +47 -3
- imgs/boy.SVG +0 -0
- imgs/boy.png +3 -0
- imgs/chinese_samples.png +3 -0
- imgs/chinese_samples.svg +0 -0
- imgs/corgi_dog.SVG +0 -0
- imgs/corgi_dog.png +3 -0
- imgs/long1.SVG +0 -0
- imgs/long2.SVG +0 -0
- imgs/model.png +0 -0
.gitattributes
CHANGED
@@ -32,3 +32,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+imgs/boy.png filter=lfs diff=lfs merge=lfs -text
+imgs/chinese_samples.png filter=lfs diff=lfs merge=lfs -text
+imgs/corgi_dog.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
## Model Information

We have trained AltDiffusion-m18, the first multilingual Stable Diffusion (SD) model to support 18 languages: English, Chinese, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, and Russian.
### Training Method

As shown in Figure 1, training consists of two stages: concept alignment and quality improvement. We first replaced the original OpenCLIP in SD with the multilingual CLIP model AltCLIP-m18 and froze AltCLIP's parameters. In the first stage, at 256\*256 image resolution, we trained only the k and v projection matrices of the UNet's cross-attention layers to align text and image concepts. In the second stage, at 512\*512 image resolution, we trained all UNet parameters to improve generation quality.
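The stage-1 recipe above (freeze everything, then unfreeze only the cross-attention k/v projections that consume the new text-encoder output) can be sketched roughly as follows. This is a minimal illustration with made-up module and attribute names (`CrossAttention`, `to_q`/`to_k`/`to_v`, following common diffusion-model code), not the actual AltDiffusion training code:

```python
import torch
from torch import nn

# Hypothetical stand-in for the UNet's cross-attention layers; the
# to_q/to_k/to_v naming mirrors common diffusion-model implementations.
class CrossAttention(nn.Module):
    def __init__(self, dim=64, ctx_dim=77):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)      # queries from image features
        self.to_k = nn.Linear(ctx_dim, dim)  # keys from text embeddings
        self.to_v = nn.Linear(ctx_dim, dim)  # values from text embeddings

unet = nn.ModuleList([CrossAttention() for _ in range(2)])

# Stage 1: freeze all parameters, then unfreeze only the k/v projections.
for p in unet.parameters():
    p.requires_grad = False
for module in unet.modules():
    if isinstance(module, CrossAttention):
        for p in module.to_k.parameters():
            p.requires_grad = True
        for p in module.to_v.parameters():
            p.requires_grad = True

trainable = [n for n, p in unet.named_parameters() if p.requires_grad]
```

Only the parameters in `trainable` would then be handed to the optimizer, so the text-image alignment is learned without disturbing the pretrained generative weights.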
<img src="/imgs/model.png" alt="Illustration of AltDiffusion" style="zoom:35%;" />

<center>
Fig. 1: Illustration of AltDiffusion
</center>
### Training Data

In the first stage, we trained on LAION 5B-en (2.32B) from [LAION 5B](https://laion.ai/blog/laion-5b/) together with LAION 5B-multi (1.8B) filtered to the 18 languages. In the second stage, we trained on LAION Aesthetics V1-en (52M) from [LAION Aesthetics V1](https://laion.ai/blog/laion-aesthetics/) together with LAION Aesthetics V1-multi (46M) filtered to the 18 languages.
### Training Details

Optimizer: AdamW

Learning rate: 1e-4, with 10k warmup steps

Hardware: 64 NVIDIA A100-SXM4-40GB GPUs
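The stated optimizer settings (AdamW at 1e-4 with a 10k-step warmup) could be wired up as in the sketch below. The linear warmup shape is an assumption; only the peak learning rate and step count come from this model card:

```python
import torch

# Toy model standing in for the UNet; only the schedule matters here.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 10_000
# Linearly ramp the LR from ~0 to the peak over warmup_steps, then hold.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)
```

In a training loop, `scheduler.step()` would be called once per optimizer step, so the learning rate reaches 1e-4 after 10k steps and stays there.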
In the first stage, we initialized all parameters except the language model from the SD v2.1 512-base-ema checkpoint and trained on LAION 5B-en and LAION 5B-multi for 330k steps at 256\*256 resolution with a batch size of 3072, taking about 8 days. In the second stage, we resumed from the 330k-step checkpoint and trained on LAION Aesthetics V1-en and V1-multi for 270k steps at 512\*512 resolution with a batch size of 3840, taking about 7 days. We then continued from the 270k-step checkpoint for another 150k steps, randomly dropping 10% of the text, for classifier-free guidance training, taking about 4 days. The teacher model of AltCLIP is OpenCLIP ViT-H-14 (version "laion2b_s32b_b79k"). The pretrained Stable Diffusion checkpoint we used is SD v2.1 512-base-ema. We also used xFormers and efficient attention to save memory and speed up training. The EMA decay is 0.9999.
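The "randomly drop 10% of the text" step for classifier-free guidance amounts to replacing a caption with the empty string with probability 0.1, so the model also learns the unconditional distribution. A minimal sketch (function name and signature are illustrative, not from the training code):

```python
import random

def drop_text(caption, p_drop=0.1, rng=None):
    """Return the caption, or "" with probability p_drop (CFG-style dropout)."""
    rng = rng or random
    return "" if rng.random() < p_drop else caption

# Over many samples, roughly 10% of captions come back empty.
rng = random.Random(0)
captions = ["a corgi dog"] * 10_000
dropped = sum(drop_text(c, 0.1, rng) == "" for c in captions)
rate = dropped / len(captions)
```

At inference time, the model is then run on both the real prompt and the empty prompt, and the two predictions are combined with a guidance scale.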
33 |
+
## 效果展示
|
34 |
+
|
35 |
+
### 18语言效果
|
36 |
+
![boy](/imgs/boy.SVG)
|
37 |
+
|
38 |
+
![corgi_dog](/imgs/corgi_dog.SVG)
|
39 |
+
|
40 |
+
### 中文效果
|
41 |
+
|
42 |
+
<img src="/imgs/chinese_samples.png" alt="chinese_samples" style="zoom:85%;" />
|
43 |
+
|
44 |
+
### 长图效果
|
45 |
+
![long1](/imgs/long1.SVG)
|
46 |
+
|
47 |
+
![long2](/imgs/long2.SVG)
|