Update README.md #1
by rmyj · opened

README.md CHANGED
```diff
@@ -156,8 +156,7 @@ Stable Diffusion v1 is a latent diffusion model which combines an autoencoder wi
 The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
 The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
 
-We currently provide
-which were trained as follows,
+We currently provide one checkpoint, `sd-v1-5-inpainting.ckpt`, and mention `sd-v1-1.ckpt`, `sd-v1-2.ckpt`, `sd-v1-3.ckpt`, `sd-v1-4.ckpt` and `sd-v1-5.ckpt`, which were trained as follows,
 
 - `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
   194k steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
@@ -165,9 +164,9 @@ which were trained as follows,
   515k steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
   filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
 - `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-improved-aesthetics" and 10\% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
-- `sd-v1-4.ckpt`: Resumed from
-- `sd-v1-5.ckpt`: Resumed from sd-v1-2.ckpt
-- `sd-v1-5-
+- `sd-v1-4.ckpt`: Resumed from `sd-v1-2.ckpt`. 225k steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
+- `sd-v1-5.ckpt`: Resumed from `sd-v1-2.ckpt`. 595k steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
+- `sd-v1-5-inpainting.ckpt`: Resumed from `sd-v1-5.ckpt`. 440k steps of inpainting training at resolution `512x512` on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and in 25% of cases mask everything.
 
 
 - **Hardware:** 32 x 8 x A100 GPUs
```
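The inpainting checkpoint described above extends the UNet's first convolution from 4 to 9 input channels and zero-initializes the new weights, so the restored model initially behaves exactly like the non-inpainting checkpoint. A minimal PyTorch sketch of that idea (the helper name `expand_conv_in` and the 320-channel width are illustrative assumptions, not the repository's actual code):

```python
import torch
import torch.nn as nn

def expand_conv_in(old_conv: nn.Conv2d, extra_channels: int = 5) -> nn.Conv2d:
    """Widen a conv layer's input: reuse pretrained weights for the original
    channels, zero-initialize the extra ones (4 for the encoded masked image,
    1 for the mask), so the new inputs contribute nothing at first."""
    new_conv = nn.Conv2d(
        old_conv.in_channels + extra_channels,
        old_conv.out_channels,
        kernel_size=old_conv.kernel_size,
        padding=old_conv.padding,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, : old_conv.in_channels] = old_conv.weight
        new_conv.bias.copy_(old_conv.bias)
    return new_conv

# 4 latent channels -> 4 + 4 (encoded masked image) + 1 (mask) = 9
base = nn.Conv2d(4, 320, 3, padding=1)
conv = expand_conv_in(base)
```

Because the extra weights start at zero, feeding anything into the 5 new channels leaves the output unchanged until training updates them.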
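Several of the checkpoints above drop the text-conditioning 10% of the time to enable classifier-free guidance sampling. A hedged sketch of the two halves of that technique (function names are hypothetical, and scalars stand in for the real noise-prediction tensors):

```python
import random

def maybe_drop_conditioning(text_embedding, null_embedding, p_drop=0.1, rng=random):
    """Training side: with probability p_drop, train on the unconditional
    (null) embedding instead of the caption's embedding."""
    return null_embedding if rng.random() < p_drop else text_embedding

def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Sampling side: push the noise prediction away from the unconditional
    one by guidance_scale (scale 1.0 recovers the conditional prediction)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Training occasionally on the null embedding is what gives the sampler a usable unconditional prediction to extrapolate from at inference time.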