Why distill to SSD-1B rather than SD1.5/2.1?

#10 opened by eeyrw

I notice that SSD-1B's UNet has a parameter count close to that of SD1.5/2.1. So why use a UNet truncated from SDXL rather than reuse SD1.5/2.1? Maybe for the sake of the two text encoders that SDXL uses?
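For context, a rough sketch of how one might compare the UNet sizes (the Hub repo ids below are the commonly used ones and may have moved; exact counts depend on the checkpoint):

```python
# Rough sketch: compare UNet parameter counts across the models in question.
from diffusers import UNet2DConditionModel

repos = [
    "runwayml/stable-diffusion-v1-5",    # SD1.5 (assumed Hub id)
    "stabilityai/stable-diffusion-2-1",  # SD2.1
    "segmind/SSD-1B",                    # SSD-1B
]
for repo in repos:
    unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
    n_params = sum(p.numel() for p in unet.parameters())
    print(f"{repo}: {n_params / 1e9:.2f}B UNet parameters")
```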

Segmind org

Yes: to benefit from SDXL's two text encoders, and to retain its native 1024x1024 generation capability.
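For anyone wondering what that means in practice, here is a minimal sketch using diffusers (assuming a CUDA device; the prompt and output path are illustrative). SSD-1B loads through the SDXL pipeline and carries both of SDXL's text encoders:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# SSD-1B keeps SDXL's architecture, so it loads via the SDXL pipeline.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "segmind/SSD-1B", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

# Both SDXL text encoders ride along: CLIP ViT-L plus OpenCLIP ViT-bigG.
print(type(pipe.text_encoder).__name__, type(pipe.text_encoder_2).__name__)

# Native 1024x1024 generation, same call signature as SDXL.
image = pipe("an astronaut riding a horse", height=1024, width=1024).images[0]
image.save("ssd1b_sample.png")
```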

So SSD-1B retained the original weights of the layers that were not cut before distilling, and that is where the 1024^2 generation capability comes from? I have tried to fine-tune SD1.5 on 1024^2 datasets but never got ideal results, so I am quite curious why SDXL can generate 1024^2 images: is it just because SDXL has more parameters, or because of the low-resolution pretraining that SD1.5 went through?

SDXL was trained on several million images, over a few million steps, at 768x768 and 1024x1024 resolutions. The 768x768 pretraining likely provided a base, and the 1024x1024 stage polished its capabilities in that regard. It might be that your data or steps weren't enough. The UNet architecture does not vary significantly between the two; the text encoders and the embedding styles, though, are quite different.
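To illustrate that last point: since the SD1.5 UNet is fully convolutional, nothing architectural stops it from sampling at 1024x1024; the degradation (duplicated subjects, drifting composition) comes from the training resolution, not the design. A quick sketch, assuming the same Hub id as above:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# This runs fine mechanically (latents are 128x128 instead of 64x64),
# but expect artifacts: the model was pretrained at 512x512, so a
# 1024^2 fine-tune has to overcome that prior, not just add to it.
image = pipe("a lighthouse at dusk", height=1024, width=1024).images[0]
image.save("sd15_at_1024.png")
```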
