Why distill to SSD-1B rather than SD1.5/2.1?

#10 opened by eeyrw

I notice that SSD-1B's UNet has a parameter count close to that of SD1.5/2.1. So why use a UNet truncated from SDXL rather than reuse SD1.5/2.1? Maybe for the sake of the two text encoders that SDXL uses?
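For context, a rough sketch of how one might compare the UNet sizes (the Hub repo ids below are the commonly used ones and may have moved; exact counts depend on the checkpoint):

```python
# Rough sketch: compare UNet parameter counts across the models in question.
from diffusers import UNet2DConditionModel

repos = [
    "runwayml/stable-diffusion-v1-5",    # SD1.5 (assumed Hub id)
    "stabilityai/stable-diffusion-2-1",  # SD2.1
    "segmind/SSD-1B",                    # SSD-1B
]
for repo in repos:
    unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
    n_params = sum(p.numel() for p in unet.parameters())
    print(f"{repo}: {n_params / 1e9:.2f}B UNet parameters")
```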

Segmind org

Yes: to benefit from SDXL's two text encoders, and to retain its native 1024x1024 generation capability.
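For anyone wondering what that means in practice, here is a minimal sketch using diffusers (assuming a CUDA device; the prompt and output path are illustrative). SSD-1B loads through the SDXL pipeline and carries both of SDXL's text encoders:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# SSD-1B keeps SDXL's architecture, so it loads via the SDXL pipeline.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "segmind/SSD-1B", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

# Both SDXL text encoders ride along: CLIP ViT-L plus OpenCLIP ViT-bigG.
print(type(pipe.text_encoder).__name__, type(pipe.text_encoder_2).__name__)

# Native 1024x1024 generation, same call signature as SDXL.
image = pipe("an astronaut riding a horse", height=1024, width=1024).images[0]
image.save("ssd1b_sample.png")
```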

So SSD-1B retained the original weights of the layers that were not cut before distilling, and that is where the 1024^2 generation capability comes from? I have tried to fine-tune SD1.5 on 1024^2 datasets but never got ideal results, so I am quite curious why SDXL can generate 1024^2 images: is it just because SDXL has more parameters, or because of the low-resolution pretraining that SD1.5 went through?

SDXL was trained on several million images, over a few million steps, at 768x768 and 1024x1024 resolutions. The 768x768 pretraining likely provided a base, and the 1024x1024 stage polished its capabilities in that regard. It might be that your data or steps weren't enough. The UNet architecture does not vary significantly between the two; the text encoders and the embedding styles, though, are quite different.
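To illustrate that last point: since the SD1.5 UNet is fully convolutional, nothing architectural stops it from sampling at 1024x1024; the degradation (duplicated subjects, drifting composition) comes from the training resolution, not the design. A quick sketch, assuming the same Hub id as above:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# This runs fine mechanically (latents are 128x128 instead of 64x64),
# but expect artifacts: the model was pretrained at 512x512, so a
# 1024^2 fine-tune has to overcome that prior, not just add to it.
image = pipe("a lighthouse at dusk", height=1024, width=1024).images[0]
image.save("sd15_at_1024.png")
```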
