From scratch, or not?
I can't seem to find a clear answer in the Hugging Face model cards, etc:
Are these models created from scratch, just using the sdxl architecture?
Or are they trained on top of sdxl base?
I'm thinking from scratch, but I need an explicit statement of that please?
I know it's been a while, but here's the paper it's based on. Yes, it is from scratch in terms of any actual visual information used; no, it is not in terms of derived technologies, such as the machine vision used for captioning. https://arxiv.org/pdf/2310.16825
thanks for the reply.... not understanding how the words match up to my question.
wading through the paper, they say that they use "the sdxl unet".
it is unclear whether that means "they used just the ARCHITECTURE, but trained the model from scratch", or that they used the pretrained weights from
https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/unet/diffusion_pytorch_model.safetensors
The UNet model's weights are trained from scratch.