how to fine-tune the "refiner" model, or train a LoRA for the "refiner"?

#18 opened by daxijiu

As the title says: I can't find anywhere how to train the refiner model. Maybe it should be trained in conjunction with the base model, and there are probably some tricks that need to be shared.

I am also interested in this. I can see from the documentation linked below that the SDXL Img2Img pipeline inherits the load_lora_weights method:
https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline

However, when I take LoRA weights that I got from DreamBooth LoRA training on the base SDXL model and try to load them with load_lora_weights on the SDXL Img2Img pipeline, I get an error.
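For reference, this is roughly what I'm doing (the model ID is the SDXL 1.0 refiner release; the LoRA path is a placeholder, and the weights were trained against the base UNet, which seems to be why the refiner rejects them):

```python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline

# Load the refiner as an Img2Img pipeline.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# This is the call that errors out: the LoRA was trained against the *base*
# SDXL UNet, whose module names/shapes don't line up with the refiner UNet.
refiner.load_lora_weights("path/to/sdxl_base_dreambooth_lora")  # placeholder path
```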

Additionally, if I train the text encoders when creating a LoRA for base SDXL and then pass those text encoders to the SDXL Img2Img pipeline when creating it, should I expect that to have an effect?

Third nod here. If anyone has some information on refiner training, please let us know. There is very little working information as it stands, as people attempt to use old v2.1 methods on XL that simply do not work. Thanks!

If you look into the SDXL training script, enabling text encoder training trains both text encoders.

The refiner shares the second text encoder with the base model, so once you instantiate a pipeline for the base model, you can reference that same text encoder when instantiating the refiner pipeline.
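As a minimal sketch (model IDs from the SDXL 1.0 release, otherwise following the standard diffusers pattern of passing components between pipelines), reusing the base model's second text encoder and VAE looks roughly like this:

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Base pipeline: if you trained a LoRA / fine-tuned the text encoders, load them here first.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Refiner pipeline, reusing the base model's (possibly fine-tuned) second text
# encoder and VAE instead of loading fresh copies.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
```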

Does this have an effect? Not a great one. I have a LoRA of myself, and after passing the trained encoder to the refiner I feel like it still removes my identity, because the refiner's UNet does not know me, but maybe slightly less so?

I have not seen any additional information about fine-tuning the refiner yet either.

I've been fooling around and I think we need proper support for the refiners.

The info from the toolkit states the model components are: UNET-XL-Refiner, VAE-v1-SD, CLIP-XL-Refiner.

I tried merging and blending but failed horribly. There is a mismatch error: "The size of tensor a (384) must match the size of tensor b (320) at non-singleton dimension 0".

So unless someone figures out how to train them directly, the older stuff cannot be matched with the newer (probably because it was trained at 512, not the 1024 base).

Agreed. Only the UNet should need to be fine-tuned, and the trained text encoders can be reused from the base model.

That being said, the refiner is made to be used in an ensemble of expert denoisers, where it receives an already partially denoised image and completes only the last 20% or so of the denoising.

I would imagine this would have to be replicated during training as well to get the best results. It sounds complicated, which is probably the reason no one has a method to fine-tune the refiner just yet.
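For context, this is what the ensemble-of-expert-denoisers usage looks like at inference time in diffusers (the 0.8 split reflects the commonly cited ~80/20 division; a training recipe would presumably need to mirror that split):

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

prompt = "a portrait photo of an astronaut"

# The base model handles the first ~80% of denoising and hands over latents...
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images

# ...and the refiner expert completes the remaining ~20%.
image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]
image.save("astronaut.png")
```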

The refiner model modules look quite similar to the base model's. Would tweaking the text_to_image_lora_sdxl.py script by removing the first text encoder module and restricting the scheduler to only 200 timesteps do something in this direction?
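As a rough sketch of the timestep part only (assumptions: the refiner specialises in the low-noise end of the schedule, and 200 out of 1000 steps is just an illustrative cutoff, not a confirmed number), the training loop's random timestep sampling could be restricted like this:

```python
import torch

num_train_timesteps = 1000   # full schedule length used by the base training script
refiner_cutoff = 200         # hypothetical: the low-noise slice the refiner would cover
batch_size = 4

# Standard SDXL LoRA training samples timesteps uniformly over the whole schedule:
#   timesteps = torch.randint(0, num_train_timesteps, (batch_size,))
# A refiner-focused run might instead sample only the final (low-noise) portion:
timesteps = torch.randint(0, refiner_cutoff, (batch_size,)).long()
print(timesteps)  # e.g. tensor([ 37, 181,  92,   5])
```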
