retrain vae further

#10
by darshat - opened

Hi,
Is it possible to further train this VAE? The scenario is to preserve text details in an image and then use it with SDXL. It would be great to retrain this VAE on images that contain text.
Thanks!

Somewhat possible :) but it's out of scope for this repo.

SDXL-VAE compresses every 8x8 patch of input (RGB) pixels into 1 (RGBA) pixel, so the SDXL-VAE latents can store at most 1/48th the information of the original image (which means the SDXL-VAE encoder always has to throw away information, and the SDXL-VAE decoder always has to make up new details).
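As a quick back-of-the-envelope check of that ratio (a sketch, not part of the repo):

```python
# SDXL-VAE maps each 8x8 patch of RGB pixels (8*8*3 = 192 scalars)
# onto a single 4-channel latent "pixel", so:
input_values_per_patch = 8 * 8 * 3   # 192 RGB scalars per patch
latent_values_per_patch = 4          # one 4-channel latent value
compression_ratio = input_values_per_patch / latent_values_per_patch
print(compression_ratio)  # 48.0

# e.g. a 1024x1024x3 image becomes a 128x128x4 latent
image_shape = (1024, 1024, 3)
latent_shape = (image_shape[0] // 8, image_shape[1] // 8, 4)
print(latent_shape)  # (128, 128, 4)
```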

Given how compressed the latents are, SDXL-VAE actually does an incredibly good job at recovering text. SDXL-VAE only struggles when the text is very small:

image.png

You can probably get better small-size letterforms by fine-tuning the VAE decoder on text images, like you described (I think SGM has code for VAE training)... but it won't fix the fundamental lack of information in the latents (small letterforms will still be "made up" during decoding), and it also won't improve the UNet / diffusion process (which is the reason that SDXL generates nonsense text even at very large font sizes).
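The decoder-only fine-tuning pattern looks roughly like the sketch below. This uses a tiny stand-in autoencoder rather than the real SDXL-VAE (which you'd load e.g. via diffusers' `AutoencoderKL.from_pretrained`), and plain MSE instead of the full LPIPS + adversarial losses that real VAE training recipes use:

```python
import torch
import torch.nn as nn

# Tiny stand-in autoencoder, just to illustrate the freeze-encoder pattern.
class ToyVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)        # 8x downsample
        self.decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

vae = ToyVAE()

# Freeze the encoder so the latent space stays compatible with SDXL's UNet;
# only the decoder learns to render better letterforms.
for p in vae.encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(vae.decoder.parameters(), lr=1e-5)

batch = torch.rand(2, 3, 64, 64)  # stand-in for a batch of text-heavy images
with torch.no_grad():
    latents = vae.encode(batch)
recon = vae.decode(latents)
loss = nn.functional.mse_loss(recon, batch)  # real recipes add LPIPS etc.
loss.backward()
opt.step()
```

The key design choice is freezing the encoder: if the encoder moved, the latents it produces would drift away from what the SDXL UNet was trained to generate.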

image.png

image.png

Thanks @madebyollin for the comprehensive reply! I will try out the sgm link

Hi @madebyollin , I'm able to use the sgm scripts to train the kl-f4 VAE. But the VAE config in the sgm repo differs from the config used here for sdxl-vae (the AutoencoderKL class is also different).

The sgm scripts are useful as they define the LPIPS and loss types. Can you share how you retrained: did you use the sgm repo scripts but somehow find a mapping to the sdxl-vae class and config?

I trained sdxl-vae-fp16-fix using the SGM AutoencoderKL model class (+ my own training notebook), and my trained SGM-compatible sdxl-vae-fp16-fix weights were converted to diffusers format post-hoc.

I haven't personally used Stability's VAE training code, but their SDXL inference config includes all the VAE settings. You can probably copy those VAE settings into the kl-f4 training config to get a kl-f8 config that works with the SGM-compatible sdxl-vae-fp16-fix weights.
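For reference, the first-stage (VAE) settings in Stability's SDXL inference config look roughly like the fragment below. These values are from my reading of the sgm repo's `sd_xl_base.yaml`, so double-check them against the repo before training; note `z_channels: 4` and the extra `ch_mult` entry are what make it kl-f8 rather than kl-f4:

```yaml
first_stage_config:
  target: sgm.models.autoencoder.AutoencoderKL
  params:
    embed_dim: 4
    ddconfig:
      double_z: true
      z_channels: 4
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [1, 2, 4, 4]   # one more downsampling stage than kl-f4
      num_res_blocks: 2
      attn_resolutions: []
      dropout: 0.0
```

For training you'd also need to swap the inference config's loss (`torch.nn.Identity`) for a real loss config, like the LPIPS-based one in the kl-f4 training config.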

Please create a repo for that sdxl-vae fine-tuning notebook! There is basically no guide anywhere on how to build a training loop for the SDXL VAE. Also, what dataset did you use for the fine-tuning? I'd also like to know how you managed to visualize the parameters of the VAE in a safetensors file. Did you use the state_dict? And how do you visualize a latent image without decoding it? Man, you are really awesome!

@MkJojo

> how you managed to visualize the parameters of the VAE in a safetensors file

Here's the script I've been using to compare my fp16 VAE weights to the original VAE weights.
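The core of that kind of comparison is just diffing two state_dicts key by key. A minimal stand-in sketch (not the actual script; real weights would be loaded with `safetensors.torch.load_file(path)`):

```python
import torch

def compare_state_dicts(a, b):
    """Return the max absolute difference for each tensor key shared by both dicts."""
    diffs = {}
    for key in sorted(a.keys() & b.keys()):
        if a[key].shape == b[key].shape:
            diffs[key] = (a[key].float() - b[key].float()).abs().max().item()
    return diffs

# toy example with made-up "weights"
sd_a = {"decoder.conv.weight": torch.zeros(3, 3)}
sd_b = {"decoder.conv.weight": torch.full((3, 3), 0.5)}
print(compare_state_dicts(sd_a, sd_b))  # {'decoder.conv.weight': 0.5}
```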

> how to visualize a latent image without decoding it

Here's the notebook I used for the comparison image above.
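One common trick is to skip the decoder entirely and project the 4 latent channels to RGB with a small fixed linear map. A sketch (the matrix values below are illustrative placeholders, not real fitted coefficients; a usable matrix is obtained by regressing decoded RGB against latents):

```python
import torch

# ILLUSTRATIVE 4->3 projection; real coefficients must be fitted to the VAE.
latent_to_rgb = torch.tensor([
    [ 0.3,  0.2,  0.2],   # weights for latent channel 0
    [ 0.2,  0.3, -0.1],   # channel 1
    [-0.1,  0.2,  0.3],   # channel 2
    [ 0.2, -0.2,  0.2],   # channel 3
])

def preview_latent(latents):
    """latents: (4, H, W) -> rough RGB preview (3, H, W) in [0, 1]."""
    rgb = torch.einsum("chw,cr->rhw", latents, latent_to_rgb)
    return ((rgb + 1) / 2).clamp(0, 1)

preview = preview_latent(torch.randn(4, 16, 16))
print(preview.shape)  # torch.Size([3, 16, 16])
```

This gives a blurry but recognizable preview at 1/8 resolution, which is handy for eyeballing latents mid-training without paying for a decode.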

I tried the sgm pointer for training but wasn't successful. I also posted to the sgm issues (https://github.com/Stability-AI/generative-models/issues/121) for help, but no reply so far. Would be very helpful to know how you trained it, @madebyollin.
