CityVAE for SDXL

The following is a proof-of-concept VAE for Stable Diffusion XL, made to see whether it is possible to fix some of the issues present in the v0.9 and v1.0 VAE files, specifically the "digital noise" that appears when upscaling latents. Further information can be found in this GitHub issue.

The current image output quality is considerably worse than that of the original VAE. Again, this was just to see if it was possible to "fix" one specific issue. Turns out, I don't have the hardware to fully train a VAE from scratch in a reasonable amount of time.

v0.1/v0.2

Unpublished test versions trained on a limited 500-image dataset to check validity before the full training run. Not uploaded anywhere.

v0.3

The training for v0.3 was done on a single RTX 3080 10GB GPU using the original Latent Diffusion reference implementation. The only modifications to the code were a custom dataloader, a few fixes for torch2/xformers, and a custom function that kept the encoder weights static (to ensure model compatibility).
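The actual function used for v0.3 isn't reproduced here. As a rough illustration of the idea, keeping the encoder static in PyTorch can be as simple as turning off gradients for the encoder parameters and only handing the rest to the optimizer. Everything in this sketch (the `TinyAutoencoder` class, the `freeze_encoder` helper, the `encoder.` name prefix) is a hypothetical stand-in and not code from the reference implementation. A real VAE checkpoint may have other encoder-side layers that would also need freezing.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyAutoencoder(nn.Module):
    """Toy stand-in for a VAE; module names are hypothetical."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.decoder = nn.Conv2d(4, 3, kernel_size=3, padding=1)

    def forward(self, x):
        return self.decoder(self.encoder(x))


def freeze_encoder(model):
    """Disable gradients for every parameter whose name starts with 'encoder.'."""
    frozen = []
    for name, param in model.named_parameters():
        if name.startswith("encoder."):
            param.requires_grad_(False)
            frozen.append(name)
    return frozen


model = TinyAutoencoder()
frozen_names = freeze_encoder(model)

# Snapshots so the effect of one training step can be checked afterwards.
encoder_before = model.encoder.weight.detach().clone()
decoder_before = model.decoder.weight.detach().clone()

# Only the still-trainable (decoder) parameters go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

# One reconstruction step on random data.
x = torch.randn(2, 3, 8, 8)
loss = (model(x) - x).pow(2).mean()
loss.backward()
optimizer.step()
```

After the step, the encoder weights are bit-for-bit identical to the snapshot while the decoder weights have moved, so latents produced by the original encoder stay valid for the retrained decoder.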

Training took more than 20 hours and was done on two separate datasets: 100 epochs on a filtered subset of the Wikimedia Foundation Image Dump (November 2005), followed by an extra 200 epochs on the Flickr2K dataset using transforms.FiveCrop.

(I have two VAE Encode/Upscale Latent nodes in the image below, but the encoder is the same for both, so the result would be identical even if you used v0.9 as the input on both. I just reused the workflow from the v1.5 test.)