In-between model?

#2
by paul-stamets - opened

First off, I just want to say - this is amazing! It really blew my mind. This took 1024x1024 renders on my M2 with SSD-1B LCM from 17.0s to 9.5s (compared to your sdxl-vae-fp16-fix, which already sped things up a bit from the default), which is just amazing. And for some things - namely when there's no human - it doesn't even look worse, just different.
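
For context, the only difference between the two timings is which VAE the pipeline decodes with — roughly like this in diffusers (a sketch of the swap rather than my exact script; the model ID and device here are just illustrative of my setup):

```python
import torch
from diffusers import AutoencoderTiny, StableDiffusionXLPipeline

# SSD-1B is an SDXL-architecture model, so the SDXL pipeline + TAESDXL decoder apply.
# (LCM-LoRA / scheduler setup omitted for brevity.)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "segmind/SSD-1B", torch_dtype=torch.float16
).to("mps")

# The swap being timed: decode with TAESDXL instead of the full SDXL VAE.
pipe.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesdxl", torch_dtype=torch.float16
).to("mps")

image = pipe("an astronaut riding a horse", height=1024, width=1024).images[0]
```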

But unfortunately it's not really usable for some prompts, namely when there are humans in the image. I don't know anything about VAE architecture, but in my naiveté, I was wondering if there might be a compromise between these two VAEs - one that maybe takes ~30% of the time of sdxl-vae-fp16-fix (compared to this model's ~5% of the time), while producing images that much more closely resemble those of the default VAE.

Do you know if something like this exists? If it doesn't, is it something you've thought about creating? I think it would be a huge boon to the community to have a VAE that was somewhere between this and the default in terms of quality and speed, something that could be a real "go-to" VAE for all real (not just previewing) generations.

I also found this post, under which the first commenter argues that there are some improvements that could be made to the default SDXL VAE, some of which I'm guessing you may have actually implemented in TAE. Perhaps they could help maintain the same level of quality with less inference time (again, this is all over my head, just wanted to share): https://news.ycombinator.com/item?id=39215242

Thanks again for the great work here!

wondering if there might be a compromise between these two VAEs

Yeah, there's a potential range of different models you could train with different quality / speed tradeoffs! It's also likely we could get better quality at TAESD speeds just by investing more developer time / compute & modifying the architecture more. Anything which is substantially better than TAESD in quality (at identical speed) or substantially better than SDVAE in speed (at identical quality) would probably get community adoption (as would anything which is better in both axes).

Do you know if something like this exists? If it doesn't, is it something you've thought about creating?

I believe @IDKiro trained a better tiny VAE for SDXS-1024 (but I don't think they've released it). I've considered training TAESD++ models but so far I've been too lazy :)

I also found this post, under which the first commenter argues that there are some improvements that could be made to the default SDXL VAE

I did fix 1 (eliminating the magic number) in TAESD/TAESDXL, and I'm also not using a vanilla PatchGAN these days (addressing 3). The ideas in 2 and 4 are about changing the encoder, which you can't do to SDXL-VAE without breaking compatibility with existing models - those ideas are reasonable, but they also have tradeoffs that the author doesn't mention (more channels means less compression and makes the latents harder to generate; applying more regularization to the latent space necessarily reduces your compression ratio).
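
For concreteness, the "magic number" is the fixed latent-scaling constant baked into the SD/SDXL VAE config; TAESD/TAESDXL fold it into the latent space so the factor is just 1.0. A quick sketch of the difference using the diffusers configs:

```python
from diffusers import AutoencoderKL, AutoencoderTiny

sdxl_vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix")
taesdxl = AutoencoderTiny.from_pretrained("madebyollin/taesdxl")

# SDXL-VAE latents only reach ~unit scale after multiplying by an ad-hoc constant...
print(sdxl_vae.config.scaling_factor)  # 0.13025 (the "magic number")
# ...while TAESDXL's latent space is pre-scaled, so no constant is needed.
print(taesdxl.config.scaling_factor)   # 1.0
```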

We did develop a Tiny VAE for internal use, but we haven't released it at this time because of some design coupling to the hardware. Relative to TAESD, I think two modifications may be key:

  1. we replaced the upsampling layer of the decoder with a bilinear function, which we found reduces the probability of artifacts appearing (sketched just below);
  2. we use ProjectedGAN to train the decoder, which can improve training stability.

The code for the ProjectedGAN we used will be released, maybe within a week if it goes well. In fact, I'm very interested in increasing the number of channels of the latent code, since that's what SD3 does.
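
For reference, the upsampling change in point 1 is roughly a one-line swap in a TAESD-style decoder (a sketch, not the actual SDXS code):

```python
import torch.nn as nn

# TAESD's decoder upsamples with nn.Upsample(scale_factor=2), which defaults to
# nearest-neighbor interpolation; the change in point 1 swaps in bilinear instead.
upsample_taesd    = nn.Upsample(scale_factor=2, mode="nearest")
upsample_bilinear = nn.Upsample(scale_factor=2, mode="bilinear")
```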

Just curious, what was the approximate training time/compute cost for TAESDXL? And what was the general training process?

@paul-stamets No idea about time (rather than training one checkpoint from scratch, I've typically run a bunch of short training/finetuning runs with various parameters and released the best-looking candidate). Since everything I'm training is on 1xA10 or weaker, though, the cost is probably <$250 to replicate even if you were to train entirely from scratch. I'm using my own (messy, private) training code, but I've posted some notes about the process here https://github.com/madebyollin/taesd/issues/11#issuecomment-1914990359 as well as links to better-tested VAE training codebases like https://github.com/mosaicml/diffusion/pull/79 which should work out of the box.
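
To make "the general training process" slightly more concrete, the core decoder objective in those codebases boils down to something like the sketch below: reconstruct images from the frozen full-size VAE's latents, with perceptual/adversarial losses layered on top in practice. This is an illustrative outline only (not the actual training code); the model IDs, device, and hyperparameters are placeholders, and a matching tiny encoder would be distilled against the teacher's encoder separately.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, AutoencoderTiny

device = "cuda"

# Frozen full-size VAE defines the latent space the tiny decoder has to invert.
teacher = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix").to(device).eval().requires_grad_(False)
# Student decoder starts from the TAESDXL architecture (random init from its config).
student = AutoencoderTiny.from_config(AutoencoderTiny.load_config("madebyollin/taesdxl")).to(device)
opt = torch.optim.Adam(student.decoder.parameters(), lr=3e-4)

def training_step(images: torch.Tensor) -> torch.Tensor:
    # images: (N, 3, H, W) in [-1, 1]
    with torch.no_grad():
        # Latents scaled the way the UNet (and the tiny decoder) expects them.
        latents = teacher.encode(images).latent_dist.sample() * teacher.config.scaling_factor
    recon = student.decode(latents).sample
    loss = F.mse_loss(recon, images)  # real runs add perceptual + adversarial terms on top
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```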

@paul-stamets I posted a basic training notebook this weekend https://github.com/madebyollin/seraena if you want to try running it on your own dataset (haven't thoroughly tested it though, so YMMV 😅).
