Weird number of attention heads in the VAE

#201
by dropout05 - opened

Hi! Today I noticed that two of the attention layers in the VAE have only one head, with a head size of 512.
This is a really non-standard hyperparameter choice, and it is also incompatible with Flash Attention.

Specifically, I'm talking about vae.encoder.mid_block.attentions.0 and vae.encoder.mid_block.attentions.1
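
For reference, here is a minimal sketch of how to inspect these modules (assuming the diffusers `AutoencoderKL` API and the `CompVis/stable-diffusion-v1-4` checkpoint as an example; the attribute holding the head count differs between diffusers versions, so a couple of names are probed):

```python
# Minimal sketch: print the mid-block attention modules of the VAE and their
# head counts. The checkpoint name is just an example; the attribute holding
# the head count ("heads" vs. "num_heads") depends on the diffusers version.
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="vae"
)

for name, module in vae.named_modules():
    # Match only the attention modules themselves (e.g. "...attentions.0"),
    # not their submodules.
    if "mid_block.attentions" in name and name.rsplit(".", 1)[-1].isdigit():
        heads = getattr(module, "heads", None) or getattr(module, "num_heads", None)
        print(f"{name}: {type(module).__name__}, heads={heads}")
```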

It really smells like a bug, and I want to understand whether it is consistent with the training code.

+1. I cannot find any explanation of why only 1 head is used... Should we at least try 4-8 heads?

The code is really clean, though, so I cannot imagine it is a mistake.
