Weird number of attention heads in the VAE
#201
by dropout05 - opened
Hi! Today I noticed that two of the attention layers in the VAE have only one head, with a head size of 512.
This is a really non-standard hyperparameter choice, and it is also incompatible with FlashAttention.
Specifically, I'm talking about vae.encoder.mid_block.attentions.0 and vae.encoder.mid_block.attentions.1.
It really smells like a bug, and I want to understand whether it is consistent with the training code.
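For reference, here is a quick way one could verify the head counts directly (a minimal sketch assuming the diffusers AutoencoderKL API; the checkpoint id below is just an example, and the attribute holding the head count has changed across diffusers versions, so both names are checked):

```python
from diffusers import AutoencoderKL

# Example checkpoint; any Stable Diffusion VAE weights should work here.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# Walk all submodules and report any that expose a head count.
for name, module in vae.named_modules():
    heads = getattr(module, "heads", getattr(module, "num_heads", None))
    if heads is not None:
        print(f"{name} ({module.__class__.__name__}): heads = {heads}")
```

On my understanding, this should show the mid-block attention layers reporting a single head over 512 channels, which is what prompted the question.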
+1, I cannot find any explanation of why only 1 head is used... should we at least try 4-8 heads?
The code is really clean though, so I cannot imagine it is a mistake.