Weird number of attention heads in the VAE
#201
by dropout05 - opened
Hi! Today I noticed that two of the attention layers in the VAE have only one head, with a head size of 512.
This is a really non-standard hyperparameter choice, and it is also incompatible with FlashAttention.
Specifically, I'm talking about vae.encoder.mid_block.attentions.0 and vae.encoder.mid_block.attentions.1.
It really smells like a bug, and I want to understand whether it is consistent with the training code.
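For reference, here is a quick way one could verify the head counts directly (a minimal sketch assuming the diffusers AutoencoderKL API; the checkpoint id below is just an example, and the attribute holding the head count has changed across diffusers versions, so both names are checked):

```python
from diffusers import AutoencoderKL

# Example checkpoint; any Stable Diffusion VAE weights should work here.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# Walk all submodules and report any that expose a head count.
for name, module in vae.named_modules():
    heads = getattr(module, "heads", getattr(module, "num_heads", None))
    if heads is not None:
        print(f"{name} ({module.__class__.__name__}): heads = {heads}")
```

On my understanding, this should show the mid-block attention layers reporting a single head over 512 channels, which is what prompted the question.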
+1, I cannot find any explanation of why only 1 head is used... should we at least try 4-8 heads?
The code is really clean though, so I cannot imagine it is a mistake.