Why are there "dense_h_to_4h" and "dense_4h_to_h" layers without any activation layer in between?
Mathematically, stacking two dense layers with no activation in between is equivalent to a single dense layer with fewer parameters (the composition of two affine maps is just another affine map). So I am surprised to see the linear layers "dense_h_to_4h" and "dense_4h_to_h" directly following each other in the model (at least, that's what PyTorch shows).
What am I missing? Is it equivalent during inference but not during training, or something like that?
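To illustrate the point: here is a tiny pure-Python sketch (matrices and values are made up for illustration) showing that two stacked affine layers with no nonlinearity collapse into one affine layer with weights `W2 @ W1` and bias `W2 @ b1 + b2`:

```python
# Two dense layers with no activation in between collapse into one
# affine map. Tiny 2x2 example with made-up weights.

def affine(W, b, x):
    """y[i] = sum_j W[i][j] * x[j] + b[i]"""
    return [sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(W))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1, b1 = [[1.0, 2.0], [3.0, 4.0]], [0.5, -0.5]
W2, b2 = [[0.0, 1.0], [1.0, 1.0]], [1.0, 2.0]
x = [2.0, -1.0]

# Two stacked dense layers, no activation in between
y_stacked = affine(W2, b2, affine(W1, b1, x))

# Equivalent single dense layer: W = W2 @ W1, b = W2 @ b1 + b2
W = matmul(W2, W1)
b = affine(W2, b2, b1)  # computes W2 @ b1 + b2
y_single = affine(W, b, x)

assert y_stacked == y_single  # identical outputs
```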
Thanks!
Hi @Tombriss
Thanks for your message!
In the current implementation we do have a custom GeLU function between those two layers; please check this line: https://github.com/huggingface/transformers/blob/ee67e7ad4fd7a766891b68f708cf03e30f609976/src/transformers/models/bloom/modeling_bloom.py#L360
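In other words, the MLP block applies a nonlinearity between the two projections, so the composition no longer collapses into a single affine map. A minimal sketch (not the actual `transformers` code; dimensions and function names here are illustrative, and the GeLU below is the common tanh approximation rather than BLOOM's exact custom variant):

```python
import math

def gelu(x):
    # tanh approximation of GeLU; BLOOM uses its own custom GeLU variant
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def dense(W, b, x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(W))]

def mlp(x, W_h_to_4h, b1, W_4h_to_h, b2):
    # dense_h_to_4h -> GeLU -> dense_4h_to_h
    h = dense(W_h_to_4h, b1, x)
    h = [gelu(v) for v in h]  # nonlinearity between the two dense layers
    return dense(W_4h_to_h, b2, h)
```

Because `gelu` is nonlinear, the two dense layers cannot be merged into one.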
No worries! I agree this is a bit confusing for users. I have made a Pull Request here: https://github.com/huggingface/transformers/pull/18312 that should solve this problem.