Why are there "dense_h_to_4h" and "dense_4h_to_h" layers without any activation layer in between?

#64 · opened by Tombriss

Mathematically, two dense layers stacked back to back with no activation in between are equivalent to a single dense layer with fewer parameters (the composition of two affine maps is affine). So I am surprised to see the linear layers "dense_h_to_4h" and "dense_4h_to_h" follow each other directly in the model (at least, that is what the PyTorch module printout shows).
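
To illustrate what I mean, here is a minimal PyTorch sketch (with made-up sizes, not the actual BLOOM dimensions): two stacked `nn.Linear` layers with nothing in between can be folded into a single equivalent `nn.Linear`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
h, inner = 8, 32  # made-up sizes, not the real BLOOM dimensions

lin1 = nn.Linear(h, inner)   # stands in for "dense_h_to_4h"
lin2 = nn.Linear(inner, h)   # stands in for "dense_4h_to_h"

# Fold the two affine maps into one: W = W2 @ W1, b = W2 @ b1 + b2.
merged = nn.Linear(h, h)
with torch.no_grad():
    merged.weight.copy_(lin2.weight @ lin1.weight)
    merged.bias.copy_(lin2.weight @ lin1.bias + lin2.bias)

x = torch.randn(4, h)
print(torch.allclose(lin2(lin1(x)), merged(x), atol=1e-5))  # True
```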

What am I missing? Is it equivalent during inference but not during training, or something like that?

Thanks!

BigScience Workshop org

Hi @Tombriss

Thanks for your message!
In the current implementation we do have a custom GeLU function applied between those two layers; please check this line: https://github.com/huggingface/transformers/blob/ee67e7ad4fd7a766891b68f708cf03e30f609976/src/transformers/models/bloom/modeling_bloom.py#L360
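
Because the activation is applied as a function call inside `forward()` rather than registered as a child module, it does not appear when you print the model. A minimal sketch of the same pattern (hypothetical `ToyMLP` class with made-up sizes, not the actual `BloomMLP` code):

```python
import torch.nn as nn
import torch.nn.functional as F

class ToyMLP(nn.Module):
    def __init__(self, hidden_size=16):
        super().__init__()
        self.dense_h_to_4h = nn.Linear(hidden_size, 4 * hidden_size)
        self.dense_4h_to_h = nn.Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        # The activation is a plain function call, not a registered submodule,
        # so it is invisible to print(model).
        return self.dense_4h_to_h(F.gelu(self.dense_h_to_4h(x)))

print(ToyMLP())
# ToyMLP(
#   (dense_h_to_4h): Linear(in_features=16, out_features=64, bias=True)
#   (dense_4h_to_h): Linear(in_features=64, out_features=16, bias=True)
# )
```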

Hi @ybelkada, thanks a lot for your quick answer. Indeed, I should have inspected the code before asking, sorry about that :)
I was just puzzled by the PyTorch module printout:
[Screenshot: printed BloomMLP module tree showing dense_h_to_4h and dense_4h_to_h with no activation module in between]

It's now clear :)

BigScience Workshop org

No worries! I agree this is a bit confusing for users. I have opened a Pull Request here: https://github.com/huggingface/transformers/pull/18312 that should solve this problem.
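
Roughly, the idea is to register the activation as a child module so that it shows up between the two Linear layers in the printout. A minimal sketch of that pattern (using `nn.GELU` and a made-up class name; the actual PR may differ in the details):

```python
import torch.nn as nn

class ToyMLPWithVisibleGelu(nn.Module):  # hypothetical name, not the class from the PR
    def __init__(self, hidden_size=16):
        super().__init__()
        self.dense_h_to_4h = nn.Linear(hidden_size, 4 * hidden_size)
        self.gelu_impl = nn.GELU()  # registered as a child module, so print() shows it
        self.dense_4h_to_h = nn.Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        return self.dense_4h_to_h(self.gelu_impl(self.dense_h_to_4h(x)))

# print(ToyMLPWithVisibleGelu()) now lists (gelu_impl): GELU(...) between the two Linear layers.
```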

BigScience Workshop org

Closing, as @ybelkada's PR was merged :-)

cakiki changed discussion status to closed
