Activation function
#56 · opened by aboros98
Hello!
Should the model use exact GELU or approximate GELU as the gating function in the MLP?
I am asking because the PyTorch and HF implementations use exact GELU, while Keras and JAX use the approximate GELU.
Thanks!
Hi @aboros98,
It should be the approximate GELU, I think. See: https://twitter.com/danielhanchen/status/1763613620909580505
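For reference, here is a minimal PyTorch sketch of the two variants being compared; the `approximate` argument of `torch.nn.functional.gelu` selects between the exact form and the tanh approximation. (In HF Transformers the choice is typically driven by the config's `hidden_act` field, e.g. `"gelu"` vs `"gelu_pytorch_tanh"`, if I recall correctly.)

```python
import torch
import torch.nn.functional as F

x = torch.randn(4)

# Exact GELU: x * Phi(x), where Phi is the Gaussian CDF (PyTorch default).
exact = F.gelu(x, approximate="none")

# Tanh-approximate GELU, the variant the reply above points to.
approx = F.gelu(x, approximate="tanh")

# The two differ by a small amount; whether it matters depends on which
# variant the checkpoint was trained with.
print((exact - approx).abs().max())
```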