Why put MistralRotaryEmbedding in each attention layer instead of applying it only once before the first attention layer?

#91
by liougehooa - opened

I noticed that Mistral and some other LLMs apply the positional embedding inside every attention layer (every Transformer block). In the original Transformer, the positional encoding was added only once, at the input before the first attention layer (once for the encoder and once for the decoder).
Why is applying it in every layer better?
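
To make the difference concrete, here is a minimal sketch of what I mean by "positional embedding in each attention layer". This is not the actual Mistral/Hugging Face code; the function names, shapes, and the single-head-loop-free layout are illustrative. The point is that the rotary rotation is applied to the query and key projections right before the attention scores, inside every block, rather than adding a positional vector to the token embeddings once at the input.

```python
import torch

def rotary_embedding(x, base=10000.0):
    # x: (seq_len, num_heads, head_dim). Rotate pairs of channels by a
    # position-dependent angle so that relative offsets appear in q·k scores.
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def attention_with_rope(x, w_q, w_k, w_v, num_heads):
    # RoPE is applied to q and k *inside* the layer, just before computing
    # attention scores; the residual stream x carries no positional signal.
    seq_len, d_model = x.shape
    head_dim = d_model // num_heads
    q = (x @ w_q).view(seq_len, num_heads, head_dim)
    k = (x @ w_k).view(seq_len, num_heads, head_dim)
    v = (x @ w_v).view(seq_len, num_heads, head_dim)
    q, k = rotary_embedding(q), rotary_embedding(k)
    scores = torch.einsum("qhd,khd->hqk", q, k) / head_dim ** 0.5
    out = torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)
    return out.reshape(seq_len, d_model)
```

By contrast, the original Transformer would do something like `x = token_embeddings + positional_encoding` once, and every subsequent layer would just consume `x` with no further positional injection.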
