Why put MistralRotaryEmbedding in each attention layer instead of applying it only once before the first attention layer?

#91
by liougehooa - opened

I noticed that Mistral and some other LLMs apply the positional embedding inside every attention layer (every Transformer block). In the original Transformer, the positional encoding was added only once, at the input before the first attention layer (once for the encoder and once for the decoder).
Why is applying it in every layer better?
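
To make the difference concrete, here is a minimal sketch of what I mean by "positional embedding in each attention layer". This is not the actual Mistral/Hugging Face code; the function names, shapes, and the single-head-loop-free layout are illustrative. The point is that the rotary rotation is applied to the query and key projections right before the attention scores, inside every block, rather than adding a positional vector to the token embeddings once at the input.

```python
import torch

def rotary_embedding(x, base=10000.0):
    # x: (seq_len, num_heads, head_dim). Rotate pairs of channels by a
    # position-dependent angle so that relative offsets appear in q·k scores.
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def attention_with_rope(x, w_q, w_k, w_v, num_heads):
    # RoPE is applied to q and k *inside* the layer, just before computing
    # attention scores; the residual stream x carries no positional signal.
    seq_len, d_model = x.shape
    head_dim = d_model // num_heads
    q = (x @ w_q).view(seq_len, num_heads, head_dim)
    k = (x @ w_k).view(seq_len, num_heads, head_dim)
    v = (x @ w_v).view(seq_len, num_heads, head_dim)
    q, k = rotary_embedding(q), rotary_embedding(k)
    scores = torch.einsum("qhd,khd->hqk", q, k) / head_dim ** 0.5
    out = torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)
    return out.reshape(seq_len, d_model)
```

By contrast, the original Transformer would do something like `x = token_embeddings + positional_encoding` once, and every subsequent layer would just consume `x` with no further positional injection.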
