Choice of positional encodings?
#17 · by hankcs
Hi TII researchers,
Thank you for sharing your great work. After reading the RW paper, I have a question about your choice of positional encodings. In the paper, ALiBi was used to train RW-1B/7B, yet the final models (Falcon-7B/40B) use rotary embeddings. According to the ALiBi and XPos papers, ALiBi outperforms rotary by a wide margin, especially when applied to contexts longer than the pretraining limit. Could you explain why you moved from ALiBi to rotary for the final large-scale pretraining?
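For context, here is a minimal, self-contained sketch (not TII's implementation) of the ALiBi bias I'm referring to; the geometric slope schedule follows the ALiBi paper and assumes the head count is a power of two:

```python
# Minimal sketch (not TII's code) of the ALiBi bias: each head adds a fixed linear
# penalty proportional to the query-key distance, so the bias extends naturally to
# sequence lengths never seen during pretraining.
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric slope schedule from the ALiBi paper; assumes num_heads is a power of two.
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # (heads, query, key) additive bias: 0 on the diagonal, increasingly negative with
    # distance; future keys get zero bias here and are handled by the usual causal mask.
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)  # j - i, clipped at 0
    return alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]

# Usage: scores = (q @ k.transpose(-2, -1)) * head_dim ** -0.5 + alibi_bias(n_heads, seq_len)
```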
Hey @hankcs,
This is another one of those somewhat arbitrary decisions :).
- We experimented with both rotary and ALiBi early on in the project (that's why the RW models use ALiBi!);
- We found rotary to consistently improve downstream zero-shot performance + autoregressive loss, but it was a bit slower than ALiBi;
- However, with a custom Triton kernel for FlashAttention+Rotary, we were able to close much of that gap;
- We also ran extrapolation experiments, and with a sliding window we found rotary and ALiBi to perform similarly (see the sketch after this list) -- however, we did not have good long-range dependency tasks, so we suspect more work is needed here;
- We experimented with finetuning to longer sequence lengths with Rotary and it worked fine.
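To make the bullets above concrete, here is a generic PyTorch sketch (not the Falcon training code or the custom Triton kernel) of how rotary embeddings are applied to queries/keys and how a sliding-window causal mask restricts attention for extrapolation tests; the `rotate_half` convention and the `window` parameter are illustrative assumptions:

```python
# Generic sketch of two ideas from the reply above:
# (1) rotary embeddings rotate query/key pairs by a position-dependent angle,
# (2) a sliding-window causal mask limits each token to the last `window` keys,
#     which is one way to probe extrapolation past the pretraining length.
import torch

def rotary_angles(seq_len: int, head_dim: int, base: float = 10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)            # (seq, head_dim / 2)
    angles = torch.cat([freqs, freqs], dim=-1)  # (seq, head_dim)
    return angles.cos(), angles.sin()

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary(q, k, cos, sin):
    # q, k: (batch, heads, seq, head_dim); cos/sin broadcast over batch and heads.
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: causal and within the last `window` positions.
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)
```

For an extrapolation run, one would build `cos`/`sin` for the longer sequence, apply them to `q` and `k`, and mask the attention scores with `scores.masked_fill(~mask, float("-inf"))` before the softmax; the window size is just an illustrative knob, not a value from the paper.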
Stay tuned: we are actually interested in sharing more thoughts and principled experiments on positional embeddings in the future.
FalconLLM changed discussion status to closed
Any updates on this?