How does v0.2 manage to support a 32k token context without Sliding Window Attention?

#85
by Andriy

How does v2 manage to have a raw 32k context size without sliding window attention? Full attention has quadratic space complexity; a 32k x 32k attention matrix would require more memory than any GPU supports. Even with FlashAttention 2, 8k seems to be the limit for other models. How does it work?
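For context on the quadratic-space point, here is a rough back-of-envelope sketch (my own numbers, assuming fp16 scores) of what materializing a full 32k x 32k score matrix would take:

```python
# Back-of-envelope check of the 32k x 32k concern above: memory needed to
# materialize one full attention score matrix in fp16 (illustrative only).
seq_len = 32_768
bytes_per_elem = 2                                  # fp16
score_matrix = seq_len * seq_len * bytes_per_elem
print(f"{score_matrix / 2**30:.1f} GiB per head per layer")   # -> 2.0 GiB
# FlashAttention-style kernels compute attention in tiles and never
# materialize this matrix, so peak memory grows linearly with sequence
# length (compute is still quadratic).
```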

It uses RoPE scaling

@singhay This doesn't solve the quadratic space and time complexity of attention in transformers.

You're right, it does not. It's a trick that changes how the positions of a sequence are embedded. Instead of 1, 2, 3, ..., the positions are interpolated, e.g. 1, 1.25, 1.5, 1.75, 2 -> 4x more positions can be incorporated.
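A minimal sketch of that position-interpolation idea, assuming a standard RoPE formulation (function name, defaults, and the scale factor are illustrative, not the model's actual config):

```python
import torch

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    # Standard RoPE inverse frequencies, one per pair of head dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Position interpolation: dividing positions by `scale` squeezes a longer
    # sequence back into the position range seen during training
    # (e.g. scale=4 turns 0, 1, 2, 3, 4 into 0, 0.25, 0.5, 0.75, 1,
    # matching the "4x more positions" example above).
    positions = torch.arange(seq_len).float() / scale
    angles = torch.outer(positions, inv_freq)        # (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

# Hypothetical use: a model trained on 8k positions extended to 32k.
cos, sin = rope_angles(seq_len=32_768, head_dim=128, scale=4.0)
```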

Any idea why Sliding Window Attention has been abandoned?
