How does v0.2 manage to support a 32k token context without Sliding Window Attention?

#85
by Andriy

How does v2 manage to have a raw 32k context size without sliding window attention? Full attention has quadratic space complexity; a 32k x 32k attention matrix would require more memory than any GPU supports. Even with FlashAttention 2, 8k seems to be the limit for other models. How does it work?
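For context on the quadratic-space point, here is a rough back-of-envelope sketch (my own numbers, assuming fp16 scores) of what materializing a full 32k x 32k score matrix would take:

```python
# Back-of-envelope check of the 32k x 32k concern above: memory needed to
# materialize one full attention score matrix in fp16 (illustrative only).
seq_len = 32_768
bytes_per_elem = 2                                  # fp16
score_matrix = seq_len * seq_len * bytes_per_elem
print(f"{score_matrix / 2**30:.1f} GiB per head per layer")   # -> 2.0 GiB
# FlashAttention-style kernels compute attention in tiles and never
# materialize this matrix, so peak memory grows linearly with sequence
# length (compute is still quadratic).
```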

It uses RoPE scaling

@singhay This doesn't solve the quadratic space and time complexity of attention in transformers.

You're right, it does not. It's a trick that changes how the positions of a sequence are embedded. Instead of 1, 2, 3, ..., the positions are interpolated, e.g. 1, 1.25, 1.5, 1.75, 2 -> 4x more positions can be incorporated.
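A minimal sketch of that position-interpolation idea, assuming a standard RoPE formulation (function name, defaults, and the scale factor are illustrative, not the model's actual config):

```python
import torch

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    # Standard RoPE inverse frequencies, one per pair of head dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Position interpolation: dividing positions by `scale` squeezes a longer
    # sequence back into the position range seen during training
    # (e.g. scale=4 turns 0, 1, 2, 3, 4 into 0, 0.25, 0.5, 0.75, 1,
    # matching the "4x more positions" example above).
    positions = torch.arange(seq_len).float() / scale
    angles = torch.outer(positions, inv_freq)        # (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

# Hypothetical use: a model trained on 8k positions extended to 32k.
cos, sin = rope_angles(seq_len=32_768, head_dim=128, scale=4.0)
```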

Any idea why Sliding Window Attention has been abandoned?
