Sliding window vs. Global Attention

#41
by tanliboy - opened

Since vLLM currently lacks sliding window support, how does this affect the model's performance?

Google org

Hi @tanliboy ,

Let's say Model 1 uses sliding window attention for every odd layer, while Model 2 ignores it and uses global attention for all layers (a toy sketch of the two masks follows the lists below).

For short articles:

  1. Model 1 might perform similarly to Model 2 on shorter articles because the sliding window attention can effectively capture most or all relevant dependencies within the article.
  2. Model 2 may have a slight edge if even short articles require linking distant information, but the difference might be negligible.

For long articles:

  1. Model 1 could struggle with long articles where important information is spread out. The sliding window might miss crucial connections, leading to summaries that overlook key points.
  2. Model 2 would likely outperform Model 1 by generating summaries that account for the entire article, identifying and linking information across different sections.
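To make the distinction concrete, here is a minimal toy sketch in PyTorch (a window of 4 tokens purely for readability; Gemma 2's actual window is 4096 tokens, and its real implementation differs) comparing the two attention masks:

```python
import torch

def attention_mask(seq_len, window=None):
    """True where query position i may attend to key position j.
    window=None -> global (full causal) attention, as in "Model 2".
    window=W    -> causal sliding-window attention, as in "Model 1"'s odd layers."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                          # never attend to future tokens
    return causal if window is None else causal & (i - j < window)

print(attention_mask(8).int())            # global: each row sees its whole prefix
print(attention_mask(8, window=4).int())  # sliding: each row sees only the last 4 tokens
```

With the sliding-window mask, a token near the end of a long article simply cannot attend to tokens that fell out of its window, which is why distant dependencies can be missed.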

Thank you.

@GopiUppari Thanks for the detailed explanation of the sliding window!
I wanted to clarify my question: Given that vLLM doesn't support sliding windows and Gemma 2 models require them on odd layers, would serving Gemma 2 models with vLLM lead to suboptimal performance for inputs exceeding 4096 tokens?
I understand it could increase computational cost and memory usage, but I'm unsure whether the (Q · K^T) scores outside the sliding window would be masked out or would still influence the final output.
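To illustrate what I mean, here is a toy example (not vLLM's actual code path) of the difference between masking and not masking the out-of-window scores before the softmax:

```python
import torch
import torch.nn.functional as F

# Toy illustration only, not vLLM internals: if the sliding-window mask is
# dropped, the Q·K^T scores for out-of-window keys are never set to -inf,
# so they keep softmax weight and do influence the output.
seq_len, d, window = 8, 16, 4
q, k, v = (torch.randn(seq_len, d) for _ in range(3))

scores = (q @ k.T) / d ** 0.5                    # raw attention scores
i = torch.arange(seq_len).unsqueeze(1)
j = torch.arange(seq_len).unsqueeze(0)
causal = j <= i                                  # standard causal mask
in_window = causal & (i - j < window)            # causal + sliding window

out_sliding = F.softmax(scores.masked_fill(~in_window, float("-inf")), dim=-1) @ v
out_global  = F.softmax(scores.masked_fill(~causal,    float("-inf")), dim=-1) @ v

# Rows beyond the window differ: the out-of-window keys were not masked out.
print((out_sliding - out_global).abs().max())
```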

I found this out based on the following warning:

 utils.py:721] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).

With vLLM, the max token length is therefore reduced from 8k to 4k to fit within the sliding window size.
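For reference, a rough sketch of how I'd load it under this cap (the model id and `max_model_len` are assumptions for my setup, and parameter names may differ by vLLM version):

```python
from vllm import LLM, SamplingParams

# Hypothetical setup: explicitly cap the context at 4096; vLLM would cap it
# to the sliding window size here anyway, since sliding-window attention
# is disabled for Gemma 2.
llm = LLM(model="google/gemma-2-9b-it", max_model_len=4096)
outputs = llm.generate(["Summarize the following article: ..."],
                       SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```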

Hi @tanliboy , in order to fully utilize the original context length that Gemma was trained on, other serving tools (e.g. SGLang) would be a great choice :)
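For example, a rough sketch of serving Gemma 2 behind SGLang's OpenAI-compatible server (treat the exact flags, port, and endpoint as assumptions and check them against your installed version):

```python
# Launch command per SGLang's docs (verify against your version):
#   python -m sglang.launch_server --model-path google/gemma-2-9b-it --port 30000
# The server exposes an OpenAI-compatible API, so a standard client can call it:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="google/gemma-2-9b-it",
    messages=[{"role": "user", "content": "Summarize this long article: ..."}],
)
print(resp.choices[0].message.content)
```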

Thank you, @Sangsang ! I cannot wait to try out the 6x performance improvement of SGLang.
