sliding_window = 131072? Sliding window attention doesn't work for 128k?

#4
by keyishen - opened

Does this mean the yarn-mistral version discards the sliding window attention used in Mistral-7B?

NousResearch org
edited Nov 3, 2023

Sliding window attention does work, but to take advantage of the full 128k context you should set the window to 128k (131072 tokens). Smaller windows lower VRAM requirements, but they degrade perplexity (PPL) and shrink the range of context the model can retrieve information from.
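
A minimal sketch of overriding the window at load time, assuming the Hugging Face transformers library; the repo ID `NousResearch/Yarn-Mistral-7b-128k` is an assumption based on context, not something stated in this thread:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "NousResearch/Yarn-Mistral-7b-128k"  # assumed checkpoint name

# Load the shipped config and set the sliding window to the full 128k
# context (131072 tokens) so attention can span the entire input.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.sliding_window = 131072

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,  # YaRN models may ship custom modeling code
)

# Trade-off: a smaller window (e.g. config.sliding_window = 32768) lowers
# VRAM use but hurts PPL and limits how far back the model can retrieve.
```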
