What is the global attention span and sliding window?

#23
by ducknificient - opened

The Gemma 2 paper mentions a global attention span of 8192 tokens and a sliding window of 4096 tokens.

But what about Gemma (first generation)?

Best regards,

Google org

Hi @ducknificient,

There is no specific mention of a global attention span or sliding window for the Gemma 1 models. The official Gemma paper only specifies a context length, which is set to 8192 tokens. You can find more details in the Gemma paper.
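If you want to verify this yourself, a minimal sketch below inspects the Hugging Face configs of one Gemma 1 and one Gemma 2 checkpoint (model IDs are assumed for illustration; both are gated, so you may need an access token, and Gemma 2 requires a recent `transformers` version):

```python
from transformers import AutoConfig

# Compare a Gemma 1 and a Gemma 2 checkpoint (example model IDs, both gated on the Hub).
for model_id in ["google/gemma-7b", "google/gemma-2-9b"]:
    config = AutoConfig.from_pretrained(model_id)
    print(model_id)
    # Context length is exposed as max_position_embeddings in both generations.
    print("  context length:", config.max_position_embeddings)
    # Gemma 1 configs have no sliding_window field, so fall back to None.
    print("  sliding window:", getattr(config, "sliding_window", None))
```

The Gemma 2 config should report a `sliding_window` of 4096, while the Gemma 1 config has no such field, since it uses full attention over its 8192-token context.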

Thank you.

ducknificient changed discussion status to closed
