Default to eager attention

#1
by lysandre (HF staff)

Hello! You may want to do the same so that eager attention is the default for third-party users of your model :)

https://huggingface.co/google/gemma-2-27b-it/discussions/22
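Concretely, this is roughly what each downstream user has to do by hand until eager attention is the default in the model config (a minimal sketch using the standard transformers loading API; the repo id below is only a placeholder, not your actual model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id for illustration; substitute the actual Gemma2 Chinese model.
model_id = "your-org/your-gemma2-chinese-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    # Gemma2 uses attention logit soft-capping, which the non-eager kernels
    # did not fully support at release time, hence the eager recommendation.
    attn_implementation="eager",
)
```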

Thank you for your suggestion!

However, we've found that fine-tuning the Gemma2 models with eager attention at a context length of 8192 requires an excessively large amount of VRAM, even for the smallest 9B model, which makes it infeasible for us. Consequently, we are opting to use flash attention for fine-tuning.
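For context, eager attention materializes the full attention score matrix, so at 8192 tokens the activation memory per layer grows quadratically, which is where the VRAM blow-up comes from. Below is a minimal sketch of the kind of flash-attention setup used instead (assuming the flash-attn package is installed; the base-model id and options are illustrative, not our exact training configuration):

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative base model; the actual fine-tuning setup may differ.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

# Gradient checkpointing is a common companion memory saver for long-context fine-tuning.
model.gradient_checkpointing_enable()
```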

Should a fully compatible flash attention implementation for the Gemma2 model become available, we will consider retraining a new version of the Gemma2 Chinese models.

shenzhi-wang pinned discussion

All of the following were run with ollama:
Temporal reasoning question: if today is Tuesday, what day of the week will it be 72 hours from now? This model answered: Sunday. The untuned model answers it correctly.
Spatial reasoning question: every face of a cube is painted. If the cube is cut into 27 identical smaller cubes, at most how many of the small cubes have no paint on any face? This model answered: 0. The untuned model answers correctly: 1.
