Default to eager attention

#1
by lysandre (HF staff)

Hello! You may want to do the same so that eager attention is the default for third-party users of your model :)

https://huggingface.co/google/gemma-2-27b-it/discussions/22
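Concretely, this is roughly what each downstream user has to do by hand until eager attention is the default in the model config (a minimal sketch using the standard transformers loading API; the repo id below is only a placeholder, not your actual model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id for illustration; substitute the actual Gemma2 Chinese model.
model_id = "your-org/your-gemma2-chinese-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    # Gemma2 uses attention logit soft-capping, which the non-eager kernels
    # did not fully support at release time, hence the eager recommendation.
    attn_implementation="eager",
)
```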

Thank you for your suggestion!

However, we've found that fine-tuning the Gemma2 models with eager attention at a context length of 8192 requires an excessively large amount of VRAM, even for the smallest 9B model, which makes it infeasible for us. Consequently, we are opting to use flash attention for fine-tuning.
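For context, eager attention materializes the full attention score matrix, so at 8192 tokens the activation memory per layer grows quadratically, which is where the VRAM blow-up comes from. Below is a minimal sketch of the kind of flash-attention setup used instead (assuming the flash-attn package is installed; the base-model id and options are illustrative, not our exact training configuration):

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative base model; the actual fine-tuning setup may differ.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

# Gradient checkpointing is a common companion memory saver for long-context fine-tuning.
model.gradient_checkpointing_enable()
```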

Should a fully compatible flash attention implementation for the Gemma2 model become available, we will consider retraining a new version of the Gemma2 Chinese models.

shenzhi-wang pinned discussion

All of the following were run with ollama:
Temporal reasoning question: if today is Tuesday, what day of the week will it be 72 hours from now? This model answered: Sunday. The untuned model answers it correctly.
Spatial reasoning question: every face of a cube is painted. If the cube is cut into 27 identical smaller cubes, at most how many of the small cubes have no paint on any face? This model answered: 0. The untuned model answers correctly: 1.
