GemmaSdpaAttention vs GemmaAttention
#71
by
canqin001
- opened
Hi. I have loaded the gemma-2b model on two different machines. One uses "GemmaSdpaAttention" and the other uses "GemmaAttention". The results from these two differ even though I used the same checkpoint. Has anyone seen a similar problem and does anyone know the reason? Thanks!
canqin001
changed discussion status to
closed
@canqin001 out of curiosity, have you found the root cause?
I have not found the root cause, but I did find a workaround. Please refer to https://github.com/huggingface/transformers/blob/c409cd81777fb27aadc043ed3d8339dbc020fb3b/src/transformers/models/gemma/modeling_gemma.py#L558
The model's `config._attn_implementation` decides which attention class is instantiated, so setting it explicitly makes both machines use the same version.
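To make the point above concrete, here is a minimal sketch of forcing one backend on both machines. It assumes a recent `transformers` release that accepts the public `attn_implementation` keyword in `from_pretrained` (which sets `config._attn_implementation` internally), and that you have access to the `google/gemma-2b` checkpoint:

```python
# Sketch: pin the attention backend so two machines build the same module.
# Assumes transformers >= 4.38 and access to the google/gemma-2b checkpoint.
import torch
from transformers import AutoModelForCausalLM

# Backend names accepted by recent transformers releases.
VALID_IMPLEMENTATIONS = ("eager", "sdpa", "flash_attention_2")


def load_gemma(impl: str = "eager"):
    """Load gemma-2b with an explicit attention implementation."""
    if impl not in VALID_IMPLEMENTATIONS:
        raise ValueError(f"unknown attn_implementation: {impl!r}")
    return AutoModelForCausalLM.from_pretrained(
        "google/gemma-2b",
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,  # overrides the auto-detected backend
    )


if __name__ == "__main__":
    model = load_gemma("eager")
    # With "eager" the layers use GemmaAttention; with "sdpa",
    # GemmaSdpaAttention is selected instead.
    print(type(model.model.layers[0].self_attn).__name__)
```

Note that the backends are meant to be numerically equivalent up to floating-point precision; small differences in outputs between "eager" and "sdpa" are expected, but pinning one backend removes the discrepancy between machines.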