GemmaSdpaAttention vs GemmaAttention
#71
by
canqin001
- opened
Hi. I have loaded the gemma-2b model on two different machines. One uses "GemmaSdpaAttention" and the other uses "GemmaAttention". The results from these two differ even though I used the same checkpoint. Has anyone seen a similar problem and does anyone know the reason? Thanks!
canqin001
changed discussion status to
closed
@canqin001 out of curiosity, have you found the root cause?
I have not found the root cause, but I did find a workaround. Please refer to https://github.com/huggingface/transformers/blob/c409cd81777fb27aadc043ed3d8339dbc020fb3b/src/transformers/models/gemma/modeling_gemma.py#L558
The model's `config._attn_implementation` decides which attention class is instantiated, so setting it explicitly makes both machines use the same version.
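To make the point above concrete, here is a minimal sketch of forcing one backend on both machines. It assumes a recent `transformers` release that accepts the public `attn_implementation` keyword in `from_pretrained` (which sets `config._attn_implementation` internally), and that you have access to the `google/gemma-2b` checkpoint:

```python
# Sketch: pin the attention backend so two machines build the same module.
# Assumes transformers >= 4.38 and access to the google/gemma-2b checkpoint.
import torch
from transformers import AutoModelForCausalLM

# Backend names accepted by recent transformers releases.
VALID_IMPLEMENTATIONS = ("eager", "sdpa", "flash_attention_2")


def load_gemma(impl: str = "eager"):
    """Load gemma-2b with an explicit attention implementation."""
    if impl not in VALID_IMPLEMENTATIONS:
        raise ValueError(f"unknown attn_implementation: {impl!r}")
    return AutoModelForCausalLM.from_pretrained(
        "google/gemma-2b",
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,  # overrides the auto-detected backend
    )


if __name__ == "__main__":
    model = load_gemma("eager")
    # With "eager" the layers use GemmaAttention; with "sdpa",
    # GemmaSdpaAttention is selected instead.
    print(type(model.model.layers[0].self_attn).__name__)
```

Note that the backends are meant to be numerically equivalent up to floating-point precision; small differences in outputs between "eager" and "sdpa" are expected, but pinning one backend removes the discrepancy between machines.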