Gemma 2's Flash Attention 2 implementation is strange...
I tested with torch.manual_seed(0).
eager attention
=> normal result
flash attention 2
=> 1's not. The 2's "to be's for.3' for4. 2 That 4 2 the 4 that 4 for. 4's 4' to 4''' the 4'' to. 4' 4 4 4to lose to. 4 the' 4 4 4' 4' 4 the 4 the 4 4 4 ...
It is almost the same as the output you'd get with no attention at all.
With "eager", it works fine.
Yes, it should be fixed once you install a new version of flash-attention from source.
I installed it yesterday 😅
And on Windows, so it took a few hours 😨
```
pip freeze | findstr flash-attn
flash-attn==2.5.9.post1
```
Took 2 hours, but finally installed flash-attention >= 2.6.0
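In case it helps anyone, the from-source build was roughly this (MAX_JOBS just limits parallel compile jobs to keep RAM usage down; adjust or drop it, and use your shell's own env-var syntax on Windows):

```bash
# build flash-attention >= 2.6.0 from source (slow, especially on Windows)
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
MAX_JOBS=4 pip install . --no-build-isolation

# verify the installed version
pip show flash-attn
```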
It works just by changing these lines, but it is a bit slower than without flash attention and it uses the same amount of memory.
Maybe there is still something broken.
It does output a good response, though.
Started process with the eager attn_implementation.
The eager attn_implementation took 15.17s to infer {tokens} tokens.
Started process with the sdpa attn_implementation.
The sdpa attn_implementation took 21.51s to infer {tokens} tokens.
Started process with the flash_attention_2 attn_implementation.
The flash_attention_2 attn_implementation took 30.53s to infer {tokens} tokens.
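For context, the timing loop was along these lines; the checkpoint, prompt, and max_new_tokens are placeholders rather than the exact benchmark:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Tell me about the sea.", return_tensors="pt").to("cuda")

for impl in ("eager", "sdpa", "flash_attention_2"):
    print(f"Started process with the {impl} attn_implementation.")
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,
    ).to("cuda")
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"The {impl} attn_implementation took {elapsed:.2f}s to infer {tokens} tokens.")
    del model
    torch.cuda.empty_cache()
```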
Yes, something is very wrong. It probably won't be fixed.
This might fix it, at least the memory part: https://github.com/huggingface/transformers/pull/31292
I know, but we need to ask for it to be applied to Gemma 2, not only to Gemma (1).
All Gemmas are included, as far as I know.
I looked at the commits, and it changes the global generation utils AND "the most used models", which includes Gemma (1) but not Gemma 2.
Gemma 2 was not released yet when I started this, but don't worry, I will add it as well; it's on the roadmap 🤗