
FlashAttention fails to run on a Tesla V100

#37
by dkwwww - opened

On a single Tesla V100, when testing the qwen-7b model with use_flash_attn=True, the following warnings are reported:
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm

In the end inference also fails, with the following error:
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Does FlashAttention only run on GPUs with specific architectures?

Qwen org

Flash attention is an optional component for accelerating model training and inference, and it only supports NVIDIA GPUs with the Turing, Ampere, Ada, or Hopper architectures (e.g. H100, A100, RTX 3090, T4, RTX 2080). You can run inference with the model normally without installing flash attention.
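As a minimal sketch of running without flash attention, assuming the use_flash_attn flag mentioned in this thread is accepted as a from_pretrained keyword (on some Qwen versions it may instead need to be set on the config), loading could look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID assumed from the thread; adjust to the checkpoint you are using.
model_id = "Qwen/Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,   # Qwen's custom modeling code defines use_flash_attn
    use_flash_attn=False,     # skip FlashAttention on pre-Ampere GPUs such as the V100
).eval()
```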

For further questions, please see the FAQ section of our GitHub repository. Thanks for your support!

jklj077 changed discussion status to closed

I'm using an AWS EC2 instance of type g4dn (with a T4 GPU) and ran into the same error: it first warned that some components would run on the CPU, and then reported "FlashAttention only supports Ampere GPUs or newer". Only after uninstalling flash attention did the model work normally.
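Instead of uninstalling, one option is to decide the flag at runtime. Below is a hypothetical helper (not part of Qwen's code) that checks the GPU's compute capability; the error quoted in this thread comes from a FlashAttention build that requires Ampere (SM 8.0) or newer, while the V100 is SM 7.0 and the T4 is SM 7.5:

```python
import torch

def flash_attn_supported() -> bool:
    # Hypothetical helper: enable FlashAttention only on Ampere (SM 8.0) or newer,
    # matching the "FlashAttention only supports Ampere GPUs or newer" error above.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8

use_flash_attn = flash_attn_supported()  # pass this when loading the model
```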
