ValueError: QWenLMHeadModel does not support Flash Attention 2.0 yet.

#1
by sanjeev-bhandari01 - opened

Will there be a flash_attention implementation for this model?

Qwen org

Qwen (1.0) uses custom code and automatically enables flash-attention (v2), so you don't need to pass the argument when loading the model. Related info will be printed to the logs. See the following for more info: https://github.com/QwenLM/Qwen
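
For reference, a minimal loading sketch (assuming the Qwen/Qwen-7B-Chat checkpoint; substitute the model id you are actually using). The remote code decides on its own whether flash-attention v2 can be used and prints that decision to the logs:

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required: Qwen(1.0) ships its own modeling code,
# which probes for flash-attn itself; do not pass attn_implementation or
# use_flash_attention_2 yourself.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
).eval()

response, history = model.chat(tokenizer, "How are you", history=None)
print(response)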

Environment: Google Colab
GPU Info:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P0              30W /  70W |  12151MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Error traceback:

RuntimeError                              Traceback (most recent call last)
<ipython-input-3-627d21222f84> in <cell line: 1>()
----> 1 response, history = model.chat(tokenizer, "How are you", history=None)
      2 print(response)

26 frames
/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py in _flash_attn_forward(q, k, v, dropout_p, softmax_scale, causal, window_size, alibi_slopes, return_softmax)
     49     maybe_contiguous = lambda x: x.contiguous() if x.stride(-1) != 1 else x
     50     q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
---> 51     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
     52         q,
     53         k,
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

Now I am facing this problem. Thank you.

Updated explanation:

The Google Colab free-tier GPU (Tesla T4, a Turing card) does not support FlashAttention 2, which requires Ampere or newer.
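
A possible workaround on a pre-Ampere GPU such as the Colab T4 (compute capability 7.5) is to disable flash-attention when loading. This is only a sketch, assuming the use_flash_attn switch described in the QwenLM/Qwen repository; if your checkpoint's custom code does not accept it, uninstalling flash-attn has a similar effect, since the remote code only enables it when the package imports successfully:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# FlashAttention 2 needs Ampere (SM 8.0) or newer; the Tesla T4 is SM 7.5.
major, _ = torch.cuda.get_device_capability()
use_flash = major >= 8

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=use_flash,  # assumption: Qwen(1.0) config flag; falls back to standard attention when False
).eval()

response, history = model.chat(tokenizer, "How are you", history=None)
print(response)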
