Add ApplyRoPE and RMSNorm kernels written in OpenAI Triton to `dev_triton` branch
#9
opened by wangzihan99

Files changed:
- README.md +1 -8
- assets/wechat.png +0 -0
- modeling_qwen.py +6 -4
README.md
CHANGED
@@ -6,9 +6,6 @@ tags:
 - qwen
 pipeline_tag: text-generation
 inference: false
-license: other
-license_name: tongyi-qianwen-license-agreement
-license_link: https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
 ---
 
 # Qwen-7B-Chat-Int4
@@ -21,7 +18,7 @@ license_link: https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENS
 <p align="center">
 🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>   |    📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a>    |   🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
 <br>
-<a href="
+<a href="assets/wechat.png">WeChat (微信)</a>   |   <a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>   |   <a href="https://dashscope.aliyun.com">API</a>
 </p>
 <br>
 
@@ -70,10 +67,6 @@ cd flash-attention && pip install .
 # pip install csrc/layer_norm
 # pip install csrc/rotary
 ```
-
-如果您有更高推理性能方面的需求,但上述可选加速项`layer_norm`及`rotary`未能安装成功,或是您所使用的GPU不满足`flash-attention`库所要求的NVIDIA Ampere/Ada/Hopper架构,您可以尝试切换至dev_triton分支,使用该分支下基于Triton实现的推理加速方案。该方案适用于更宽范围的GPU产品,在pytorch 2.0及以上版本原生支持,无需额外安装操作。
-
-If you require higher inference performance yet encounter some problems when installing the optional acceleration features (i.e., `layer_norm` and `rotary`) or if the GPU you are using does not meet the NVIDIA Ampere/Ada/Hopper architecture required by the `flash-attention` library, you may switch to the dev_triton branch and consider trying the inference acceleration solution implemented with Triton in this branch. This solution adapts to a wider range of GPU products and does not require extra package installation with pytorch version 2.0 and above.
 <br>
 
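The bilingual note in the last hunk describes the Triton-based inference acceleration that lives on the dev_triton branch, and the PR title names RMSNorm as one of the two kernels being added there. For orientation only, here is a minimal sketch of what an RMSNorm forward kernel in Triton can look like; the kernel name, block sizing, and Python wrapper are my own assumptions, not the implementation shipped in this PR:

```python
# Illustrative sketch only -- not the kernel on dev_triton. Expects CUDA tensors.
import torch
import triton
import triton.language as tl


@triton.jit
def _rmsnorm_fwd(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row of the (n_rows, n_cols) input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: x / sqrt(mean(x^2) + eps), scaled by a learned weight per channel.
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)


def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Flatten leading dims so each row is one hidden-state vector.
    orig_shape = x.shape
    x2d = x.reshape(-1, orig_shape[-1]).contiguous()
    out = torch.empty_like(x2d, dtype=torch.float32)
    BLOCK_SIZE = triton.next_power_of_2(x2d.shape[-1])
    _rmsnorm_fwd[(x2d.shape[0],)](x2d, weight, out, x2d.shape[-1], eps, BLOCK_SIZE=BLOCK_SIZE)
    return out.reshape(orig_shape).to(x.dtype)
```

The pattern is one program per row with a single block-masked reduction, which is why it runs on any GPU PyTorch 2.0 supports rather than only the architectures flash-attention targets; a production kernel would also tune the launch configuration and provide a backward pass.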
assets/wechat.png
CHANGED
(binary image file updated)
modeling_qwen.py
CHANGED
@@ -520,7 +520,9 @@ class QWenAttention(nn.Module):
 
         if not self.use_cache_quantization and SUPPORT_TORCH2:
             if attention_mask is not None:
-                attention_mask = attention_mask.expand(
+                attention_mask = attention_mask.expand(
+                    -1, -1, causal_mask.size(2), -1
+                )
                 if causal_mask is not None:
                     attention_mask = attention_mask.masked_fill(~causal_mask, torch.finfo(query.dtype).min)
             else:
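The hunk above turns the expand call into an explicit broadcast of the padding mask along the query axis, so it lines up element-for-element with the causal mask before masked_fill. A small shape check, with batch/sequence sizes and the (batch, 1, 1, kv_len) mask layout assumed for illustration rather than taken from the repo:

```python
import torch

bsz, q_len, kv_len = 2, 5, 5

# Padding mask as typically built upstream: one entry per key position.
attention_mask = torch.zeros(bsz, 1, 1, kv_len)
# Lower-triangular causal mask: True where attention is allowed.
causal_mask = torch.tril(torch.ones(1, 1, q_len, kv_len)).bool()

# Expanding dim 2 (the query axis) gives (bsz, 1, q_len, kv_len), which matches
# causal_mask, so masked_fill can merge padding and causality into one mask.
expanded = attention_mask.expand(-1, -1, causal_mask.size(2), -1)
combined = expanded.masked_fill(~causal_mask, torch.finfo(torch.float32).min)
print(expanded.shape, combined.shape)  # torch.Size([2, 1, 5, 5]) torch.Size([2, 1, 5, 5])
```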
@@ -1328,14 +1330,14 @@ def apply_rotary_pos_emb(t, freqs):
         t (tensor(batch_size, seq_len, n_head, head_dim)):
             the input embedding/hidden states
         freqs (list[tensor(1, seq_len, 1, rotary_dim), tensor(1, seq_len, 1, rotary_dim)]):
-            the cached cos/sin position embeddings
+            the cached cos/sin position embeddings
     """
     rot_dim = freqs[0].shape[-1]
     cos, sin = freqs
     t_float = t.float()
     if apply_rotary_emb_func is not None and t.is_cuda:
-        # apply_rotary_emb in flash_attn requires cos/sin to be of
-        # shape (seqlen, rotary_dim / 2) and apply rotary embedding
+        # apply_rotary_emb in flash_attn requires cos/sin to be of
+        # shape (seqlen, rotary_dim / 2) and apply rotary embedding
         # to the first rotary_dim of the input
         cos = cos.squeeze(0).squeeze(1)[:, : rot_dim // 2]
         sin = sin.squeeze(0).squeeze(1)[:, : rot_dim // 2]
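This second hunk touches apply_rotary_pos_emb, whose docstring pins down the shapes any ApplyRoPE kernel has to honour: t is (batch_size, seq_len, n_head, head_dim) and the cached cos/sin are (1, seq_len, 1, rotary_dim). As a reference for what the rotation itself computes, here is a hedged pure-PyTorch sketch of applying RoPE to the first rot_dim channels; the function names and half-rotation layout are assumptions of mine, not code from the repo:

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the rotary channels into two halves and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope_reference(t: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # t:       (batch_size, seq_len, n_head, head_dim)
    # cos/sin: (1, seq_len, 1, rotary_dim), broadcast over batch and heads.
    rot_dim = cos.shape[-1]
    t_rot, t_pass = t[..., :rot_dim].float(), t[..., rot_dim:].float()
    # Rotate only the first rot_dim channels; the rest pass through unchanged.
    t_rot = t_rot * cos + rotate_half(t_rot) * sin
    return torch.cat((t_rot, t_pass), dim=-1).type_as(t)
```

The flash-attn path in the hunk instead slices cos/sin down to (seqlen, rotary_dim / 2), as its comment notes, so a Triton ApplyRoPE kernel simply has to agree with whichever cos/sin layout the cache hands it.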