Update README of branch dev_triton.
#11
by Cheshire94 · opened
- README.md +28 -1
- assets/wechat.png +0 -0
- config.json +1 -0
- configuration_qwen.py +2 -0
- modeling_qwen.py +36 -8
- triton_kernels.py +125 -0
README.md
CHANGED
@@ -18,7 +18,7 @@ inference: false
 <p align="center">
         🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>   |    📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a>    |   🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
 <br>
-<a href="assets/wechat.png">WeChat (微信)</a>   |   <a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>   |   <a href="https://dashscope.aliyun.com">API</a>
+<a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>   |   <a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>   |   <a href="https://dashscope.aliyun.com">API</a>
 </p>
 <br>
@@ -67,6 +67,14 @@ cd flash-attention && pip install .
 # pip install csrc/layer_norm
 # pip install csrc/rotary
 ```
+
+如果您有更高推理性能方面的需求,但上述可选加速项`layer_norm`及`rotary`未能安装成功,或是您所使用的GPU不满足`flash-attention`库所要求的NVIDIA Ampere/Ada/Hopper架构,您可以尝试使用该分支下基于Triton进行实现的推理加速方案。该方案适用于更宽范围的GPU产品,且无需安装。您可以通过将config.json里的`use_triton`设置为true来进行启用。
+
+**(在dev_triton分支下`use_triton`默认设置为auto,由于pytorch 2.0及以上版本已默认安装了Triton,因此上述优化方案无需其它安装与配置操作即可直接启用。如果您不想开启该优化,请将config.json里的`use_triton`设置为false)**
+
+If you require higher inference performance but the optional acceleration modules (`layer_norm` and `rotary`) fail to install, or your GPU does not meet the NVIDIA Ampere/Ada/Hopper architecture required by the `flash-attention` library, you can try the Triton-based inference acceleration implemented in this branch. It works on a wider range of GPU products and requires no extra installation. Enable it by setting the `use_triton` option to true in config.json.
+
+**(In the dev_triton branch, `use_triton` is set to "auto" by default. As Triton is pre-installed with PyTorch 2.0 and above, this acceleration can be enabled directly without additional installation or configuration. If you prefer not to activate this optimization, set `use_triton` to false in config.json.)**
 <br>
 
 
@@ -140,6 +148,25 @@ In detail, the setting of profiling is generating 8192 new tokens with 1 context
 
 Note: The generation speed of the Int4/Int8 models mentioned above is provided by the autogptq library. The current speed of the model loaded using "AutoModelForCausalLM.from_pretrained" will be approximately 20% slower. We have reported this issue to the HuggingFace team and will update it promptly if a solution is available.
 
+另外,我们也测算了在使用不同GPU及推理加速方法时Qwen-7B-Chat-Int4模型生成2048和8192个token的平均推理速度。所有评测均使用PyTorch 2.1.0和CUDA 11.8。
+
+In addition, we measured the average inference speed (tokens/s) of the Qwen-7B-Chat-Int4 model when generating 2048 and 8192 tokens with different GPU devices and acceleration methods. All results were measured with PyTorch 2.1.0 and CUDA 11.8.
+
+| GPU Device | Method       | Speed (2048 tokens) | Speed (8192 tokens) |
+| :--------: | :----------: | :------------------:| :------------------:|
+| A10        | FlashAttn v2 | 41.28               | 30.78               |
+| A10        | Triton       | 49.04               | 29.17               |
+| A10        | Disabled     | 39.26               | 26.81               |
+| V100       | FlashAttn v2 | N/A                 | N/A                 |
+| V100       | Triton       | 37.01               | 27.66               |
+| V100       | Disabled     | 24.47               | 20.40               |
+| P100       | FlashAttn v2 | N/A                 | N/A                 |
+| P100       | Triton       | 29.03               | 13.85               |
+| P100       | Disabled     | 20.50               | 12.73               |
+| T4         | FlashAttn v2 | N/A                 | N/A                 |
+| T4         | Triton       | 27.98               | 15.22               |
+| T4         | Disabled     | 23.11               | 14.55               |
+
 ### 显存使用 (GPU Memory Usage)
 
 我们还测算了不同模型精度编码2048个token及生成8192个token的峰值显存占用情况。(显存消耗在是否使用FlashAttn的情况下均类似。)结果如下所示:
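The README text above tells users to flip `use_triton` in config.json. As a quick illustration (not part of this diff), here is a minimal loading sketch; the checkpoint path is a placeholder, and it uses only standard `transformers` APIs plus Qwen's documented `model.chat` interface, so the flag is routed into the `QWenConfig` change shown further down.

```python
# Minimal sketch (not part of this PR): load the dev_triton checkpoint and control the
# Triton path through the config. "Qwen-7B-Chat" is a placeholder for a local clone of
# this branch; trust_remote_code=True is required to pick up modeling_qwen.py.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

path = "Qwen-7B-Chat"  # placeholder path to the dev_triton checkpoint
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
config.use_triton = True  # True / False / "auto" ("auto" enables it on torch >= 2.0)

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, config=config, device_map="auto", trust_remote_code=True
).eval()

response, _ = model.chat(tokenizer, "你好", history=None)
print(response)
```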
assets/wechat.png
CHANGED
config.json
CHANGED
@@ -44,6 +44,7 @@
   "use_cache": true,
   "use_dynamic_ntk": true,
   "use_flash_attn": "auto",
+  "use_triton": "auto",
   "use_logn_attn": true,
   "vocab_size": 151936
 }
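For users who prefer editing the checkpoint directly rather than overriding the config at load time, a small sketch (the local path is an assumption) that flips the new key without touching the rest of config.json:

```python
# Sketch: toggle the new "use_triton" key in config.json.
# Accepted values, per this PR: "auto" (default), true, or false.
import json

cfg_path = "Qwen-7B-Chat/config.json"  # placeholder path to the local checkpoint
with open(cfg_path, encoding="utf-8") as f:
    cfg = json.load(f)

cfg["use_triton"] = False  # opt out of the Triton kernels

with open(cfg_path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
```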
configuration_qwen.py
CHANGED
@@ -32,6 +32,7 @@ class QWenConfig(PretrainedConfig):
         use_dynamic_ntk=True,
         use_logn_attn=True,
         use_flash_attn="auto",
+        use_triton="auto",
         intermediate_size=22016,
         no_bias=True,
         tie_word_embeddings=False,
@@ -61,6 +62,7 @@ class QWenConfig(PretrainedConfig):
         self.use_dynamic_ntk = use_dynamic_ntk
         self.use_logn_attn = use_logn_attn
         self.use_flash_attn = use_flash_attn
+        self.use_triton = use_triton
         self.no_bias = no_bias
         self.use_cache_quantization = use_cache_quantization
         self.use_cache_kernel = use_cache_kernel
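The configuration change is pure plumbing: `use_triton` is accepted as a keyword argument and stored verbatim on the config object. A quick check, assuming this branch's configuration_qwen.py is importable from the working directory:

```python
# Sketch: the new argument defaults to "auto" and is stored as-is on the config.
from configuration_qwen import QWenConfig

cfg = QWenConfig()
print(cfg.use_flash_attn, cfg.use_triton)  # auto auto

cfg = QWenConfig(use_triton=False)  # explicit opt-out, mirroring config.json
print(cfg.use_triton)               # False
```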
modeling_qwen.py
CHANGED
@@ -35,7 +35,7 @@ except ImportError:
 from torch import nn
 
 SUPPORT_CUDA = torch.cuda.is_available()
-SUPPORT_BF16 = SUPPORT_CUDA and torch.cuda.is_bf16_supported()
+SUPPORT_BF16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 8
 SUPPORT_FP16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 7
 SUPPORT_TORCH2 = hasattr(torch, '__version__') and int(torch.__version__.split(".")[0]) >= 2
 
@@ -76,7 +76,9 @@ We detect you have activated flash attention support, but running model computat
 """
 
 apply_rotary_emb_func = None
+apply_rotary_emb_func_triton = None
 rms_norm = None
+rms_norm_triton = None
 flash_attn_unpadded_func = None
 flash_attn_func = None
 
@@ -120,6 +122,24 @@ def _import_flash_attn():
             "https://github.com/Dao-AILab/flash-attention"
         )
 
+def _import_triton():
+    global apply_rotary_emb_func_triton, rms_norm_triton
+    try:
+        from .triton_kernels import apply_rotary_emb as __apply_rotary_emb, rms_norm as __rms_norm
+        if apply_rotary_emb_func is not None:
+            logger.warn(
+                "Using Triton rotary kernel instead of flash_attn for inference."
+            )
+        apply_rotary_emb_func_triton = __apply_rotary_emb
+        if rms_norm is not None:
+            logger.warn(
+                "Using Triton rms_norm kernel instead of flash_attn for inference."
+            )
+        rms_norm_triton = __rms_norm
+    except ImportError:
+        logger.warn("Warning: Failed to import Triton kernels.")
+        return
+
 def quantize_cache_v(fdata, bits, qmax, qmin):
     # b, s, head, h-dim->b, head, s, h-dim
     qtype = torch.uint8
@@ -520,11 +540,9 @@ class QWenAttention(nn.Module):
 
         if not self.use_cache_quantization and SUPPORT_TORCH2:
             if attention_mask is not None:
-                attention_mask = attention_mask.expand(
-                    -1, -1, causal_mask.size(2), -1
-                )
+                attention_mask = attention_mask.expand(-1, -1, query.size(2), -1)
                 if causal_mask is not None:
                     attention_mask = attention_mask.masked_fill(~causal_mask, torch.finfo(query.dtype).min)
             else:
                 attention_mask = causal_mask
             attn_output = F.scaled_dot_product_attention(
@@ -978,6 +996,12 @@ class QWenLMHeadModel(QWenPreTrainedModel):
         if config.use_flash_attn:
             _import_flash_attn()
 
+        if config.use_triton == "auto":
+            logger.warn("Try importing Triton kernels for faster inference...")
+            config.use_triton = SUPPORT_TORCH2
+        if config.use_triton:
+            _import_triton()
+
         self.transformer = QWenModel(config)
         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
 
@@ -1330,12 +1354,14 @@ def apply_rotary_pos_emb(t, freqs):
     t (tensor(batch_size, seq_len, n_head, head_dim)):
         the input embedding/hidden states
     freqs (list[tensor(1, seq_len, 1, rotary_dim), tensor(1, seq_len, 1, rotary_dim)]):
         the cached cos/sin position embeddings
     """
     rot_dim = freqs[0].shape[-1]
     cos, sin = freqs
     t_float = t.float()
-    if apply_rotary_emb_func is not None and t.is_cuda:
+    if apply_rotary_emb_func_triton is not None and t.is_cuda and (not t.requires_grad):
+        return apply_rotary_emb_func_triton(t, cos, sin)
+    elif apply_rotary_emb_func is not None and t.is_cuda:
         # apply_rotary_emb in flash_attn requires cos/sin to be of
         # shape (seqlen, rotary_dim / 2) and apply rotary embedding
         # to the first rotary_dim of the input
@@ -1358,7 +1384,9 @@ class RMSNorm(torch.nn.Module):
         return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
 
     def forward(self, x):
-        if rms_norm is not None and x.is_cuda:
+        if rms_norm_triton is not None and x.is_cuda and (not x.requires_grad):
+            return rms_norm_triton(x, self.weight, self.eps)
+        elif rms_norm is not None and x.is_cuda:
             return rms_norm(x, self.weight, self.eps)
         else:
             output = self._norm(x.float()).type_as(x)
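The runtime effect of the changes above is a fallback chain in `apply_rotary_pos_emb` and `RMSNorm.forward`: use the Triton kernel for CUDA tensors during inference (no grad), otherwise the flash-attn kernel, otherwise pure PyTorch. A self-contained sketch of that order, with stand-in names (`triton_kernel`, `flash_kernel`) for the module-level globals set by `_import_triton()` and `_import_flash_attn()`:

```python
# Sketch of the dispatch order this PR adds; not the PR's code verbatim.
import torch


def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Mirrors RMSNorm._norm plus the weight scaling in modeling_qwen.py.
    normed = (x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + eps)).type_as(x)
    return normed * weight


def dispatch_rms_norm(x, weight, eps, triton_kernel=None, flash_kernel=None):
    if triton_kernel is not None and x.is_cuda and not x.requires_grad:
        return triton_kernel(x, weight, eps)   # Triton path: CUDA, inference only
    if flash_kernel is not None and x.is_cuda:
        return flash_kernel(x, weight, eps)    # flash-attn fused rms_norm
    return rms_norm_reference(x, weight, eps)  # portable PyTorch fallback


# CPU tensors (and tensors that require grad) always take the PyTorch fallback:
x, w = torch.randn(2, 8, 4096), torch.ones(4096)
print(dispatch_rms_norm(x, w, 1e-6).shape)  # torch.Size([2, 8, 4096])
```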
triton_kernels.py
ADDED
@@ -0,0 +1,125 @@
+# Copyright (c) Alibaba Cloud.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+# This module provides ApplyRoPE and RMSNorm kernels written in OpenAI Triton.
+# Feel free to contact the contributors if you have any questions or issues regarding this code.
+# Contributors: Shangming Cai, Zihan Wang
+# Contacts: csmthu@gmail.com, wzh1999_frog@126.com
+
+from typing import Any, Callable, Dict, Hashable, Tuple
+
+import torch
+import triton
+import triton.language as tl
+from triton.compiler import CompiledKernel
+from triton.runtime import JITFunction
+
+try:
+    import triton.language.math as tlmath  # Triton 2.1
+except ImportError:
+    import triton.language.libdevice as tlmath  # Triton 2.0
+
+
+class TritonKernel:
+    def __init__(
+        self,
+        kernel_fn: JITFunction,
+        grid_fn: Callable[[Tuple[Any, ...]], Tuple[int, int, int]],
+    ) -> None:
+        self.kernel_fn_ = kernel_fn
+        self.grid_fn_ = grid_fn
+        self.kernel_cache_: Dict[Hashable, CompiledKernel] = {}
+
+    def run(self, *args, **kwargs):
+        # Set current device
+        input_device = args[0].device
+        prev_dev_idx, cur_dev_idx = -1, torch.cuda.current_device()
+        if input_device.index != cur_dev_idx:
+            prev_dev_idx = cur_dev_idx
+            torch.cuda.set_device(input_device.index)
+
+        # Compute grid
+        grid = self.grid_fn_(args)
+
+        # Use cached kernel if possible
+        kernel_key = (input_device,) + tuple(kwargs.items())
+        if kernel_key in self.kernel_cache_:
+            kernel = self.kernel_cache_[kernel_key]
+            kernel[grid](*args)
+        else:
+            # Compile and store new kernel
+            kernel = self.kernel_fn_[grid](*args, **kwargs)
+            self.kernel_cache_[kernel_key] = kernel
+
+        # Restore previous device
+        torch.cuda.set_device(prev_dev_idx)
+
+
+@triton.jit
+def _apply_rope_fwd_kernel(X, Cos, Sin, Y, HEAD_DIM: tl.constexpr):
+    batch_idx, tok_idx, head_idx = tl.program_id(0), tl.program_id(1), tl.program_id(2)
+    seq_len, num_heads = tl.num_programs(1), tl.num_programs(2)
+    block_idx = tl.arange(0, HEAD_DIM)
+    x_base_idx = ((batch_idx * seq_len + tok_idx) * num_heads * 3 + head_idx) * HEAD_DIM
+    x = tl.load(X + x_base_idx + block_idx)
+    freq_idx = tok_idx * HEAD_DIM + block_idx
+    cos = tl.load(Cos + freq_idx)
+    rot_idx = (HEAD_DIM // 2 + block_idx) % HEAD_DIM
+    x_rot = tl.load(X + x_base_idx + rot_idx)
+    x_rot = tl.where(block_idx >= HEAD_DIM // 2, x_rot, -x_rot)
+    sin = tl.load(Sin + freq_idx)
+    y_idx = (
+        (batch_idx * seq_len + tok_idx) * num_heads + head_idx
+    ) * HEAD_DIM + block_idx
+    y = x * cos + x_rot * sin
+    tl.store(Y + y_idx, y.to(Y.dtype.element_ty))
+
+
+apply_rope_fwd_kernel = TritonKernel(
+    _apply_rope_fwd_kernel, lambda args: tuple(args[0].shape[:3])
+)
+
+
+def apply_rotary_emb(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
+    y = torch.empty(x.shape, dtype=x.dtype, device=x.device)
+    apply_rope_fwd_kernel.run(x, cos, sin, y, HEAD_DIM=x.size(-1))
+    return y
+
+
+@triton.jit
+def _rms_norm_fwd_kernel(X, W, Y, eps, hidden_dim, BLOCK_SIZE: tl.constexpr):
+    tok_idx = tl.program_id(0)
+
+    mean_sq = tl.zeros([BLOCK_SIZE], tl.float32)
+    for offset in range(0, hidden_dim, BLOCK_SIZE):
+        dim_idx = offset + tl.arange(0, BLOCK_SIZE)
+        x = tl.load(
+            X + tok_idx * hidden_dim + dim_idx, mask=dim_idx < hidden_dim, other=0
+        ).to(tl.float32)
+        mean_sq += x * x / hidden_dim
+    rrms = tlmath.rsqrt(tl.sum(mean_sq, 0) + eps)
+
+    for offset in range(0, hidden_dim, BLOCK_SIZE):
+        dim_idx = offset + tl.arange(0, BLOCK_SIZE)
+        dim_mask = dim_idx < hidden_dim
+        hidden_idx = tok_idx * hidden_dim + dim_idx
+        x = tl.load(X + hidden_idx, mask=dim_mask, other=0)
+        w = tl.load(W + dim_idx, mask=dim_mask, other=0)
+        y = x * rrms * w
+        tl.store(Y + hidden_idx, y.to(Y.dtype.element_ty), mask=dim_mask)
+
+
+rms_norm_fwd_kernel = TritonKernel(
+    _rms_norm_fwd_kernel, lambda args: (args[0].shape[:-1].numel(), 1, 1)
+)
+
+
+def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float):
+    y = torch.empty_like(x)
+    hidden_dim = x.size(-1)
+    rms_norm_fwd_kernel.run(
+        x, weight, y, eps, hidden_dim, BLOCK_SIZE=triton.next_power_of_2(hidden_dim)
+    )
+    return y