Shangming Cai committed
Commit 575c4e9
1 Parent(s): 08c8530

Update README of branch dev_triton.

Files changed (2)
  1. README.md +27 -0
  2. triton_kernels.py +10 -0
README.md CHANGED
@@ -67,6 +67,14 @@ cd flash-attention && pip install .
 # pip install csrc/layer_norm
 # pip install csrc/rotary
 ```
+
+ If you need higher inference performance but the optional acceleration kernels above (`layer_norm` and `rotary`) fail to install, or your GPU does not meet the NVIDIA Ampere/Ada/Hopper architecture requirement of the `flash-attention` library, you can try the Triton-based inference acceleration solution in this branch. It supports a wider range of GPU products and requires no installation. You can enable it by setting `use_triton` to true in config.json.
+
+ **(In the dev_triton branch, `use_triton` is set to 'auto' by default. Since Triton is installed by default with PyTorch 2.0 and above, this acceleration can be enabled directly, without any additional installation or configuration. If you prefer not to activate this optimization, set `use_triton` to false in config.json.)**
 <br>
 
 
@@ -140,6 +148,25 @@ In detail, the setting of profiling is generating 8192 new tokens with 1 context
 
 Note: The generation speed of the Int4/Int8 models mentioned above is provided by the autogptq library. The current speed of the model loaded using "AutoModelForCausalLM.from_pretrained" will be approximately 20% slower. We have reported this issue to the HuggingFace team and will update it promptly if a solution is available.
 
+ In addition, we measured the average inference speed (tokens/s) of the Qwen-7B-Chat-Int4 model when generating 2048 and 8192 tokens on different GPU devices and with different acceleration methods. All results were obtained with PyTorch 2.1.0 and CUDA 11.8.
+
+ | GPU Device | Method | Speed (2048 tokens) | Speed (8192 tokens) |
+ | :--------: | :----------: | :-----------------: | :-----------------: |
+ | A10 | FlashAttn v2 | 41.28 | 30.78 |
+ | A10 | Triton | 49.04 | 29.17 |
+ | A10 | Disabled | 39.26 | 26.81 |
+ | V100 | FlashAttn v2 | N/A | N/A |
+ | V100 | Triton | 37.01 | 27.66 |
+ | V100 | Disabled | 24.47 | 20.40 |
+ | P100 | FlashAttn v2 | N/A | N/A |
+ | P100 | Triton | 29.03 | 13.85 |
+ | P100 | Disabled | 20.50 | 12.73 |
+ | T4 | FlashAttn v2 | N/A | N/A |
+ | T4 | Triton | 27.98 | 15.22 |
+ | T4 | Disabled | 23.11 | 14.55 |
+
 ### GPU Memory Usage
 
 We also profiled the peak GPU memory usage when encoding 2048 tokens and generating 8192 tokens under different model precisions. (GPU memory consumption is similar with or without FlashAttn enabled.) The results are shown below:
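For reference, below is a minimal sketch of how the `use_triton` switch described in the README changes above could be toggled at load time. Only standard `transformers` APIs are used; the checkpoint path is illustrative, and overriding the attribute on the loaded config object is assumed to be equivalent to editing the `use_triton` field in the checkpoint's config.json.

```python
# Hedged sketch: toggling the Triton acceleration described in the README above.
# Assumes a checkpoint from this branch whose config.json exposes `use_triton`
# (true / false / "auto"); the model path is an example, not prescriptive.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_dir = "Qwen/Qwen-7B-Chat-Int4"  # example checkpoint

config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
config.use_triton = True  # or False to disable; the dev_triton default is "auto"

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    config=config,
    device_map="auto",
    trust_remote_code=True,
).eval()

# Qwen's remote-code chat helper; generation now uses the selected kernels.
response, _ = model.chat(tokenizer, "Hello", history=None)
print(response)
```

Editing the `use_triton` field directly in config.json, as the README instructs, achieves the same effect without any code changes.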
triton_kernels.py CHANGED
@@ -1,3 +1,13 @@
+ # Copyright (c) Alibaba Cloud.
+ #
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ # This module provides ApplyRoPE and RMSNorm kernels written in OpenAI Triton.
+ # Feel free to contact the contributors if you have any questions or issues regarding this code.
+ # Contributors: Shangming Cai, Zihan Wang
+ # Contacts: csmthu@gmail.com, wzh1999_frog@126.com
+
 from typing import Any, Callable, Dict, Hashable, Tuple
 
 import torch
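For context on what the new module header advertises, here is a minimal, self-contained RMSNorm kernel in OpenAI Triton. It is a sketch of the general technique under stated assumptions, not the implementation shipped in triton_kernels.py: the function names, the one-program-per-row layout, and the eps default are all illustrative, and the ApplyRoPE kernel would follow the same pattern.

```python
# Illustrative sketch only: a row-wise RMSNorm kernel in OpenAI Triton.
# Not the kernel from triton_kernels.py; names and tiling choices are assumed.
import torch
import triton
import triton.language as tl


@triton.jit
def _rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # Each program instance normalizes one row of a contiguous (n_rows, n_cols) tensor.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: y = x / sqrt(mean(x^2) + eps) * weight, computed in float32.
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = (x / rms) * w
    tl.store(out_ptr + row * n_cols + cols, y.to(out_ptr.dtype.element_ty), mask=mask)


def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: (n_rows, n_cols) contiguous CUDA tensor; weight: (n_cols,).
    assert x.is_cuda and x.is_contiguous() and x.ndim == 2
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    # BLOCK_SIZE must be a power of two large enough to cover one full row.
    block_size = triton.next_power_of_2(n_cols)
    _rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK_SIZE=block_size)
    return out


if __name__ == "__main__":
    x = torch.randn(4, 4096, device="cuda", dtype=torch.float16)
    w = torch.ones(4096, device="cuda", dtype=torch.float16)
    # Check against a plain PyTorch reference implementation.
    ref = (x.float() / torch.sqrt(x.float().pow(2).mean(-1, keepdim=True) + 1e-6)) * w.float()
    print(torch.allclose(rmsnorm(x, w).float(), ref, atol=1e-2, rtol=1e-2))
```

Fusing the reduction and the scaling into one kernel avoids the separate launches and extra memory round-trips of an op-by-op PyTorch implementation, which is the kind of saving the speed table above reflects on GPUs where flash-attention is unavailable.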