JustinLin610 committed on
Commit
1c7d721
1 Parent(s): 2cf2e83

update README.md (#12)


- update README.md (92449b352ef8de40b67b1db228705a6105226106)

Files changed (1): README.md +4 -0
README.md CHANGED
@@ -67,6 +67,10 @@ cd flash-attention && pip install .
 # pip install csrc/layer_norm
 # pip install csrc/rotary
 ```
+
+If you need higher inference performance but the optional acceleration components `layer_norm` and `rotary` above fail to install, or if the GPU you are using does not meet the NVIDIA Ampere/Ada/Hopper architecture required by the `flash-attention` library, you can try switching to the dev_triton branch and using the Triton-based inference acceleration solution there. It supports a wider range of GPUs and is natively supported in PyTorch 2.0 and above, with no extra installation needed.
+
+If you require higher inference performance but encounter problems installing the optional acceleration features (i.e., `layer_norm` and `rotary`), or if your GPU does not meet the NVIDIA Ampere/Ada/Hopper architecture required by the `flash-attention` library, you can switch to the dev_triton branch and try the Triton-based inference acceleration solution implemented there. It adapts to a wider range of GPU products and requires no extra package installation with PyTorch 2.0 and above.
+
 <br>
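
The architecture check implied by the added paragraph can be sketched in Python. This is a minimal, hypothetical helper (not part of the README or the commit) that uses `torch.cuda.get_device_capability` to test whether the local GPU is Ampere/Ada/Hopper class, i.e. CUDA compute capability 8.x or 9.x, which is what `flash-attention` requires; on older GPUs, the Triton-based solution on the dev_triton branch would be the suggested fallback.

```python
# Hypothetical helper: decide whether flash-attention's GPU architecture
# requirement (Ampere 8.0 / Ada 8.9 / Hopper 9.0, i.e. compute capability
# major version >= 8) is met on this machine.
try:
    import torch
    HAS_TORCH = True
except ImportError:  # torch not installed: treat as unsupported
    HAS_TORCH = False


def supports_flash_attention() -> bool:
    """Return True if a CUDA GPU with compute capability >= 8.0 is present."""
    if not HAS_TORCH or not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8


if __name__ == "__main__":
    if supports_flash_attention():
        print("GPU meets flash-attention's architecture requirement")
    else:
        print("Consider the Triton-based solution on the dev_triton branch")
```

On machines without a CUDA GPU (or without torch), the helper simply reports `False`, steering toward the Triton path rather than raising.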