Add ApplyRoPE and RMSNorm kernels written in OpenAI Triton
#8
by
wangzihan99
- opened
This PR adds ApplyRoPE and RMSNorm kernels written in OpenAI Triton. Compared with flash-attn, these kernels offer better performance, support a wider range of GPU architectures (including V100 and T4), and require no pre-compilation. They are enabled automatically if Triton is installed (usually bundled with PyTorch 2.x).
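
For readers who want to see what the two operations compute and how an automatic fallback might be chosen, here is a minimal plain-Python sketch. It is not the code from this PR: the names `HAS_TRITON`, `select_backend`, `rms_norm_ref`, and `apply_rope_ref` are hypothetical, and the `eps` default, the RoPE `base` of 10000, and the rotate-half layout are assumptions chosen for illustration.

```python
import importlib.util
import math

# Assumption: the fast path is chosen by checking whether the `triton`
# package is importable. The PR's actual detection logic is not shown.
HAS_TRITON = importlib.util.find_spec("triton") is not None


def select_backend():
    """Return the name of the backend this sketch would pick (hypothetical)."""
    return "triton" if HAS_TRITON else "reference"


def rms_norm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm for one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def apply_rope_ref(x, position, base=10000.0):
    """Reference RoPE for one head vector of even length.

    Assumes the rotate-half layout: element i is paired with element
    i + len(x) // 2, and each pair is rotated by position * base^(-2i/d).
    """
    dim = len(x)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


if __name__ == "__main__":
    print("backend:", select_backend())
    print(rms_norm_ref([3.0, 4.0], [1.0, 1.0]))
    print(apply_rope_ref([1.0, 0.0, 0.0, 1.0], position=2))
```

A reference like this is mainly useful as a correctness check: a fused kernel's output can be compared against it on small inputs before relying on the faster path.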
wangzihan99
changed pull request status to
open
wangzihan99
changed pull request status to
closed