Add ApplyRoPE and RMSNorm kernels written in OpenAI Triton

#10
Cheshire94 changed pull request title from pr/9 to pr/10

This PR adds ApplyRoPE and RMSNorm kernels written in OpenAI Triton. Compared with the flash-attn kernels, they offer better performance, support a wider range of GPU architectures (including V100 and T4), and require no pre-compilation. They are enabled automatically if Triton is installed (it is usually bundled with PyTorch 2.x).
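For readers unfamiliar with the two operations, here is a minimal pure-Python sketch of what the kernels compute and of how an "enable if Triton is installed" check might look. It is not the code in this PR: the function names (`triton_available`, `pick_backend`, `rms_norm`, `apply_rope`), the `eps` and `base` defaults, and the "rotate-half" pairing used for RoPE are all assumptions made for illustration.

```python
import importlib.util
import math


def triton_available() -> bool:
    """Return True if a `triton` package can be found (hypothetical helper)."""
    return importlib.util.find_spec("triton") is not None


def pick_backend() -> str:
    """Choose the Triton path only when Triton is importable (hypothetical)."""
    return "triton" if triton_available() else "fallback"


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm for one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def apply_rope(x, position, base=10000.0):
    """Reference rotary embedding for one head vector of even length.

    Assumption: "rotate-half" layout, where element i is paired with
    element i + d/2. Other implementations pair adjacent elements instead.
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Each pair rotates by an angle that grows with the token position
        # and shrinks with the pair index.
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out
```

A fused GPU kernel would perform the same arithmetic per row or per head in a single pass; the sketch only pins down the math so the outputs of either backend can be compared against it.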

Cheshire94 changed pull request title from pr/10 to Add ApplyRoPE and RMSNorm kernels written in OpenAI Triton
Cheshire94 changed pull request status to closed
