Add ApplyRoPE and RMSNorm kernels written in OpenAI Triton

#8

This PR adds ApplyRoPE and RMSNorm kernels written in OpenAI Triton. Compared with the flash-attn implementations, these kernels offer better performance, support a wider range of GPU architectures (including V100 and T4), and require no pre-compilation. They are enabled automatically when Triton is installed (it is usually bundled with PyTorch 2.x).
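For reviewers, here is a minimal pure-Python sketch of what the two operations compute and of how an "enable if Triton is installed" check could look. It is not the code in this PR: the names `HAS_TRITON`, `select_backend`, `rms_norm_ref`, and `apply_rope_ref` are hypothetical, and the `eps` default, the RoPE `base` of 10000, and the "rotate half" pairing are assumptions that the PR description does not confirm.

```python
import importlib.util
import math

# Assumption: availability is detected by checking whether `triton` is importable.
HAS_TRITON = importlib.util.find_spec("triton") is not None


def select_backend():
    """Return the name of the kernel path that would be used (hypothetical helper)."""
    return "triton" if HAS_TRITON else "fallback"


def rms_norm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm for one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def apply_rope_ref(x, position, base=10000.0):
    """Reference rotary embedding for one head vector of even length.

    Assumes the "rotate half" convention, where element i is paired with
    element i + d/2 and each pair is rotated by position * base^(-2i/d).
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out
```

A reference like this is mainly useful for checking a fused kernel's output numerically on small inputs. The rotation leaves the vector norm unchanged, and position 0 returns the input as-is, which makes for cheap sanity checks.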

wangzihan99 changed pull request status to open
wangzihan99 changed pull request status to closed