FlashRT Linear Attention Primitives
This package provides CUDA kernels for BF16 small-M linear projections, plus layout and gating helpers for linear attention. It targets transformer decode and verify hot paths, where the equivalent sequence of PyTorch ops incurs excessive kernel-launch overhead and memory traffic.
Available Functions
- bf16_matvec
- bf16_smallm_matmul
- split_qkv_broadcast_bf16
- partial_rope_qk_bf16
- gated_delta_prepare_bf16
Usage
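from kernels import get_kernel

# Load the kernel package from the Hugging Face Hub.
lap = get_kernel("flashrt/linear-attention-primitives")

# BF16 small-M projection on existing tensors x and weight.
out = lap.bf16_matvec(x, weight)

# Split a packed QKV tensor into separate q, k, v tensors.
q, k, v = lap.split_qkv_broadcast_bf16(packed, 16, 16, 48, 128)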
These functions take and return tensors; they are not the pointer-based APIs used internally by FlashRT serving. They also accept preallocated output tensors, which suits static-buffer runtimes.
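When checking the output of a call such as `bf16_matvec`, a slow reference can be useful. The sketch below is a pure-Python illustration and not part of this package. The helper names `to_bf16` and `bf16_matvec_reference` are hypothetical, and the weight layout (`[out_features][in_features]`), the round-to-nearest-even conversion, and the accumulation precision are assumptions that the kernel documentation does not confirm.

```python
import struct


def to_bf16(value: float) -> float:
    """Round a float to the nearest BF16 value (ties to even).

    Hypothetical helper: assumes a finite input that fits in float32;
    NaN and infinity are not handled specially.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # BF16 keeps the top 16 bits of the float32 pattern. Adding 0x7FFF
    # plus the lowest kept bit rounds to nearest, with ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_matvec_reference(x, weight):
    """Reference y = W @ x with BF16 inputs and a BF16 result.

    Assumption: `weight` is a list of rows, one per output feature.
    Accumulation uses Python floats (double precision), which may
    differ from what the CUDA kernel does.
    """
    x_q = [to_bf16(v) for v in x]
    out = []
    for row in weight:
        if len(row) != len(x_q):
            raise ValueError("each weight row must match len(x)")
        acc = sum(to_bf16(w) * xv for w, xv in zip(row, x_q))
        out.append(to_bf16(acc))
    return out
```

A comparison against the kernel would copy the kernel output to the host and check it against this reference within a tolerance, since accumulation order and precision can differ.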
Scope
The first release covers the strict source-validated subset used by FlashRT runtime experiments. It does not package generic FlashAttention, which is already available in the Hugging Face kernels ecosystem.
Tags: cuda, flashrt, transformers, linear-attention, bf16, apache-2.0
Supported hardware: CUDA (OS: linux, Arch: x86_64)





