FlashRT FP8 KV Attention

Native CUDA XQA attention for BF16 queries over a paged FP8 E4M3 K/V cache.

Available Functions

  • xqa_bf16_fp8kv
  • causal_spec_mask
  • default_page_table
  • allocate_workspace

Scope

v1 is a fixed-shape public package for the production Qwen3.6-style path:

  • BF16 Q/O
  • FP8 E4M3 K/V cache
  • 24 Q heads, 4 KV heads, head dim 256
  • page size 128
  • speculative/decode q_seq <= 32
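The fixed shape above can be checked before calling the kernel. The following is a minimal sketch in plain Python; `check_v1_config` and `pages_needed` are hypothetical helper names for illustration and are not part of the package. The constants are copied from the list above.

```python
# Fixed v1 configuration, copied from the Scope list.
NUM_Q_HEADS = 24
NUM_KV_HEADS = 4
HEAD_DIM = 256
PAGE_SIZE = 128
MAX_Q_SEQ = 32

# Each KV head is shared by this many query heads (grouped-query attention).
GQA_GROUP = NUM_Q_HEADS // NUM_KV_HEADS  # 6


def check_v1_config(num_q_heads, num_kv_heads, head_dim, page_size, q_seq):
    """Hypothetical helper: list every mismatch against the v1 fixed shape."""
    problems = []
    if num_q_heads != NUM_Q_HEADS:
        problems.append(f"expected {NUM_Q_HEADS} Q heads, got {num_q_heads}")
    if num_kv_heads != NUM_KV_HEADS:
        problems.append(f"expected {NUM_KV_HEADS} KV heads, got {num_kv_heads}")
    if head_dim != HEAD_DIM:
        problems.append(f"expected head dim {HEAD_DIM}, got {head_dim}")
    if page_size != PAGE_SIZE:
        problems.append(f"expected page size {PAGE_SIZE}, got {page_size}")
    if not 1 <= q_seq <= MAX_Q_SEQ:
        problems.append(f"q_seq must be in 1..{MAX_Q_SEQ}, got {q_seq}")
    return problems


def pages_needed(seq_len, page_size=PAGE_SIZE):
    """Number of cache pages that hold seq_len tokens (ceiling division)."""
    return -(-seq_len // page_size)
```

An empty list from `check_v1_config` means the configuration matches the v1 path; any other model shape falls outside what this package covers.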

This is not a generic FlashAttention replacement. It is the direct FP8-KV XQA decode/verify kernel used to keep long-context transformer runtimes off BF16 KV cache bandwidth.
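The bandwidth argument can be made concrete with a little arithmetic. This sketch counts only the raw K and V elements for the v1 shape (4 KV heads, head dim 256) in a single attention layer; it assumes no extra bytes for scale factors or padding, which the source does not specify. `kv_bytes_per_token` is a hypothetical name used for illustration.

```python
def kv_bytes_per_token(num_kv_heads=4, head_dim=256, bytes_per_elem=1):
    """Raw K plus V bytes stored per token for one attention layer."""
    return 2 * num_kv_heads * head_dim * bytes_per_elem


fp8_bytes = kv_bytes_per_token(bytes_per_elem=1)   # FP8 E4M3: 1 byte per element
bf16_bytes = kv_bytes_per_token(bytes_per_elem=2)  # BF16: 2 bytes per element

# Cache size for one layer at a 131072-token context, in MiB.
context = 131072
fp8_mib = fp8_bytes * context / 2**20
bf16_mib = bf16_bytes * context / 2**20

print(fp8_bytes, bf16_bytes)  # 2048 4096
print(fp8_mib, bf16_mib)      # 256.0 512.0
```

Under these assumptions, each decode step over a long context reads half as many cache bytes with FP8 as with BF16.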

Minimal Usage

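from kernels import get_kernel

# Load the kernel package from the Hub.
attn = get_kernel("flashrt/fp8-kv-attention", trust_remote_code=True)

# q_bf16: BF16 queries; k_cache_fp8 / v_cache_fp8: paged FP8 E4M3 K/V cache.
out = attn.xqa_bf16_fp8kv(q_bf16, k_cache_fp8, v_cache_fp8)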

For CUDA Graph or other static-buffer runtimes, pass the page_table, seq_lens, mask, out, semaphores, and scratch tensors explicitly.
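The signatures and tensor formats of `default_page_table` and `causal_spec_mask` are not documented here, so the sketch below does not call them. It is a plain-Python illustration of what a contiguous page table and a causal mask for speculative verification conventionally look like. Both function names are hypothetical, and the mask assumes the `q_seq` query tokens occupy the last `q_seq` positions of the sequence.

```python
def sketch_page_table(batch, pages_per_seq):
    """Contiguous layout: sequence b owns pages b*pages_per_seq onward."""
    return [
        [b * pages_per_seq + p for p in range(pages_per_seq)]
        for b in range(batch)
    ]


def sketch_causal_mask(q_seq, seq_len):
    """mask[i][j] is True when query token i may attend to cache position j.

    Assumption: the q_seq query tokens are the final q_seq positions of the
    sequence, so query i sits at absolute position seq_len - q_seq + i.
    """
    offset = seq_len - q_seq
    return [[j <= offset + i for j in range(seq_len)] for i in range(q_seq)]


table = sketch_page_table(batch=2, pages_per_seq=3)
mask = sketch_causal_mask(q_seq=2, seq_len=4)
print(table)  # [[0, 1, 2], [3, 4, 5]]
print(mask)   # [[True, True, True, False], [True, True, True, True]]
```

The package's own helpers may use a different dtype, packing, or layout; check their output before building static buffers around it.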

Tags: cuda, flashrt, attention, fp8, kv-cache, transformers, apache-2.0

Supported Hardware

CUDA 12.0

  • DGX Spark: GB10 (128GB)
  • GPU: RTX PRO 6000 WS (96GB), RTX PRO 6000 Max-Q (96GB), RTX PRO 5000 (48GB), RTX PRO 4500 WS (32GB), RTX PRO 4000 (24GB), RTX PRO 4000 SFF (24GB), RTX PRO 2000 (16GB)
  • RTX: RTX 5090 (32GB), RTX 5090 D (32GB), RTX 5090 Mobile (24GB), RTX 5080 (16GB), RTX 5080 Mobile (16GB), RTX 5070 (12GB), RTX 5070 Mobile (8GB), RTX 5070 Ti (16GB), RTX 5070 Ti Mobile (12GB), RTX 5060 Ti (16GB), RTX 5060 (8GB), RTX 5060 Mobile (8GB)

  • OS: linux
  • Arch: x86_64