flashrt/fp4-fused-ops

FlashRT fused FP16-to-NVFP4 producer kernels for transformer and diffuser low-bit paths.

Functions

  • sfa_size_bytes
  • rms_norm_fp4_sfa_fp16
  • residual_add_rms_norm_fp4_sfa_fp16
  • residual_add_rms_norm_fp4_sfa_v2_fp16
  • residual_add_rms_norm_mul_fp4_sfa_fp16
  • silu_mul_fp4_sfa_fp16
  • silu_mul_fp4_sfa_v2_fp16
  • silu_mul_mul_fp4_sfa_v2_fp16
  • silu_mul_two_fp4_to_fp4
  • silu_mul_two_mul_fp4_to_fp4
  • dequantize_fp4_sfa_fp16

This package targets Blackwell sm_120a and uses CUTLASS/CuTe SFA (scale factors for operand A) layouts.
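Because the kernels are built for sm_120a, a coarse pre-flight check on the device's compute capability can catch an unsupported GPU before a kernel launch fails. The sketch below is only an illustration: the helper names are hypothetical, and treating "major version 12" as the gate is an assumption, not a guarantee that a given device can run these kernels. `torch.cuda.is_available` and `torch.cuda.get_device_capability` are standard PyTorch calls.

```python
def looks_like_sm12x(capability):
    """Return True when a (major, minor) compute capability has major version 12.

    ASSUMPTION: major == 12 is used as a coarse stand-in for "sm_120a capable".
    """
    major, _minor = capability
    return major == 12


def current_device_is_candidate():
    """Check the active CUDA device through PyTorch; False when CUDA is absent."""
    import torch  # imported lazily so looks_like_sm12x works without PyTorch

    if not torch.cuda.is_available():
        return False
    return looks_like_sm12x(torch.cuda.get_device_capability())
```

Calling `current_device_is_candidate()` before `get_kernel` lets a script fall back to a non-FP4 path early instead of failing inside a kernel.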

Example

from kernels import get_kernel
import torch

ops = get_kernel("flashrt/fp4-fused-ops", version=1, trust_remote_code=True)

merged = torch.randn((16, 4096), device="cuda", dtype=torch.float16)
packed, sfa = ops.silu_mul_fp4_sfa_v2_fp16(merged)

# Debug only; normal low-bit pipelines should pass packed/SFA to FP4 GEMM.
fp16_view = ops.dequantize_fp4_sfa_fp16(packed, sfa)
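
When debugging, it can also help to inspect individual bytes of the packed output by hand. The sketch below decodes the standard 4-bit E2M1 codes (1 sign bit, 2 exponent bits, 1 mantissa bit) that FP4 uses. The helper names are hypothetical, the element order inside a byte is an assumption, and the decoded numbers are unscaled: the real value also depends on the block scale stored in the SFA buffer, which this sketch does not read.

```python
# Magnitudes of the eight E2M1 codes (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (bit 3 is the sign) to an unscaled float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_byte(byte, low_nibble_first=True):
    """Split one packed uint8 into its two decoded, unscaled FP4 values.

    ASSUMPTION: low_nibble_first=True guesses that the first element sits in
    the low 4 bits; flip it if the decoded values look pairwise swapped.
    """
    low, high = byte & 0xF, (byte >> 4) & 0xF
    first, second = (low, high) if low_nibble_first else (high, low)
    return decode_e2m1(first), decode_e2m1(second)
```

For example, `unpack_byte(int(packed.flatten()[0]))` would decode the first byte of the `packed` tensor from the example above; comparing a few such values against `dequantize_fp4_sfa_fp16` is one way to confirm the nibble order.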

Shape Contract

  • CUDA tensors only.
  • FP16 producer inputs, uint8 FP4 packed outputs, uint8 CUTLASS SFA buffers.
  • Dimensions must be divisible by 16.
  • v1 RMS producer paths support dim <= 2048.
  • Larger residual/RMS producer shapes should use residual_add_rms_norm_fp4_sfa_v2_fp16.
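
The contract above can be checked on the host before any kernel is called. The following is a minimal sketch with hypothetical helper names. It assumes the divisibility rule applies to every dimension, and it picks the v1 residual/RMS producer whenever dim <= 2048, which is a choice made for illustration rather than a documented recommendation. The size helpers rely on two general facts: FP4 packs two values per uint8, and NVFP4 shares one scale factor per 16-element block. CUTLASS SFA buffers may be padded beyond that minimum, so `sfa_size_bytes` remains the authoritative size.

```python
V1_RMS_MAX_DIM = 2048  # v1 RMS producer limit from the shape contract
ALIGNMENT = 16         # divisibility requirement from the shape contract
NVFP4_BLOCK = 16       # elements that share one scale factor in NVFP4


def check_shape(shape):
    """Raise ValueError unless every dimension is a positive multiple of 16."""
    for size in shape:
        if size <= 0 or size % ALIGNMENT != 0:
            raise ValueError(
                f"dimension {size} in {tuple(shape)} is not a positive "
                f"multiple of {ALIGNMENT}"
            )


def pick_residual_rms_producer(dim):
    """Return the name of a residual/RMS producer that supports `dim`."""
    if dim <= V1_RMS_MAX_DIM:
        return "residual_add_rms_norm_fp4_sfa_fp16"
    return "residual_add_rms_norm_fp4_sfa_v2_fp16"


def packed_nbytes(rows, cols):
    """Bytes for rows * cols FP4 values at two values per uint8."""
    return rows * cols // 2


def min_scale_factors(rows, cols):
    """Unpadded scale-factor count: one per 16-element block in each row."""
    return rows * (cols // NVFP4_BLOCK)
```

With these helpers, `getattr(ops, pick_residual_rms_producer(dim))` would select the producer by name from the loaded kernel module.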

Supported Hardware

  • CUDA: 12.0a
  • DGX Spark: GB10 (128GB)
  • GPU: RTX PRO 6000 WS (96GB), RTX PRO 6000 Max-Q (96GB), RTX PRO 5000 (48GB), RTX PRO 4500 WS (32GB), RTX PRO 4000 (24GB), RTX PRO 4000 SFF (24GB), RTX PRO 2000 (16GB)
  • RTX: RTX 5090 (32GB), RTX 5090 D (32GB), RTX 5090 Mobile (24GB), RTX 5080 (16GB), RTX 5080 Mobile (16GB), RTX 5070 (12GB), RTX 5070 Mobile (8GB), RTX 5070 Ti (16GB), RTX 5070 Ti Mobile (12GB), RTX 5060 Ti (16GB), RTX 5060 (8GB), RTX 5060 Mobile (8GB)
  • OS: linux
  • Arch: x86_64