flashrt/fp4-fused-ops

FlashRT fused FP16-to-NVFP4 producer kernels for transformer and diffuser low-bit paths.

Functions

sfa_size_bytes
rms_norm_fp4_sfa_fp16
residual_add_rms_norm_fp4_sfa_fp16
residual_add_rms_norm_fp4_sfa_v2_fp16
residual_add_rms_norm_mul_fp4_sfa_fp16
silu_mul_fp4_sfa_fp16
silu_mul_fp4_sfa_v2_fp16
silu_mul_mul_fp4_sfa_v2_fp16
silu_mul_two_fp4_to_fp4
silu_mul_two_mul_fp4_to_fp4
dequantize_fp4_sfa_fp16

This package targets Blackwell sm_120a and uses CUTLASS/CuTe SFA (scale factor A) layouts.
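For orientation, the sizes implied by the NVFP4 format can be worked out by hand. The sketch below assumes the usual NVFP4 convention: 4-bit E2M1 values packed two per byte, with one 1-byte scale factor per 16-element block. The helper name `nvfp4_logical_sizes` is hypothetical and not part of this package. The CUTLASS SFA buffer may be padded or reordered relative to this logical count, so the listed sfa_size_bytes function remains the authority on allocation size.

```python
FP4_BITS = 4   # E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit
BLOCK = 16     # assumed NVFP4 block size: elements sharing one scale factor


def nvfp4_logical_sizes(rows: int, cols: int) -> dict:
    """Hypothetical helper: logical byte/scale counts for a rows x cols tensor.

    This is plain arithmetic on the format, not a call into the kernel package.
    """
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    if cols % BLOCK != 0:
        raise ValueError(f"cols={cols} is not divisible by {BLOCK}")
    packed_bytes = rows * cols * FP4_BITS // 8   # two FP4 values per uint8
    scale_factors = rows * (cols // BLOCK)       # one scale per 16-element block
    return {"packed_bytes": packed_bytes, "logical_scale_factors": scale_factors}


sizes = nvfp4_logical_sizes(16, 4096)
```

Treat the scale-factor count as a lower bound when comparing against a real SFA buffer.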

Example

from kernels import get_kernel
import torch

ops = get_kernel("flashrt/fp4-fused-ops", version=1, trust_remote_code=True)

merged = torch.randn((16, 4096), device="cuda", dtype=torch.float16)
packed, sfa = ops.silu_mul_fp4_sfa_v2_fp16(merged)

# Debug only; normal low-bit pipelines should pass packed/SFA to FP4 GEMM.
fp16_view = ops.dequantize_fp4_sfa_fp16(packed, sfa)
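When inspecting the dequantized debug view, it helps to know how coarse FP4 is. The following pure-Python sketch fake-quantizes a single 16-element block onto the E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) using one block scale. It is an illustration only: the function name `fake_quantize_block` is hypothetical, the scale is kept in full precision rather than rounded to an 8-bit format, and tie-breaking may differ from what the kernels do.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value (the sign is a separate bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_block(block):
    """Hypothetical reference: round one block to the scaled E2M1 grid.

    Assumes the block scale maps the largest magnitude onto 6.0.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_MAGNITUDES[-1]
    out = []
    for x in block:
        # Pick the nearest representable magnitude in scaled units.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        out.append(math.copysign(mag * scale, x))
    return out


block = [6.0, 3.1, -0.2, 1.4] + [0.0] * 12
approx = fake_quantize_block(block)
max_abs_err = max(abs(a - b) for a, b in zip(block, approx))
```

Errors of this size relative to the block maximum are expected from the format itself and do not by themselves indicate a kernel bug.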

Shape Contract

CUDA tensors only.
FP16 producer inputs, uint8 FP4 packed outputs, uint8 CUTLASS SFA buffers.
Dimensions must be divisible by 16.
v1 RMS producer paths support dim <= 2048.
Larger residual/RMS producer shapes should use residual_add_rms_norm_fp4_sfa_v2_fp16.
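The contract above can be checked before launching a kernel. The sketch below is a hypothetical pre-flight helper, not part of this package. It assumes that "dimensions" means every tensor dimension and that "dim" in the v1 RMS limit refers to the last (hidden) dimension; adjust if the kernels interpret these differently. It takes plain Python values so it can run without a GPU.

```python
ALIGN = 16             # every dimension must be a multiple of this
V1_RMS_MAX_DIM = 2048  # v1 RMS producer limit stated in the contract


def check_producer_input(shape, dtype, device_type, v1_rms=False):
    """Hypothetical helper: return a list of contract violations.

    shape: tuple of ints, e.g. tuple(t.shape)
    dtype: string form of the dtype, e.g. str(t.dtype)
    device_type: e.g. t.device.type
    v1_rms: True when targeting a v1 RMS producer path
    """
    problems = []
    if device_type != "cuda":
        problems.append("tensor must be on a CUDA device")
    if dtype != "torch.float16":
        problems.append("producer inputs must be FP16")
    for i, d in enumerate(shape):
        if d % ALIGN != 0:
            problems.append(f"dimension {i} ({d}) is not divisible by {ALIGN}")
    # Assumption: the v1 limit applies to the last dimension.
    if v1_rms and shape and shape[-1] > V1_RMS_MAX_DIM:
        problems.append(
            f"last dimension {shape[-1]} exceeds {V1_RMS_MAX_DIM}; "
            "use residual_add_rms_norm_fp4_sfa_v2_fp16"
        )
    return problems


ok = check_producer_input((16, 4096), "torch.float16", "cuda")
too_wide = check_producer_input((16, 4096), "torch.float16", "cuda", v1_rms=True)
```

With a real tensor `t`, the call would be `check_producer_input(tuple(t.shape), str(t.dtype), t.device.type)`.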


Supported hardware

CUDA: 12.0a

DGX Spark
GB10, 128GB

GPU
RTX PRO 6000 WS, 96GB
RTX PRO 6000 Max-Q, 96GB
RTX PRO 5000, 48GB
RTX PRO 4500 WS, 32GB
RTX PRO 4000, 24GB
RTX PRO 4000 SFF, 24GB
RTX PRO 2000, 16GB

RTX
RTX 5090, 32GB
RTX 5090 D, 32GB
RTX 5090 Mobile, 24GB
RTX 5080, 16GB
RTX 5080 Mobile, 16GB
RTX 5070, 12GB
RTX 5070 Mobile, 8GB
RTX 5070 Ti, 16GB
RTX 5070 Ti Mobile, 12GB
RTX 5060 Ti, 16GB
RTX 5060, 8GB
RTX 5060 Mobile, 8GB

OS: linux

Arch: x86_64