SA3-medium DiT — FP8 TensorRT artifacts

FP8 GEMM-trunk quantization of the Stable Audio 3 medium DiT, built from stabilityai/stable-audio-3-optimized onnx/sa3-m/dit_fp16mixed.onnx with the producer recipe in Stability-AI/stable-audio-3 PR #47 (build/make_calib.py + build/build_dit_fp8.py). This is a derivative of Stability AI's model weights and is distributed under the Stability AI Community License; see the base model for terms.

Contents

onnx/sa3-m/dit_fp8.onnx + dit_fp8.onnx.data: the quantized ONNX. It is architecture-independent; compile it with build_from_onnx.py sa3-m-fp8 (plain STRONGLY_TYPED, no ModelOpt needed).
tensorRT/sm_120/sa3-m/dit_fp8.trt: prebuilt engine for RTX 50xx (sm_120), built with TensorRT 10.16.1.11. TRT engines are not portable across GPU architectures or TRT minor versions; rebuild from the ONNX for anything else.
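
The portability rule above can be turned into a small pre-flight check. This is a minimal sketch, not part of the repository: the function names are hypothetical, and it simply mirrors the card's wording (same GPU architecture, same TensorRT major.minor). In practice the inputs might come from torch.cuda.get_device_capability() and tensorrt.__version__, but that wiring is an assumption and is left out so the snippet runs without a GPU.

```python
from typing import Tuple

# Values stated in this card for the prebuilt engine.
ENGINE_SM: Tuple[int, int] = (12, 0)   # sm_120 (RTX 50xx)
ENGINE_TRT_VERSION = "10.16.1.11"


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a dotted version string such as '10.16.1.11'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def can_use_prebuilt_engine(device_cc: Tuple[int, int], trt_version: str) -> bool:
    """True only if both the GPU architecture and the TensorRT major.minor
    match what the prebuilt engine was built for (hypothetical helper)."""
    same_arch = device_cc == ENGINE_SM
    same_trt = major_minor(trt_version) == major_minor(ENGINE_TRT_VERSION)
    return same_arch and same_trt


if __name__ == "__main__":
    # An sm_89 card must rebuild from the ONNX instead of loading the .trt file.
    for cc, ver in [((12, 0), "10.16.1.11"), ((8, 9), "10.16.1.11")]:
        action = "load prebuilt engine" if can_use_prebuilt_engine(cc, ver) else "rebuild from ONNX"
        print(cc, ver, "->", action)
```

When the check fails, the fallback is the one the card already names: compile dit_fp8.onnx locally with build_from_onnx.py.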

Validation (vs the FP16-mixed engine, 47 prompts x 8 sigmas, L=646, RTX 5090)

worst single-step latent cosine (x + dt*v, n=376): 0.9982
8-step compounded Euler final-latent cosine over 47 prompts: mean 0.953, median 0.957, worst 0.873. The rollout is chaotic: a 1e-3 input perturbation alone compounds to ~0.967, so treat this as a guide, not a gate.
decoded audio under the production pingpong sampler tracks the FP16-mixed generation at ~0.90 RMS-curve correlation (same conditioning and seeds) and was validated by listening; the sm_120 engine published here is the exact engine that passed that test
step latency at B=1, L=646: ~10.6-11.0 ms vs ~18.7-19.4 ms for FP16-mixed (~1.8x faster)
under the stochastic pingpong sampler the engine produces a different but comparable sample
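
For readers who want to reproduce comparable numbers, here is a minimal sketch of the metrics named above. It rests on assumptions: latents are treated as flattened 1-D sequences, the single-step comparison uses the x + dt*v update quoted in the list, and "RMS-curve correlation" is interpreted as the Pearson correlation of frame-wise RMS envelopes over non-overlapping frames. The card does not define that metric precisely, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (undefined for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euler_step(x: Sequence[float], v: Sequence[float], dt: float) -> List[float]:
    """One Euler update x + dt*v, the quantity compared in the single-step test."""
    return [xi + dt * vi for xi, vi in zip(x, v)]


def rms_curve(samples: Sequence[float], frame: int) -> List[float]:
    """RMS of each non-overlapping frame (the frame size is an assumption)."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame]) / frame)
        for i in range(0, len(samples) - frame + 1, frame)
    ]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


if __name__ == "__main__":
    # Toy latent and two slightly different velocity predictions
    # (standing in for the FP16-mixed and FP8 engines).
    x = [1.0, 0.0, -1.0, 0.5]
    v_ref = [0.5, 0.5, -0.5, 0.25]
    v_fp8 = [0.5, 0.49, -0.51, 0.25]
    dt = -0.1
    print("single-step cosine:", cosine(euler_step(x, v_ref, dt), euler_step(x, v_fp8, dt)))
```

Applying pearson to the rms_curve of two decoded waveforms gives one possible reading of the ~0.90 figure; the frame size and correlation type used for the published number are not stated in the card.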

Inputs and outputs are FP32, so the engine is a drop-in replacement for the FP16-mixed DiT engine (sa3_trt --precision fp8, paired with the FP16-mixed decoder).

Model tree for ryanontheinside/stable-audio-3-optimized-fp8

Base model: stabilityai/stable-audio-3-medium-base
Finetuned: stabilityai/stable-audio-3-medium
Quantized: this model
