SA3-medium DiT: FP8 TensorRT artifacts

FP8 GEMM-trunk quantization of the Stable Audio 3 medium DiT, built from stabilityai/stable-audio-3-optimized onnx/sa3-m/dit_fp16mixed.onnx with the producer recipe in Stability-AI/stable-audio-3 PR #47 (build/make_calib.py + build/build_dit_fp8.py). This is a derivative of Stability AI's model weights and is distributed under the Stability AI Community License; see the base model for terms.

Contents

  • onnx/sa3-m/dit_fp8.onnx + dit_fp8.onnx.data: the quantized ONNX (arch-independent; compile with build_from_onnx.py sa3-m-fp8, plain STRONGLY_TYPED, no ModelOpt needed)
  • tensorRT/sm_120/sa3-m/dit_fp8.trt: prebuilt engine for RTX 50xx (sm_120), TensorRT 10.16.1.11. TRT engines are not portable across GPU architectures or TRT minor versions; rebuild from the ONNX for anything else.
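
The portability rule above can be sketched as a small selection helper. This is an illustration only, not part of the published tooling: the function names are hypothetical, and treating "same major.minor" as the TensorRT compatibility criterion is an assumption drawn from the wording above (an exact version match would be the stricter choice). In practice the inputs could come from torch.cuda.get_device_capability() and tensorrt.__version__.

```python
from typing import Tuple

# Values taken from the artifact list above.
PREBUILT_SM = (12, 0)              # sm_120 (RTX 50xx)
PREBUILT_TRT = "10.16.1.11"        # TensorRT version the engine was built with
PREBUILT_ENGINE = "tensorRT/sm_120/sa3-m/dit_fp8.trt"
QUANTIZED_ONNX = "onnx/sa3-m/dit_fp8.onnx"


def trt_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a dotted TensorRT version string."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def can_use_prebuilt(gpu_cc: Tuple[int, int], trt_version: str) -> bool:
    """Hypothetical check: same GPU architecture and same TRT major.minor.

    Assumption: matching major.minor is sufficient; require an exact
    version match if you want to be conservative.
    """
    same_arch = gpu_cc == PREBUILT_SM
    same_trt = trt_major_minor(trt_version) == trt_major_minor(PREBUILT_TRT)
    return same_arch and same_trt


def pick_artifact(gpu_cc: Tuple[int, int], trt_version: str) -> str:
    """Return the prebuilt engine path, or the ONNX path to rebuild from."""
    if can_use_prebuilt(gpu_cc, trt_version):
        return PREBUILT_ENGINE
    return QUANTIZED_ONNX


if __name__ == "__main__":
    print(pick_artifact((12, 0), "10.16.1.11"))  # prebuilt engine path
    print(pick_artifact((8, 9), "10.16.1.11"))   # ONNX path: different GPU arch
```

When the helper returns the ONNX path, the engine has to be rebuilt locally with build_from_onnx.py sa3-m-fp8 as noted in the first bullet.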

Validation (vs the FP16-mixed engine, 47 prompts x 8 sigmas, L=646, RTX 5090)

  • worst single-step latent cosine (x + dt*v, n=376): 0.9982
  • 8-step compounded Euler final-latent cosine over 47 prompts: mean 0.953, median 0.957, worst 0.873 (the rollout is chaotic; a 1e-3 input perturbation alone compounds to ~0.967, so treat this as a guide, not a pass/fail gate)
  • decoded audio under the production pingpong sampler tracks the FP16-mixed generation at ~0.90 RMS-curve correlation (same conditioning and seeds) and was validated by listening: the published sm_120 engine here is the exact engine that passed that test
  • step latency B=1 L=646: 10.6-11.0 ms vs ~18.7-19.4 ms FP16-mixed (1.8x)
  • under the stochastic pingpong sampler the engine produces a different but comparable sample
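
For readers who want to reproduce the latent metrics on their own engines, the following is a minimal sketch of the two quantities reported above: the cosine similarity between single-step latents x + dt*v, and the mean/median/worst summary over a set of runs. It uses plain Python lists as stand-ins for flattened latents; the function names are hypothetical and this is not the validation script used for the numbers above.

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence


def euler_step(x: Sequence[float], v: Sequence[float], dt: float) -> List[float]:
    """One Euler update of the latent: x + dt * v, elementwise."""
    return [xi + dt * vi for xi, vi in zip(x, v)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened latents of equal length."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def summarize(cosines: Sequence[float]) -> Dict[str, float]:
    """Mean, median, and worst (minimum) cosine over a set of comparisons."""
    return {
        "mean": mean(cosines),
        "median": median(cosines),
        "worst": min(cosines),
    }


if __name__ == "__main__":
    # Toy data: the same latent x stepped with two slightly different
    # velocity predictions, standing in for the FP16-mixed and FP8 engines.
    x = [0.5, -1.0, 2.0, 0.25]
    v_ref = [1.0, 0.5, -0.5, 2.0]
    v_fp8 = [1.01, 0.49, -0.5, 2.02]
    dt = -0.125

    sim = cosine(euler_step(x, v_ref, dt), euler_step(x, v_fp8, dt))
    print(f"single-step latent cosine: {sim:.6f}")
    print(summarize([sim, 0.999, 0.998]))
```

With 47 prompts and 8 sigmas, the single-step comparison yields 376 cosine values, and the reported single-step figure is the minimum over that set.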

Inputs and outputs are FP32, so the engine is a drop-in replacement for the FP16-mixed DiT engine (sa3_trt --precision fp8, paired with the FP16-mixed decoder).
