SA3-medium DiT: FP8 TensorRT artifacts
FP8 GEMM-trunk quantization of the Stable Audio 3 medium DiT, built from
stabilityai/stable-audio-3-optimized onnx/sa3-m/dit_fp16mixed.onnx with the
producer recipe in Stability-AI/stable-audio-3 PR #47
(build/make_calib.py + build/build_dit_fp8.py). This is a derivative of
Stability AI's model weights and is distributed under the Stability AI
Community License; see the base model for terms.
Contents
- onnx/sa3-m/dit_fp8.onnx + dit_fp8.onnx.data: the quantized ONNX (arch-independent; compile with build_from_onnx.py sa3-m-fp8, plain STRONGLY_TYPED, no ModelOpt needed)
- tensorRT/sm_120/sa3-m/dit_fp8.trt: prebuilt engine for RTX 50xx (sm_120), TensorRT 10.16.1.11. TRT engines are not portable across GPU architectures or TRT minor versions; rebuild from the ONNX for anything else.
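The portability rule above can be sketched as a small selection helper. This is an illustrative sketch, not part of the repository: the function names (`trt_major_minor`, `can_use_prebuilt`, `pick_artifact`) are hypothetical, and it assumes sm_120 corresponds to CUDA compute capability (12, 0). The only values taken from this card are the file paths, the target architecture, and the TensorRT version.

```python
from typing import Tuple

# Values stated on this card for the prebuilt engine.
ENGINE_SM = (12, 0)          # sm_120, assumed to mean compute capability 12.0
ENGINE_TRT = "10.16.1.11"    # TensorRT version the engine was built with

PREBUILT_ENGINE = "tensorRT/sm_120/sa3-m/dit_fp8.trt"
QUANTIZED_ONNX = "onnx/sa3-m/dit_fp8.onnx"


def trt_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a dotted TensorRT version string."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def can_use_prebuilt(compute_capability: Tuple[int, int], trt_version: str) -> bool:
    """True only if both the GPU architecture and the TRT major.minor match."""
    same_arch = tuple(compute_capability) == ENGINE_SM
    same_trt = trt_major_minor(trt_version) == trt_major_minor(ENGINE_TRT)
    return same_arch and same_trt


def pick_artifact(compute_capability: Tuple[int, int], trt_version: str) -> str:
    """Pick the prebuilt engine when it is usable, else the ONNX to rebuild from."""
    if can_use_prebuilt(compute_capability, trt_version):
        return PREBUILT_ENGINE
    return QUANTIZED_ONNX


if __name__ == "__main__":
    print(pick_artifact((12, 0), "10.16.1.11"))  # matching setup
    print(pick_artifact((8, 9), "10.16.1.11"))   # different GPU architecture
```

On a real machine the two inputs could plausibly come from `torch.cuda.get_device_capability()` and `tensorrt.__version__`. When the helper returns the ONNX path, the engine still has to be rebuilt with the build script named above.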
Validation (vs the FP16-mixed engine, 47 prompts x 8 sigmas, L=646, RTX 5090)
- worst single-step latent cosine (x + dt*v, n=376): 0.9982
- 8-step compounded Euler final-latent cosine over 47 prompts: mean 0.953, median 0.957, worst 0.873 (the rollout is chaotic; a 1e-3 input perturbation alone compounds to ~0.967, so this is a guide, not a gate)
- decoded audio under the production pingpong sampler tracks the FP16-mixed generation at ~0.90 RMS-curve correlation (same conditioning and seeds) and was validated by listening: the published sm_120 engine here is the exact engine that passed that test
- step latency B=1 L=646: 10.6-11.0 ms vs ~18.7-19.4 ms FP16-mixed (1.8x)
- under the stochastic pingpong sampler the engine produces a different but comparable sample
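The single-step latent cosine above compares the Euler-updated latent x + dt*v from two engines given the same input. A minimal sketch of that metric follows, assuming NumPy arrays for the latents and velocities; the function names and the synthetic data are hypothetical and do not reproduce the validation script or its numbers.

```python
import numpy as np


def euler_step(x: np.ndarray, v: np.ndarray, dt: float) -> np.ndarray:
    """One Euler update of the latent: x + dt * v."""
    return x + dt * v


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, flattened to vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def worst_single_step_cosine(x, v_ref, v_test, dt) -> float:
    """Minimum cosine over n cases between reference and test updated latents.

    x, v_ref, v_test have shape (n, ...); dt has shape (n,), one step size per case.
    """
    scores = [
        cosine(euler_step(xi, vr, dti), euler_step(xi, vt, dti))
        for xi, vr, vt, dti in zip(x, v_ref, v_test, dt)
    ]
    return min(scores)


if __name__ == "__main__":
    # Synthetic stand-ins for (prompt, sigma) cases; not real model outputs.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 8, 32))
    v_ref = rng.standard_normal((6, 8, 32))
    v_test = v_ref + 0.05 * rng.standard_normal(v_ref.shape)  # simulated quantization error
    dt = np.full(6, -0.125)
    print(worst_single_step_cosine(x, v_ref, v_test, dt))
```

Taking the minimum rather than the mean matches the "worst" figure reported above. For the compounded 8-step number, the same `cosine` would be applied to the final latents of two full rollouts instead.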
Inputs/outputs are FP32, drop-in for the FP16-mixed DiT engine
(sa3_trt --precision fp8, paired with the FP16-mixed decoder).
Model tree for ryanontheinside/stable-audio-3-optimized-fp8
Base model
stabilityai/stable-audio-3-medium-base Finetuned
stabilityai/stable-audio-3-medium