Stable Audio 3 - MNN CUDA Models

Pre-converted MNN models for Stable Audio 3 inference with the MNN CUDA backend, powered by stable-audio-3-rs.

Test Environment

| Component | Model |
| --- | --- |
| GPU | NVIDIA RTX 2080 Ti (22 GB VRAM) |
| CPU | AMD Ryzen 9 5900X (12C/24T) |
| OS | Windows 11 |
| CUDA | 12.8 |
| MNN | Custom fork with Softmax fix |

Model Summary

| Model | Quantization | Size | Backend |
| --- | --- | --- | --- |
| T5Gemma (Text Encoder) | INT4 + FP16 Embed | 538 MB | MNN CPU |
| NumberConditioner | INT8 | 0.2 MB | MNN CUDA |
| DiT (Diffusion Transformer) | INT8 | 445 MB | MNN CUDA |
| Decoder (SAME-S) | INT8 (FusedWN) | 53 MB | MNN CUDA |
| Encoder (music-to-music) | INT8 | 52 MB | MNN CUDA |
| Total | | ~1.09 GB | |

Performance (--mnn-int8, 8 steps)

Music

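| Duration | T5 | NC | DiT | Decoder | Total | RTF | VRAM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10s | 0.59s | 0.01s | 0.22s | 1.25s | 2.07s | 4.8x | ~1.6 GB |
| 30s | 0.55s | 0.01s | 0.32s | 0.68s | 1.56s | 19.2x | ~1.6 GB |
| 60s | 0.58s | 0.01s | 0.50s | 1.03s | 2.13s | 28.2x | ~1.6 GB |
| 120s | 0.58s | 0.01s | 1.01s | 2.05s | 3.68s | 32.7x | ~1.6 GB |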
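
The RTF column is consistent with audio duration divided by total generation time. That definition is inferred from the numbers rather than stated in this README, so treat the following as a sanity check under that assumption, not as the author's measurement code. The function name is illustrative.

```python
# (audio duration in s, total generation time in s), copied from the Music table
MUSIC_ROWS = [(10, 2.07), (30, 1.56), (60, 2.13), (120, 3.68)]

def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of compute (assumed definition)."""
    return audio_seconds / wall_seconds

# Recompute the RTF column from the Duration and Total columns.
rtfs = [round(real_time_factor(d, t), 1) for d, t in MUSIC_ROWS]
print(rtfs)
```

The recomputed values agree with the table to within 0.1; small differences are expected because the Total column is itself rounded.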

SFX

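| Duration | T5 | NC | DiT | Decoder | Total | RTF | VRAM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10s | 0.93s | 0.01s | 0.27s | 1.44s | 2.65s | 3.8x | ~1.6 GB |
| 30s | 0.97s | 0.01s | 0.32s | 0.69s | 2.00s | 15.0x | ~1.6 GB |
| 60s | 0.57s | 0.01s | 0.48s | 1.02s | 2.10s | 28.6x | ~1.6 GB |
| 120s | 0.59s | 0.01s | 0.99s | 1.99s | 3.62s | 33.2x | ~1.6 GB |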

VRAM = incremental GPU memory used by inference (excluding desktop/display baseline). Decoder uses chunked decoding (~23.8s per chunk), so VRAM is constant regardless of audio length.
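
To illustrate why chunking keeps memory flat, here is a minimal Python sketch that splits a latent sequence into fixed windows. The chunk size of 256 latent timesteps comes from the Key Features list and the ~23.8 s per chunk from the note above; `chunk_ranges` and `chunks_for_duration` are hypothetical helper names, not functions from stable-audio-3-rs.

```python
import math

CHUNK_FRAMES = 256          # latent timesteps per decoder chunk (from Key Features)
SECONDS_PER_CHUNK = 23.8    # approximate audio covered by one chunk (from the note above)

def chunk_ranges(n_frames, chunk=CHUNK_FRAMES):
    """Return [start, end) index pairs covering n_frames in windows of `chunk`."""
    return [(s, min(s + chunk, n_frames)) for s in range(0, n_frames, chunk)]

def chunks_for_duration(seconds):
    """Estimate how many decoder passes a clip needs (assumes ~23.8 s per chunk)."""
    return math.ceil(seconds / SECONDS_PER_CHUNK)

print(chunk_ranges(600))                                    # [(0, 256), (256, 512), (512, 600)]
print([chunks_for_duration(s) for s in (10, 30, 60, 120)])  # [1, 2, 3, 6]
```

Each pass sees at most 256 frames, so peak decoder memory depends on the chunk size rather than the clip length. Whether the final partial chunk is padded up to 256 frames is not stated here.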

Quantization Strategy

| Model | Method | Size Reduction | Tool |
| --- | --- | --- | --- |
| T5 | INT4 (MatMul) + FP16 (Embedding) | 1075 → 538 MB (-50%) | MNNConvert --weightQuantBits 4 --fp16 |
| DiT | INT8 (Conv/MatMul weights) | 1754 → 445 MB (-75%) | MNNConvert --weightQuantBits 8 |
| Decoder | INT8 (Conv weights) | 200 → 53 MB (-74%) | MNNConvert --weightQuantBits 8 |
| Encoder | INT8 (Conv weights) | 205 → 52 MB (-75%) | MNNConvert --weightQuantBits 8 |
| NC | INT8 (MatMul weights) | ~0.2 MB | MNNConvert --weightQuantBits 8 |
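
For intuition about weight-only INT8 quantization, the sketch below shows symmetric per-tensor quantization in plain Python. It is a generic textbook scheme, not a description of what MNNConvert does internally (which may, for example, use per-channel scales or asymmetric ranges). The function names are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.50, -1.27, 0.003, 0.9]   # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Storing one byte per weight instead of four would give roughly the 75% shrink reported for DiT, Decoder and Encoder, assuming the originals were FP32.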

T5 Encoder

The T5Gemma text encoder runs on MNN CPU with INT4 quantization. INT4 is the default mode; use --mnn-t5-fp32 to fall back to FP32 precision.

MNN CUDA for T5 produces incorrect output (max_diff=26 vs CPU reference). The root cause was investigated extensively but not isolated to a single CUDA kernel. MNN CPU produces results identical to ONNX Runtime.

All other models (NC, DiT, Decoder, Encoder) run on MNN CUDA with negligible differences from CPU reference.
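
A backend comparison of this kind can be sketched as the largest absolute element-wise difference between two output buffers. This assumes that is what max_diff means above; the tolerance and the sample values are made up for illustration.

```python
def max_abs_diff(reference, candidate):
    """Largest absolute element-wise difference between two flat output buffers."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)

def matches_reference(reference, candidate, tol=1e-3):
    """True when the candidate backend stays within `tol` of the reference."""
    return max_abs_diff(reference, candidate) <= tol

cpu_out = [0.10, -0.52, 1.30]        # stand-in for a CPU reference output
gpu_ok = [0.1001, -0.5199, 1.3002]   # small numerical drift
gpu_bad = [0.10, 25.48, 1.30]        # one element far off

print(matches_reference(cpu_out, gpu_ok), matches_reference(cpu_out, gpu_bad))
```

A difference on the order of 26, as reported for T5 on CUDA, is far outside any plausible rounding tolerance, which is why that model is kept on the CPU backend.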

Model Architecture

Text Prompt → T5Gemma (MNN CPU INT4) → Text Embedding
Duration   → NumberConditioner (MNN CUDA INT8) → Duration Embedding
                                    ↓
                        DiT (MNN CUDA INT8) ← Diffusion Denoising
                                    ↓
                        Decoder (MNN CUDA INT8 FusedWN) → Audio Waveform

| Model | Runtime | Precision | File |
| --- | --- | --- | --- |
| T5Gemma (Text Encoder) | MNN CPU | INT4+FP16 (default) / FP32 | shared/text_encoder_int4.mnn / shared/text_encoder.mnn |
| NumberConditioner | MNN CUDA | INT8 | {variant}/int8/number_conditioner_{variant}_int8.mnn |
| DiT (Diffusion Transformer) | MNN CUDA | INT8 | {variant}/int8/dit_{variant}_int8.mnn |
| Decoder (SAME-S) | MNN CUDA | INT8 (FusedWN) | shared/decoder_fused_wn.mnn |
| Encoder | MNN CUDA | INT8 | {variant}/int8/encoder_int8.mnn |
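
The {variant} placeholders in the file names above can be expanded with a small helper. This is a hypothetical sketch for scripting downloads or checks; `int8_model_paths` is not part of stable-audio-3-rs, and it covers only the INT8 layout listed in the table.

```python
def int8_model_paths(variant, t5_fp32=False):
    """Build repo-relative model paths for one variant ('music' or 'sfx')."""
    if variant not in ("music", "sfx"):
        raise ValueError(f"unknown variant: {variant}")
    # The text encoder and decoder are shared across variants.
    text_encoder = "shared/text_encoder.mnn" if t5_fp32 else "shared/text_encoder_int4.mnn"
    return {
        "text_encoder": text_encoder,
        "number_conditioner": f"{variant}/int8/number_conditioner_{variant}_int8.mnn",
        "dit": f"{variant}/int8/dit_{variant}_int8.mnn",
        "decoder": "shared/decoder_fused_wn.mnn",
        "encoder": f"{variant}/int8/encoder_int8.mnn",
    }

for name, path in int8_model_paths("music").items():
    print(f"{name:20s} {path}")
```

Note that the repository listing shows encoder_int8.mnn only under music/int8/, so the encoder path generated for the sfx variant may not exist.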

Repository Structure

├── music/int8/          # Music variant - INT8 models
│   ├── dit_music_int8.mnn           (445 MB)
│   ├── encoder_int8.mnn             (52 MB)
│   └── number_conditioner_music_int8.mnn  (0.2 MB)
├── music/fp16/          # Music variant - FP16 models (fallback)
│   ├── dit_music_fp16_v2_f32io.mnn
│   ├── encoder.mnn
│   └── number_conditioner_music_fp16.mnn
├── sfx/int8/            # SFX variant - INT8 models
│   ├── dit_sfx_int8.mnn             (445 MB)
│   └── number_conditioner_sfx_int8.mnn    (0.2 MB)
├── sfx/fp16/            # SFX variant - FP16 models (fallback)
│   ├── dit_sfx_fp16_v2_f32io.mnn
│   └── number_conditioner_sfx_fp16.mnn
├── shared/              # Shared across variants
│   ├── decoder_fused_wn.mnn         (53 MB, INT8)
│   ├── text_encoder_int4.mnn        (538 MB, INT4, default)
│   ├── text_encoder.mnn             (1075 MB, FP32, fallback)
│   ├── tokenizer.json
│   └── bottleneck_params.json
└── dll/                 # Pre-built Windows DLLs
    ├── MNN.dll
    └── mnn_dit_bridge.dll

Key Features

  • INT8 Full Pipeline: T5 INT4 + DiT INT8 + Decoder INT8 + Encoder INT8. Total model size ~1.09 GB; RTF ranges from about 4x (10 s clips) to about 33x (120 s clips) on RTX 2080 Ti
  • Chunked Decoding: Decoder processes latents in chunks of 256 timesteps, enabling pseudo-streaming output (~23.8s of audio per chunk) and constant VRAM regardless of audio length
  • Pre-allocated Memory: Decoder is initialized with chunk_size=256 at load time, eliminating expensive resize during inference
  • WeightNorm Pre-fusion: Decoder Conv1d WeightNorm is pre-fused into weights before ONNX→MNN conversion, avoiding FP16 precision issues in L2 normalization
  • Softmax Fix: MNN CUDA Softmax kernel patched for small axis sizes (≤64); see MNN fork
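
The WeightNorm pre-fusion mentioned above can be sketched with the standard weight-norm parameterisation, where each output channel has a direction vector v and a scalar gain g, and the effective weight is g * v / ||v||. The sketch assumes the norm is taken per output channel over the flattened remaining dimensions (the common default); the actual conversion script is not shown here, and `fuse_weight_norm` is an illustrative name.

```python
import math

def fuse_weight_norm(v, g):
    """Fold weight normalisation into plain weights: w[o] = g[o] * v[o] / ||v[o]||."""
    fused = []
    for row, gain in zip(v, g):
        # L2 norm of one output channel, computed once in full precision.
        norm = math.sqrt(sum(x * x for x in row))
        fused.append([gain * x / norm for x in row])
    return fused

v = [[3.0, 4.0], [0.0, 2.0]]   # two output channels, flattened (in_channels * kernel)
g = [10.0, 1.0]                # per-channel gains
print(fuse_weight_norm(v, g))  # [[6.0, 8.0], [0.0, 1.0]]
```

Doing this once offline means the exported graph contains only plain convolution weights, so the square root and division never run in FP16 at inference time.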

MNN CUDA Bug Fixes

Our MNN fork includes these patches:

  1. Softmax Fix: SOFTMAX_AXIS_REDUCE kernel had a bug where idle threads in block reduction corrupted __shared__ memory for small axis sizes, causing echo artifacts in Decoder attention. Fixed by using the simple per-thread SOFTMAX kernel for axis ≤ 64.

  2. MatMul PrecisionType Fix: FP16FP32Mix mode incorrectly set precisionType=2 (always FP16), causing precision anomalies. Fixed to use precisionType=0 for FP16FP32Mix/FP32 modes.

  3. LoopCreator Op Type Check: CUDALoop::onExecute only supports UnaryOp/MatMul/BinaryOp for non-fused commands. Other op types (LayerNorm, Softmax, Select, etc.) were silently skipped, producing garbage output. LoopCreator now rejects unsupported op types at creation time, falling back to CPU.

  4. T5 Forced to CPU: T5 MNN CUDA produces incorrect output (root cause not isolated). T5 always runs on MNN CPU backend, producing correct results verified against ONNX Runtime.
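
The dispatch rules behind fixes 1 and 3 can be modelled in a few lines of Python. This is only a model of the decision logic as described above; the real patches are C++/CUDA changes in the MNN fork, and the names `loop_backend`, `softmax_kernel` and the kernel labels are hypothetical.

```python
CUDA_LOOP_OPS = {"UnaryOp", "MatMul", "BinaryOp"}  # op types named in fix 3
PER_THREAD_SOFTMAX_MAX_AXIS = 64                   # threshold named in fix 1

def loop_backend(op_types):
    """Pick a backend for a loop body: CUDA only if every op is supported."""
    unsupported = [op for op in op_types if op not in CUDA_LOOP_OPS]
    # Rejecting at creation time avoids silently skipping ops at execution time.
    return ("CPU", unsupported) if unsupported else ("CUDA", [])

def softmax_kernel(axis_size):
    """Choose the softmax kernel variant by axis size."""
    if axis_size <= PER_THREAD_SOFTMAX_MAX_AXIS:
        return "per_thread"
    return "block_reduce"

print(loop_backend(["MatMul", "BinaryOp"]))
print(loop_backend(["MatMul", "LayerNorm", "Softmax"]))
print(softmax_kernel(64), softmax_kernel(65))
```

Both rules fail safe: an unsupported loop falls back to the CPU, and a small softmax axis avoids the block-reduction path in which the shared-memory corruption occurred.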

Usage

See stable-audio-3-rs for the Rust inference library and CLI tool.

# INT8 mode (recommended, full pipeline)
sa3-cli --prompt "ambient electronic music" --duration 30 --steps 8 --mnn --mnn-gpu 1 --mnn-int8

# FP16 mode (higher precision fallback)
sa3-cli --prompt "ambient electronic music" --duration 30 --steps 8 --mnn --mnn-gpu 1

License

Stable Audio Community License; see LICENSE for details.
