Stable Audio 3 - MNN CUDA Models
Pre-converted MNN models for Stable Audio 3 inference with the MNN CUDA backend, powered by stable-audio-3-rs.
Test Environment
| Component | Details |
|---|---|
| GPU | NVIDIA RTX 2080 Ti (22 GB VRAM) |
| CPU | AMD Ryzen 9 5900X (12C/24T) |
| OS | Windows 11 |
| CUDA | 12.8 |
| MNN | Custom fork with Softmax fix |
Model Summary
| Model | Quantization | Size | Backend |
|---|---|---|---|
| T5Gemma (Text Encoder) | INT4 + FP16 Embed | 538 MB | MNN CPU |
| NumberConditioner | INT8 | 0.2 MB | MNN CUDA |
| DiT (Diffusion Transformer) | INT8 | 445 MB | MNN CUDA |
| Decoder (SAME-S) | INT8 (FusedWN) | 53 MB | MNN CUDA |
| Encoder (music-to-music) | INT8 | 52 MB | MNN CUDA |
| Total | | ~1.09 GB | |
Performance (--mnn-int8, 8 steps)
Music
| Duration | T5 | NC | DiT | Decoder | Total | RTF | VRAM |
|---|---|---|---|---|---|---|---|
| 10s | 0.59s | 0.01s | 0.22s | 1.25s | 2.07s | 4.8x | ~1.6 GB |
| 30s | 0.55s | 0.01s | 0.32s | 0.68s | 1.56s | 19.2x | ~1.6 GB |
| 60s | 0.58s | 0.01s | 0.50s | 1.03s | 2.13s | 28.2x | ~1.6 GB |
| 120s | 0.58s | 0.01s | 1.01s | 2.05s | 3.68s | 32.7x | ~1.6 GB |
SFX
| Duration | T5 | NC | DiT | Decoder | Total | RTF | VRAM |
|---|---|---|---|---|---|---|---|
| 10s | 0.93s | 0.01s | 0.27s | 1.44s | 2.65s | 3.8x | ~1.6 GB |
| 30s | 0.97s | 0.01s | 0.32s | 0.69s | 2.00s | 15.0x | ~1.6 GB |
| 60s | 0.57s | 0.01s | 0.48s | 1.02s | 2.10s | 28.6x | ~1.6 GB |
| 120s | 0.59s | 0.01s | 0.99s | 1.99s | 3.62s | 33.2x | ~1.6 GB |
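RTF = audio duration divided by total generation time, so higher is faster (for example, 10 s / 2.07 s ≈ 4.8x). VRAM = incremental GPU memory used by inference (excluding desktop/display baseline). The Decoder uses chunked decoding (~23.8 s of audio per chunk), so VRAM stays constant regardless of audio length.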
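
The chunking arithmetic can be sketched as follows. This is a minimal illustration that assumes only the figures given in this README (256 latent timesteps per chunk, roughly 23.8 s of audio per chunk). The function name `plan_chunks` is hypothetical and is not part of stable-audio-3-rs, and the sketch does not model how the real decoder pads or overlaps chunks.

```python
import math

CHUNK_TIMESTEPS = 256      # latent timesteps per decoder chunk (from this README)
SECONDS_PER_CHUNK = 23.8   # approximate audio covered by one chunk (from this README)


def plan_chunks(duration_s: float) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds for each decoder chunk.

    Hypothetical helper: it only shows how many fixed-size chunks a
    requested duration needs, which is why peak VRAM does not grow
    with audio length.
    """
    n_chunks = math.ceil(duration_s / SECONDS_PER_CHUNK)
    chunks = []
    for i in range(n_chunks):
        start = i * SECONDS_PER_CHUNK
        end = min(start + SECONDS_PER_CHUNK, duration_s)
        chunks.append((start, end))
    return chunks
```

Under these assumptions a 10 s request fits in one chunk, while a 120 s request needs six, each decoded with the same fixed-size buffers.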
Quantization Strategy
| Model | Method | Size Reduction | Tool |
|---|---|---|---|
| T5 | INT4 (MatMul) + FP16 (Embedding) | 1075 → 538 MB (-50%) | MNNConvert --weightQuantBits 4 --fp16 |
| DiT | INT8 (Conv/MatMul weights) | 1754 → 445 MB (-75%) | MNNConvert --weightQuantBits 8 |
| Decoder | INT8 (Conv weights) | 200 → 53 MB (-74%) | MNNConvert --weightQuantBits 8 |
| Encoder | INT8 (Conv weights) | 205 → 52 MB (-75%) | MNNConvert --weightQuantBits 8 |
| NC | INT8 (MatMul weights) | ~0.2 MB | MNNConvert --weightQuantBits 8 |
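
To make "INT8 weight quantization" concrete, here is a minimal sketch of symmetric per-output-channel weight quantization in plain Python. It illustrates the general technique only. The exact scheme MNNConvert applies with --weightQuantBits 8 (for example asymmetric ranges or block-wise scales) is not described in this README and may differ. Both function names are hypothetical.

```python
def quantize_row_int8(row: list[float]) -> tuple[list[int], float]:
    """Symmetric INT8 quantization of one output channel's weights.

    Generic illustration, not MNNConvert's implementation: the largest
    absolute weight maps to 127 and every weight is rounded to the
    nearest integer step.
    """
    max_abs = max((abs(w) for w in row), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from INT8 values and their scale."""
    return [v * scale for v in q]
```

Storing one byte per weight instead of four is consistent with the roughly 75% reductions reported for DiT, Decoder, and Encoder, assuming the originals were stored as FP32.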
T5 Encoder
The T5Gemma text encoder runs on MNN CPU with INT4 quantization. INT4 is the default mode; use --mnn-t5-fp32 to fall back to FP32 precision.
MNN CUDA for T5 produces incorrect output (max_diff=26 vs CPU reference). The root cause was extensively investigated but not isolated to a single CUDA kernel. MNN CPU produces identical results to ONNX Runtime.
All other models (NC, DiT, Decoder, Encoder) run on MNN CUDA with negligible differences from CPU reference.
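
A backend comparison of this kind can be sketched as below. It assumes that max_diff means the largest absolute element-wise difference between two output tensors, flattened here to lists of floats. That interpretation and the function name are assumptions, not taken from the project's test code.

```python
def max_abs_diff(reference: list[float], candidate: list[float]) -> float:
    """Largest absolute element-wise difference between two flattened outputs.

    Hypothetical helper: `reference` would hold the CPU (or ONNX Runtime)
    output and `candidate` the output of the backend under test.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)
```

A value near zero indicates the backends agree. A value as large as the one reported for T5 on CUDA points to a functional bug rather than rounding noise.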
Model Architecture
Text Prompt → T5Gemma (MNN CPU INT4) → Text Embedding
Duration → NumberConditioner (MNN CUDA INT8) → Duration Embedding
↓
DiT (MNN CUDA INT8) → Diffusion Denoising
↓
Decoder (MNN CUDA INT8 FusedWN) → Audio Waveform
| Model | Runtime | Precision | File |
|---|---|---|---|
| T5Gemma (Text Encoder) | MNN CPU | INT4+FP16 (default) / FP32 | shared/text_encoder_int4.mnn / shared/text_encoder.mnn |
| NumberConditioner | MNN CUDA | INT8 | {variant}/int8/number_conditioner_{variant}_int8.mnn |
| DiT (Diffusion Transformer) | MNN CUDA | INT8 | {variant}/int8/dit_{variant}_int8.mnn |
| Decoder (SAME-S) | MNN CUDA | INT8 (FusedWN) | shared/decoder_fused_wn.mnn |
| Encoder | MNN CUDA | INT8 | {variant}/int8/encoder_int8.mnn |
Repository Structure
├── music/int8/ # Music variant - INT8 models
│   ├── dit_music_int8.mnn (445 MB)
│   ├── encoder_int8.mnn (52 MB)
│   └── number_conditioner_music_int8.mnn (0.2 MB)
├── music/fp16/ # Music variant - FP16 models (fallback)
│   ├── dit_music_fp16_v2_f32io.mnn
│   ├── encoder.mnn
│   └── number_conditioner_music_fp16.mnn
├── sfx/int8/ # SFX variant - INT8 models
│   ├── dit_sfx_int8.mnn (445 MB)
│   └── number_conditioner_sfx_int8.mnn (0.2 MB)
├── sfx/fp16/ # SFX variant - FP16 models (fallback)
│   ├── dit_sfx_fp16_v2_f32io.mnn
│   └── number_conditioner_sfx_fp16.mnn
├── shared/ # Shared across variants
│   ├── decoder_fused_wn.mnn (53 MB, INT8)
│   ├── text_encoder_int4.mnn (538 MB, INT4 - default)
│   ├── text_encoder.mnn (1075 MB, FP32 - fallback)
│   ├── tokenizer.json
│   └── bottleneck_params.json
└── dll/ # Pre-built Windows DLLs
    ├── MNN.dll
    └── mnn_dit_bridge.dll
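
The layout above can be mapped to per-variant file paths with a small helper. This is a sketch built only from the file names listed in this README. The function `model_files` and its parameters are hypothetical, and the actual loader in stable-audio-3-rs may resolve files differently. It also assumes that the single listed decoder file is used in both INT8 and FP16 modes and that an encoder exists only for the music variant, since those are the only files shown.

```python
from pathlib import PurePosixPath


def model_files(variant: str, int8: bool = True,
                t5_fp32: bool = False) -> dict[str, PurePosixPath]:
    """Map a variant ("music" or "sfx") to the repository-relative model files."""
    if variant not in ("music", "sfx"):
        raise ValueError(f"unknown variant: {variant!r}")
    precision = "int8" if int8 else "fp16"
    base = PurePosixPath(variant) / precision
    shared = PurePosixPath("shared")
    # The FP16 DiT file carries an extra suffix in the listing above.
    dit_name = (f"dit_{variant}_int8.mnn" if int8
                else f"dit_{variant}_fp16_v2_f32io.mnn")
    files = {
        "text_encoder": shared / ("text_encoder.mnn" if t5_fp32
                                  else "text_encoder_int4.mnn"),
        "number_conditioner": base / f"number_conditioner_{variant}_{precision}.mnn",
        "dit": base / dit_name,
        "decoder": shared / "decoder_fused_wn.mnn",
    }
    # Only the music directories list an encoder (music-to-music).
    if variant == "music":
        files["encoder"] = base / ("encoder_int8.mnn" if int8 else "encoder.mnn")
    return files
```

For example, `model_files("sfx", int8=False)["dit"]` resolves to `sfx/fp16/dit_sfx_fp16_v2_f32io.mnn`.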
Key Features
- INT8 Full Pipeline: T5 INT4 + DiT INT8 + Decoder INT8 + Encoder INT8. Total model size ~1.09 GB, RTF 3.8-33.2x on RTX 2080 Ti depending on audio length (8 steps, see Performance)
- Chunked Decoding: Decoder processes latents in chunks of 256 timesteps, enabling pseudo-streaming output (~23.8s of audio per chunk) and constant VRAM regardless of audio length
- Pre-allocated Memory: Decoder is initialized with chunk_size=256 at load time, eliminating expensive resize during inference
- WeightNorm Pre-fusion: Decoder Conv1d WeightNorm is pre-fused into weights before ONNX→MNN conversion, avoiding FP16 precision issues in L2 normalization
- Softmax Fix: MNN CUDA Softmax kernel patched for small axis sizes (≤ 64); see the MNN fork
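
The WeightNorm pre-fusion step can be sketched as follows. This assumes PyTorch-style weight normalization, w = g · v / ‖v‖, with the L2 norm taken per output channel over the remaining dimensions. The export script itself is not shown in this README, so the function name and the nested-list layout are illustrative only.

```python
import math


def fuse_weight_norm(v: list[list[list[float]]],
                     g: list[float]) -> list[list[list[float]]]:
    """Fold WeightNorm into plain Conv1d weights.

    v: direction weights with shape [out_channels][in_channels][kernel]
    g: one gain per output channel
    Returns w = g * v / ||v||, computed once offline in double precision,
    so the L2 norm never has to be evaluated in FP16 at inference time.
    """
    fused = []
    for v_out, g_out in zip(v, g):
        norm = math.sqrt(sum(x * x for row in v_out for x in row))
        if norm == 0.0:
            raise ValueError("cannot normalize an all-zero channel")
        scale = g_out / norm
        fused.append([[x * scale for x in row] for row in v_out])
    return fused
```

After fusion the exported graph contains an ordinary convolution with fixed weights, so no normalization ops remain for the runtime to execute.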
MNN CUDA Bug Fixes
Our MNN fork includes these patches:
- Softmax Fix: The SOFTMAX_AXIS_REDUCE kernel had a bug where idle threads in block reduction corrupted __shared__ memory for small axis sizes, causing echo artifacts in Decoder attention. Fixed by using the simple per-thread SOFTMAX kernel for axis ≤ 64.
- MatMul PrecisionType Fix: FP16FP32Mix mode incorrectly set precisionType=2 (always FP16), causing precision anomalies. Fixed to use precisionType=0 for FP16FP32Mix/FP32 modes.
- LoopCreator Op Type Check: CUDALoop::onExecute only supports UnaryOp/MatMul/BinaryOp for non-fused commands. Other op types (LayerNorm, Softmax, Select, etc.) were silently skipped, producing garbage output. LoopCreator now rejects unsupported op types at creation time, falling back to CPU.
- T5 Forced to CPU: T5 MNN CUDA produces incorrect output (root cause not isolated). T5 always runs on the MNN CPU backend, producing correct results verified against ONNX Runtime.
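
As a CPU-side illustration of the Softmax fix, the sketch below shows what a per-thread softmax computes (one worker walks one row sequentially, with no shared-memory reduction) and the axis-size dispatch described above. It is not the CUDA kernel from the fork. The function names and the returned kernel labels are hypothetical, and only the threshold of 64 comes from this README.

```python
import math

PER_THREAD_MAX_AXIS = 64  # threshold stated in this README


def choose_softmax_kernel(axis_size: int) -> str:
    """Pick the per-thread path for small axes, block reduction otherwise."""
    return "per_thread" if axis_size <= PER_THREAD_MAX_AXIS else "axis_reduce"


def softmax_row(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row, as a single thread would do it.

    Three sequential passes (max, sum of exponentials, normalize) need no
    cross-thread communication, so idle threads cannot corrupt shared state.
    """
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]
```

The trade-off is less parallelism within a row, which is acceptable when the axis is short and there are many rows to spread across threads.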
Usage
See stable-audio-3-rs for the Rust inference library and CLI tool.
# INT8 mode (recommended, full pipeline)
sa3-cli --prompt "ambient electronic music" --duration 30 --steps 8 --mnn --mnn-gpu 1 --mnn-int8
# FP16 mode (higher precision fallback)
sa3-cli --prompt "ambient electronic music" --duration 30 --steps 8 --mnn --mnn-gpu 1
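
For scripted or batch runs, the same commands can be assembled programmatically. The sketch below uses only the flags shown above. The helper `build_sa3_command` is hypothetical, and the value passed to --mnn-gpu is simply forwarded as in the examples, since its exact meaning is not documented here.

```python
import shlex


def build_sa3_command(prompt: str, duration_s: int, steps: int = 8,
                      int8: bool = True, mnn_gpu: int = 1) -> list[str]:
    """Build an sa3-cli argument list from the flags shown in this README."""
    cmd = [
        "sa3-cli",
        "--prompt", prompt,
        "--duration", str(duration_s),
        "--steps", str(steps),
        "--mnn",
        "--mnn-gpu", str(mnn_gpu),
    ]
    if int8:
        # INT8 mode (recommended); omit for the FP16 fallback.
        cmd.append("--mnn-int8")
    return cmd


if __name__ == "__main__":
    # Print a shell-safe command line without executing it.
    print(shlex.join(build_sa3_command("ambient electronic music", 30)))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with prompts that contain spaces or quotes.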
License
Stable Audio Community License; see LICENSE for details.