Stable Audio 3 — MNN CUDA Models

Pre-converted MNN models for Stable Audio 3 inference on the MNN CUDA backend, for use with stable-audio-3-rs.

Test Environment

Component	Model / Version
GPU	NVIDIA RTX 2080 Ti (22 GB VRAM)
CPU	AMD Ryzen 9 5900X (12C/24T)
OS	Windows 11
CUDA	12.8
MNN	Custom fork with Softmax fix

Model Summary

Model	Quantization	Size	Backend
T5Gemma (Text Encoder)	INT4 + FP16 Embed	538 MB	MNN CPU
NumberConditioner	INT8	0.2 MB	MNN CUDA
DiT (Diffusion Transformer)	INT8	445 MB	MNN CUDA
Decoder (SAME-S)	INT8 (FusedWN)	53 MB	MNN CUDA
Encoder (music-to-music)	INT8	52 MB	MNN CUDA
Total		~1.09 GB

Performance (`--mnn-int8`, 8 steps)

Music

Duration	T5	NC	DiT	Decoder	Total	RTF	VRAM
10s	0.59s	0.01s	0.22s	1.25s	2.07s	4.8x	~1.6 GB
30s	0.55s	0.01s	0.32s	0.68s	1.56s	19.2x	~1.6 GB
60s	0.58s	0.01s	0.50s	1.03s	2.13s	28.2x	~1.6 GB
120s	0.58s	0.01s	1.01s	2.05s	3.68s	32.7x	~1.6 GB

SFX

Duration	T5	NC	DiT	Decoder	Total	RTF	VRAM
10s	0.93s	0.01s	0.27s	1.44s	2.65s	3.8x	~1.6 GB
30s	0.97s	0.01s	0.32s	0.69s	2.00s	15.0x	~1.6 GB
60s	0.57s	0.01s	0.48s	1.02s	2.10s	28.6x	~1.6 GB
120s	0.59s	0.01s	0.99s	1.99s	3.62s	33.2x	~1.6 GB

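RTF = real-time factor: audio duration divided by total wall-clock time, so higher is faster (for example, 10 s / 2.07 s ≈ 4.8x). NC = NumberConditioner. VRAM = incremental GPU memory used by inference (excluding the desktop/display baseline). The Decoder uses chunked decoding (~23.8 s of audio per chunk), so VRAM stays constant regardless of audio length.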
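The chunking arithmetic behind that note can be sketched in Python. The 256-frame chunk size and the ~23.8 s per chunk come from this README. The 44.1 kHz sample rate and the 4096 samples per latent frame are assumptions, chosen because 256 × 4096 / 44100 ≈ 23.8 s. `plan_chunks` is a hypothetical helper, not part of stable-audio-3-rs.

```python
import math

CHUNK_FRAMES = 256         # from the README: decoder chunk size in latent timesteps
SAMPLE_RATE = 44_100       # assumption: 44.1 kHz output
SAMPLES_PER_FRAME = 4096   # assumption: audio samples produced per latent frame


def chunk_seconds() -> float:
    """Audio seconds produced by one full decoder chunk."""
    return CHUNK_FRAMES * SAMPLES_PER_FRAME / SAMPLE_RATE


def plan_chunks(duration_s: float) -> list:
    """Return (start, end) latent-frame ranges covering `duration_s` of audio."""
    total_frames = math.ceil(duration_s * SAMPLE_RATE / SAMPLES_PER_FRAME)
    return [
        (start, min(start + CHUNK_FRAMES, total_frames))
        for start in range(0, total_frames, CHUNK_FRAMES)
    ]
```

Under these assumptions no chunk ever exceeds 256 frames, so peak decoder memory depends on the chunk size rather than the clip length, which is consistent with the flat ~1.6 GB VRAM column.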

Quantization Strategy

Model	Method	Size Reduction	Tool
T5	INT4 (MatMul) + FP16 (Embedding)	1075→538 MB (-50%)	MNNConvert `--weightQuantBits 4 --fp16`
DiT	INT8 (Conv/MatMul weights)	1754→445 MB (-75%)	MNNConvert `--weightQuantBits 8`
Decoder	INT8 (Conv weights)	200→53 MB (-74%)	MNNConvert `--weightQuantBits 8`
Encoder	INT8 (Conv weights)	205→52 MB (-75%)	MNNConvert `--weightQuantBits 8`
NC	INT8 (MatMul weights)	~0.2 MB	MNNConvert `--weightQuantBits 8`
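The reduction percentages in the table can be reproduced with a few lines of Python. The sizes are copied from the table. The comparison against an ideal ratio assumes the unquantized models store FP32 weights, which this README confirms only for the T5 fallback file.

```python
# Sizes in MB, copied from the table above: (original, quantized).
SIZES_MB = {
    "T5": (1075, 538),
    "DiT": (1754, 445),
    "Decoder": (200, 53),
    "Encoder": (205, 52),
}


def reduction_percent(original_mb: float, quantized_mb: float) -> float:
    """Size change as a negative percentage, e.g. -75.0 for a 4x shrink."""
    return -(1.0 - quantized_mb / original_mb) * 100.0


def ideal_ratio(bits: int, source_bits: int = 32) -> float:
    """Best-case size ratio if every weight went from source_bits to bits."""
    return bits / source_bits


if __name__ == "__main__":
    for name, (orig, quant) in SIZES_MB.items():
        print(f"{name}: {reduction_percent(orig, quant):.1f}%")
```

The INT8 models land close to the ideal 8/32 = 0.25 ratio. T5 stays near -50% despite INT4 MatMul weights, presumably because its embedding is kept in FP16.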

T5 Encoder

The T5Gemma text encoder runs on MNN CPU with INT4 quantization. INT4 is the default mode; use `--mnn-t5-fp32` to fall back to FP32 precision.

Running T5 on MNN CUDA produces incorrect output (max_diff=26 against the CPU reference). The root cause was investigated extensively but could not be isolated to a single CUDA kernel. On MNN CPU, the output is identical to ONNX Runtime.

All other models (NC, DiT, Decoder, Encoder) run on MNN CUDA with negligible differences from CPU reference.
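A backend comparison of this kind can be sketched as below. This assumes max_diff means the largest element-wise absolute difference between two flattened output tensors; the function names and the tolerance value are hypothetical and are not taken from the project's test code.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def matches(reference, candidate, tol=1e-2):
    """True when the candidate stays within `tol` of the reference everywhere.

    `tol` is an illustrative threshold, not a value used by the project.
    """
    return max_abs_diff(reference, candidate) <= tol
```

With a check like this, a max_diff of 26 fails by a wide margin, while a backend with negligible differences passes.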

Model Architecture

Text Prompt → T5Gemma (MNN CPU INT4) → Text Embedding
Duration   → NumberConditioner (MNN CUDA INT8) → Duration Embedding
                                    ↓
                        DiT (MNN CUDA INT8) ← Diffusion Denoising
                                    ↓
                        Decoder (MNN CUDA INT8 FusedWN) → Audio Waveform

Model	Runtime	Precision	File
T5Gemma (Text Encoder)	MNN CPU	INT4+FP16 (default) / FP32	`shared/text_encoder_int4.mnn` / `shared/text_encoder.mnn`
NumberConditioner	MNN CUDA	INT8	`{variant}/int8/number_conditioner_{variant}_int8.mnn`
DiT (Diffusion Transformer)	MNN CUDA	INT8	`{variant}/int8/dit_{variant}_int8.mnn`
Decoder (SAME-S)	MNN CUDA	INT8 (FusedWN)	`shared/decoder_fused_wn.mnn`
Encoder	MNN CUDA	INT8	`{variant}/int8/encoder_int8.mnn`
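The `{variant}` file-naming pattern in the table can be resolved with a small helper. This is a sketch only: `model_paths` is a hypothetical function, not an API of stable-audio-3-rs. The encoder is left out because the repository tree in this README lists `encoder_int8.mnn` only under `music/int8/`.

```python
from pathlib import PurePosixPath

VARIANTS = ("music", "sfx")


def model_paths(variant: str, t5_fp32: bool = False) -> dict:
    """Map each text-to-audio stage to its file, relative to the repo root."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    # INT4 is the default text encoder; the FP32 file is the fallback.
    t5 = "text_encoder.mnn" if t5_fp32 else "text_encoder_int4.mnn"
    int8_dir = PurePosixPath(variant) / "int8"
    return {
        "t5": PurePosixPath("shared") / t5,
        "number_conditioner": int8_dir / f"number_conditioner_{variant}_int8.mnn",
        "dit": int8_dir / f"dit_{variant}_int8.mnn",
        "decoder": PurePosixPath("shared") / "decoder_fused_wn.mnn",
    }
```

For example, `model_paths("sfx")["dit"]` resolves to `sfx/int8/dit_sfx_int8.mnn`.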

Repository Structure

├── music/int8/          # Music variant — INT8 models
│   ├── dit_music_int8.mnn           (445 MB)
│   ├── encoder_int8.mnn             (52 MB)
│   └── number_conditioner_music_int8.mnn  (0.2 MB)
├── music/fp16/          # Music variant — FP16 models (fallback)
│   ├── dit_music_fp16_v2_f32io.mnn
│   ├── encoder.mnn
│   └── number_conditioner_music_fp16.mnn
├── sfx/int8/            # SFX variant — INT8 models
│   ├── dit_sfx_int8.mnn             (445 MB)
│   └── number_conditioner_sfx_int8.mnn    (0.2 MB)
├── sfx/fp16/            # SFX variant — FP16 models (fallback)
│   ├── dit_sfx_fp16_v2_f32io.mnn
│   └── number_conditioner_sfx_fp16.mnn
├── shared/              # Shared across variants
│   ├── decoder_fused_wn.mnn         (53 MB, INT8)
│   ├── text_encoder_int4.mnn        (538 MB, INT4 — default)
│   ├── text_encoder.mnn             (1075 MB, FP32 — fallback)
│   ├── tokenizer.json
│   └── bottleneck_params.json
└── dll/                 # Pre-built Windows DLLs
    ├── MNN.dll
    └── mnn_dit_bridge.dll

Key Features

INT8 Full Pipeline: T5 INT4 + DiT INT8 + Decoder INT8 + Encoder INT8. Total model size ~1.09 GB; RTF ranges from about 4x (10 s clips) to 33x (120 s clips) on RTX 2080 Ti
Chunked Decoding: Decoder processes latents in chunks of 256 timesteps, enabling pseudo-streaming output (~23.8s of audio per chunk) and constant VRAM regardless of audio length
Pre-allocated Memory: Decoder is initialized with chunk_size=256 at load time, eliminating expensive resize during inference
WeightNorm Pre-fusion: Decoder Conv1d WeightNorm is pre-fused into weights before ONNX→MNN conversion, avoiding FP16 precision issues in L2 normalization
Softmax Fix: MNN CUDA Softmax kernel patched for small axis sizes (≤64) — see MNN fork
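WeightNorm pre-fusion can be illustrated with the standard weight-normalization formula, w = g · v / ‖v‖₂, applied once per output channel before export. The sketch below uses plain Python lists and assumes the per-output-channel norm that PyTorch uses by default for convolution weights. Whether the project's export script is written this way is an assumption, and `fuse_weight_norm` is a hypothetical name.

```python
import math


def fuse_weight_norm(v, g):
    """Fold WeightNorm into plain weights: w[o] = g[o] * v[o] / ||v[o]||_2.

    v: one row per output channel, flattened over input channels and
       kernel taps; g: one gain per output channel.
    """
    fused = []
    for row, gain in zip(v, g):
        # The L2 norm is computed once here, in double precision,
        # instead of at inference time in FP16.
        norm = math.sqrt(sum(x * x for x in row))
        fused.append([gain * x / norm for x in row])
    return fused
```

After fusion the exported graph contains an ordinary convolution, so the L2 normalization never runs at reduced precision on the GPU.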

MNN CUDA Bug Fixes

Our MNN fork includes these patches:

Softmax Fix: SOFTMAX_AXIS_REDUCE kernel had a bug where idle threads in block reduction corrupted __shared__ memory for small axis sizes, causing echo artifacts in Decoder attention. Fixed by using simple per-thread SOFTMAX kernel for axis ≤ 64.
MatMul PrecisionType Fix: FP16FP32Mix mode incorrectly set precisionType=2 (always FP16), causing precision anomalies. Fixed to use precisionType=0 for FP16FP32Mix/FP32 modes.
LoopCreator Op Type Check: CUDALoop::onExecute only supports UnaryOp/MatMul/BinaryOp for non-fused commands. Other op types (LayerNorm, Softmax, Select, etc.) were silently skipped, producing garbage output. LoopCreator now rejects unsupported op types at creation time, falling back to CPU.
T5 Forced to CPU: T5 MNN CUDA produces incorrect output (root cause not isolated). T5 always runs on MNN CPU backend, producing correct results verified against ONNX Runtime.
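The Softmax dispatch rule from the first patch can be mirrored in a few lines of Python, together with a CPU reference that is handy when checking a GPU kernel on small axis sizes. The kernel names and the threshold of 64 come from the description above; the function names are hypothetical, and the actual fix lives in the fork's CUDA code.

```python
import math

SMALL_AXIS_LIMIT = 64  # from the README: axis sizes <= 64 use the simple kernel


def pick_softmax_kernel(axis_size: int) -> str:
    """Mirror of the dispatch rule described in the Softmax fix."""
    return "SOFTMAX" if axis_size <= SMALL_AXIS_LIMIT else "SOFTMAX_AXIS_REDUCE"


def softmax_reference(row):
    """Numerically stable softmax over one row, usable as a CPU reference."""
    peak = max(row)                            # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]
```

Comparing a kernel's output against `softmax_reference` for axis sizes around the threshold is one way to catch regressions like the shared-memory corruption described above.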

Usage

See stable-audio-3-rs for the Rust inference library and CLI tool.

# INT8 mode (recommended, full pipeline)
sa3-cli --prompt "ambient electronic music" --duration 30 --steps 8 --mnn --mnn-gpu 1 --mnn-int8

# FP16 mode (higher precision fallback)
sa3-cli --prompt "ambient electronic music" --duration 30 --steps 8 --mnn --mnn-gpu 1
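
For scripted runs, the two command lines above can be generated from Python. This sketch only assembles the argument list from the flags shown in this README; it does not run anything, and `sa3_command` is a hypothetical helper rather than part of the CLI.

```python
import shlex


def sa3_command(prompt: str, duration: int, steps: int = 8, int8: bool = True) -> list:
    """Build the sa3-cli argument list using only the flags shown above."""
    args = [
        "sa3-cli",
        "--prompt", prompt,
        "--duration", str(duration),
        "--steps", str(steps),
        "--mnn",
        "--mnn-gpu", "1",
    ]
    if int8:
        args.append("--mnn-int8")  # omit this flag for the FP16 fallback
    return args


def as_shell(args: list) -> str:
    """Render the argument list as a copy-pasteable shell line."""
    return shlex.join(args)
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting problems with prompts that contain spaces or quotes.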

License

Stable Audio Community License — see LICENSE for details.
