Stable Audio Morph - ONNX Models
ONNX FP16 exports of the SAME (Semantic-Acoustic) autoencoder from Stable Audio 3 Small-Music, for browser-based latent-space audio morphing.
Models
| File | Size | Description |
|---|---|---|
| `encoder_fp16.onnx` | 104 MB | Encodes 10 s of stereo audio (44.1 kHz) to a latent of shape [1, 256, 108] |
| `decoder_fp16.onnx` | 105 MB | Decodes a latent of shape [1, 256, 108] to stereo audio |
Usage
These models run in the browser via ONNX Runtime Web with WebGPU or WASM backends.
```javascript
import * as ort from "onnxruntime-web";

// Load both sessions (paths are relative to the page that serves the models)
const encoder = await ort.InferenceSession.create("encoder_fp16.onnx");
const decoder = await ort.InferenceSession.create("decoder_fp16.onnx");

// Encode audio to latent.
// audioData: Float32Array of 2 * 441000 samples in planar order (all left, then all right)
const input = new ort.Tensor("float32", audioData, [1, 2, 441000]);
const { latent } = await encoder.run({ audio: input });

// Decode latent to audio
const { audio } = await decoder.run({ latent });
```
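Morphing between two clips amounts to blending their latents before decoding. The sketch below shows one simple way to do that, plain linear interpolation; the source does not say which blending scheme the morphing app uses, so `lerpLatents` is a hypothetical helper and not part of these models or of ONNX Runtime Web.

```javascript
// Hypothetical helper: linearly interpolate two latents of identical length.
// t = 0 returns a copy of latentA, t = 1 returns a copy of latentB.
function lerpLatents(latentA, latentB, t) {
  if (latentA.length !== latentB.length) {
    throw new Error("Latents must have the same number of elements");
  }
  const out = new Float32Array(latentA.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = (1 - t) * latentA[i] + t * latentB[i];
  }
  return out;
}
```

Assuming the encoder outputs are CPU tensors, the blended buffer could be built from their `data` arrays, wrapped in a new `ort.Tensor("float32", blended, [1, 256, 108])`, and passed to the decoder as `latent`.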
Source
- Original model: stabilityai/stable-audio-3-small-music
- Original repo: Stability-AI/stable-audio-3
- Paper: Stable Audio 3 (Stability AI, 2025)
- Autoencoder: SAME (Semantic-Acoustic autoencoder), 108M parameters
- Compression ratio: 4096x along the time axis (44.1 kHz stereo to a 256-dim latent at ~10.8 Hz)
- Training data: AudioSparx (806K recordings) + Freesound (472K recordings)
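The figures in this list can be checked against the tensor shapes with a little arithmetic. The sketch below treats 4096 as the number of audio samples per latent frame and assumes the input is padded up to a whole number of such blocks; that padding behaviour is inferred from the shapes, not stated by the source, and the function names are illustrative only.

```javascript
const SAMPLE_RATE = 44100; // Hz, from the model description
const DOWNSAMPLE = 4096;   // audio samples per latent frame (the 4096x figure)
const LATENT_DIM = 256;    // values per latent frame

// Latent frames for a clip, ASSUMING the input is padded up to a
// whole number of 4096-sample blocks.
function latentFrames(numSamples) {
  return Math.ceil(numSamples / DOWNSAMPLE);
}

// Latent frame rate in Hz.
function latentFrameRate() {
  return SAMPLE_RATE / DOWNSAMPLE;
}

// Raw audio values per latent value over the same time span.
function elementReduction(channels = 2) {
  return (channels * DOWNSAMPLE) / LATENT_DIM;
}

console.log(latentFrames(441000));         // 108
console.log(latentFrames(441000) * 4096);  // 442368
console.log(latentFrameRate().toFixed(2)); // 10.77
console.log(elementReduction());           // 32
```

Under these assumptions, 10 seconds of audio maps to 108 frames, and 108 frames map back to 442368 samples, which matches the decoder output length. Counting values rather than time steps, stereo audio shrinks by a factor of 32.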
Export Details
- Exported from SA3 Small-Music checkpoint using `torch.onnx.export` (opset 18)
- Converted to FP16 via `onnxconverter-common`
- Validated: round-trip correlation > 0.99
- No text encoder included (T5Gemma removed for latent-only usage)
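A similar correlation check can be run in the browser after loading the models. The source does not say which signals were compared or which correlation measure was used, so the sketch below assumes a Pearson correlation over two equal-length sample buffers; `pearsonCorrelation` is a hypothetical helper.

```javascript
// Pearson correlation between two equal-length sample buffers.
// Returns a value in [-1, 1], or 0 if either buffer is constant.
function pearsonCorrelation(x, y) {
  if (x.length !== y.length || x.length === 0) {
    throw new Error("Buffers must be non-empty and of equal length");
  }
  const n = x.length;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += x[i];
    sumY += y[i];
  }
  const meanX = sumX / n;
  const meanY = sumY / n;

  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  const denom = Math.sqrt(varX * varY);
  return denom === 0 ? 0 : cov / denom;
}
```

Because the decoder returns more samples than the encoder accepts, both buffers would need to be cut to the same length (for example with `subarray`) before comparing an input clip against its reconstruction.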
Technical Specs
- Input (encoder): `[1, 2, 441000]` float32, 10 seconds of stereo audio at 44.1 kHz
- Output (encoder): `[1, 256, 108]` float32, 256-dim latent, 108 temporal frames
- Input (decoder): `[1, 256, 108]` float32
- Output (decoder): `[1, 2, 442368]` float32, stereo audio
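The shapes above imply two bits of buffer handling: the encoder needs exactly 441000 samples per channel in planar order, and the decoder returns 1368 more samples per channel than went in. The sketch below is one way to handle both. It assumes the extra decoder samples sit at the end of each channel and can simply be dropped; `packStereo` and `unpackStereo` are hypothetical helpers.

```javascript
const INPUT_SAMPLES = 441000; // per channel, from the encoder input shape

// Pack left/right channels into one planar buffer [L..., R...], the
// row-major layout of a [1, 2, numSamples] tensor. Short clips are
// zero-padded; long clips are truncated.
function packStereo(left, right, numSamples = INPUT_SAMPLES) {
  const out = new Float32Array(2 * numSamples);
  out.set(left.subarray(0, numSamples), 0);
  out.set(right.subarray(0, numSamples), numSamples);
  return out;
}

// Split a planar [1, 2, totalSamples] buffer back into channels, keeping
// only the first keepSamples of each (ASSUMES the surplus is trailing).
function unpackStereo(planar, totalSamples, keepSamples = INPUT_SAMPLES) {
  const n = Math.min(keepSamples, totalSamples);
  return {
    left: planar.slice(0, n),
    right: planar.slice(totalSamples, totalSamples + n),
  };
}
```

In a browser, the two input channels would typically come from `AudioBuffer.getChannelData(0)` and `getChannelData(1)` after resampling to 44.1 kHz, and the decoder output would be unpacked with `totalSamples` set to 442368.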
License
These weights are derived from Stability AI's Stable Audio 3 model. Usage is subject to the Stability AI Community License.