Sidon ONNX (speech restoration / dereverberation)

Cross-platform speech restoration (denoise + dereverberation + bandwidth restoration) exported to ONNX, for ONNX Runtime on Linux / Windows / Android / macOS. Turns a noisy/reverberant clip into studio-quality 48 kHz speech, which makes it well suited to cleaning a voice-cloning reference before TTS (it preserves speaker identity).

Two-stage pipeline:

16 kHz audio → [w2v-BERT log-mel front-end] → predictor (w2v-BERT 2.0, 8 layers)
            → cleansed features [1, T, 1024] → DAC decoder → 48 kHz audio

Variants

| variant | precision | size | parity vs FP32 | RTF (CPU) |
|---|---|---|---|---|
| int8 | weight-only int8, per-channel (activations FP) | 286 MB | cos 0.96, speaker 0.791 ≈ FP32 0.795 | ~4× |
| fp16 | FP16 (attention kept FP32) | 470 MB | cos 0.999, SI-SDR 28 dB (near-lossless) | ~4.2× |
| fp32 | FP32 | 939 MB | bit-exact | ~4.6× |

Total 246 M params (predictor 193.6 M + DAC vocoder 52.4 M). Output 48 kHz. FP16 uses ORT's transformer-aware converter (attention stays FP32).
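The listed sizes follow roughly from the parameter counts. The sketch below is a back-of-the-envelope check, not part of the export tooling. It assumes the sizes are reported in MiB (2^20 bytes), that weights take 1, 2 or 4 bytes for int8, FP16 and FP32, and that graph metadata and quantization scales can be ignored. The helper name `size_mib` is hypothetical.

```python
# Rough on-disk size estimate from parameter counts (illustrative only).
# Assumptions: sizes are in MiB (2**20 bytes); graph metadata, quantization
# scales and any tensors kept in a different precision are ignored.
PREDICTOR_PARAMS = 193.6e6
VOCODER_PARAMS = 52.4e6


def size_mib(params: float, bytes_per_param: int) -> float:
    """Return the raw weight size in MiB for a given storage width."""
    return params * bytes_per_param / 2**20


fp32 = size_mib(PREDICTOR_PARAMS, 4) + size_mib(VOCODER_PARAMS, 4)
fp16 = size_mib(PREDICTOR_PARAMS, 2) + size_mib(VOCODER_PARAMS, 2)
# int8 variant: int8 predictor weights plus the FP16 vocoder.
int8 = size_mib(PREDICTOR_PARAMS, 1) + size_mib(VOCODER_PARAMS, 2)

for name, est in [("fp32", fp32), ("fp16", fp16), ("int8", int8)]:
    print(f"{name}: ~{est:.0f} MiB")
```

Under these assumptions the estimates land within a couple of MiB of the 939 / 470 / 286 MB listed for the three variants; the small remainder is plausibly overhead that the estimate leaves out.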

The int8 variant is weight-only and per-channel: weights are stored as int8 and expanded through DequantizeLinear, while activations stay in floating point. The predictor is 186 MB and is paired with the FP16 vocoder. This preserves quality (waveform cos 0.96, speaker 0.791 ≈ FP32's 0.795). The key is never to quantize activations on this SSL encoder. Activation-quantizing int8 (dynamic or static QDQ), as well as uniform per-tensor and 4-bit weight quantization, all corrupt the output (cos 0.26–0.59) because the CNN/attention activations have outliers that the DAC vocoder amplifies. (ORT has no native palette/LUT op. A Gather-based codebook can reproduce CoreML's exact k-means weights, but ORT constant-folds it to dense weights at load, so the saving is disk-only. Per-tensor k-means scored 0.76 here, worse than per-channel int8, so it isn't worth it.)
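To make the per-channel versus per-tensor distinction concrete, here is a toy NumPy sketch of symmetric int8 weight quantization. It is not the export script used for these files. The matrix is random, the outlier row is synthetic, and the function names are hypothetical. It only shows why one scale per output channel keeps small-magnitude channels intact when another channel has a much larger range.

```python
import numpy as np


def quantize_per_tensor(w: np.ndarray) -> np.ndarray:
    """Symmetric int8 quantize/dequantize with one scale for the whole matrix."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale


def quantize_per_channel(w: np.ndarray) -> np.ndarray:
    """Symmetric int8 quantize/dequantize with one scale per output row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale


def mean_row_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Average cosine similarity between matching rows of a and b."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((num / den).mean())


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(8, 64)).astype(np.float32)
w[0] *= 50.0  # synthetic outlier channel that dominates the global range

cos_tensor = mean_row_cosine(w, quantize_per_tensor(w))
cos_channel = mean_row_cosine(w, quantize_per_channel(w))
print(f"per-tensor cos:  {cos_tensor:.4f}")
print(f"per-channel cos: {cos_channel:.4f}")
```

With a single global scale, the outlier row sets the step size and the other rows collapse onto a handful of levels, so their cosine against the original drops. With one scale per row, every row uses the full int8 range. In an ONNX graph the same idea is expressed by giving DequantizeLinear a 1-D scale tensor and an `axis` attribute.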

Files

| path | description |
|---|---|
| int8/sidon-predictor.onnx | predictor, weight-only int8 per-channel (DequantizeLinear) |
| int8/sidon-vocoder.onnx | DAC decoder (FP16) |
| fp16/sidon-predictor.onnx | w2v-BERT 2.0 (8L) + merged LoRA → features (FP16) |
| fp16/sidon-vocoder.onnx | DAC decoder → 48 kHz audio (FP16) |
| fp32/sidon-predictor.onnx | predictor (FP32, bit-exact) |
| fp32/sidon-vocoder.onnx | DAC decoder (FP32) |

Quality (no-reference MOS)

FP32 is bit-exact to the PyTorch reference; FP16 is near-lossless (waveform cos 0.999).

| audio | DNSMOS OVRL | UTMOS | speaker cos |
|---|---|---|---|
| input (reverberant) | 2.90 | 2.99 | n/a |
| restored (FP32 ≈ FP16) | 3.29 | 3.40 | 0.795 |

Restoration lifts DNSMOS OVRL from 2.90 to 3.29, driven by the background score (BAK) rising from 3.40 to 4.11 as reverb is removed.
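The parity figures quoted in this card (waveform cosine, SI-SDR) can be reproduced in spirit with a few lines of NumPy. The card does not say exactly how they were computed, so the sketch below uses the common textbook definitions and should be read as an assumption. It expects two time-aligned, equal-length waveforms and skips the mean removal that some SI-SDR implementations apply. The function names are hypothetical.

```python
import numpy as np


def waveform_cosine(ref: np.ndarray, est: np.ndarray) -> float:
    """Cosine similarity between two equal-length waveforms."""
    ref = ref.astype(np.float64).ravel()
    est = est.astype(np.float64).ravel()
    return float(ref @ est / (np.linalg.norm(ref) * np.linalg.norm(est)))


def si_sdr_db(ref: np.ndarray, est: np.ndarray, eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB of est against ref (no mean removal here)."""
    ref = ref.astype(np.float64).ravel()
    est = est.astype(np.float64).ravel()
    # Project the estimate onto the reference to get the scaled target.
    target = (est @ ref) / (ref @ ref + eps) * ref
    noise = est - target
    return float(10.0 * np.log10((target @ target + eps) / (noise @ noise + eps)))


# Tiny hand-checkable case: the error is orthogonal to the reference and has
# 1/100 of its energy, which corresponds to 20 dB.
ref = np.array([1.0, -1.0, 0.0, 0.0])
est = np.array([1.0, -1.0, 0.1, -0.1])
print(waveform_cosine(ref, est), si_sdr_db(ref, est))
```

For a real comparison, `ref` would be the FP32 (or PyTorch) output and `est` the output of the variant under test on the same input.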

Front-end

The graphs take input_features [1, T, 160] from the w2v-BERT 2.0 SeamlessM4T feature extractor (16 kHz). The time axis T is dynamic, so long inputs can be chunked at runtime. The front-end and chunking are handled by the consuming engine (speech-core).
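As an illustration of what chunking along the dynamic time axis can look like, here is a minimal non-overlapping chunker. It is a sketch under stated assumptions, not how speech-core does it: the chunk length of 100 frames is arbitrary, `iter_chunks`, `run_chunked` and `dummy_predictor` are hypothetical names, and the dummy function only mimics the predictor's [1, t, 160] to [1, t, 1024] shape contract.

```python
from typing import Callable, Iterator

import numpy as np


def iter_chunks(features: np.ndarray, chunk_frames: int) -> Iterator[np.ndarray]:
    """Yield consecutive [1, <=chunk_frames, D] slices along the time axis."""
    if chunk_frames <= 0:
        raise ValueError("chunk_frames must be positive")
    for start in range(0, features.shape[1], chunk_frames):
        yield features[:, start:start + chunk_frames, :]


def run_chunked(
    features: np.ndarray,
    chunk_frames: int,
    fn: Callable[[np.ndarray], np.ndarray],
) -> np.ndarray:
    """Apply fn to each chunk and concatenate the results along time."""
    outputs = [fn(chunk) for chunk in iter_chunks(features, chunk_frames)]
    return np.concatenate(outputs, axis=1)


def dummy_predictor(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for the predictor: maps [1, t, 160] to [1, t, 1024]."""
    return np.zeros((1, chunk.shape[1], 1024), dtype=np.float32)


input_features = np.zeros((1, 250, 160), dtype=np.float32)
out = run_chunked(input_features, chunk_frames=100, fn=dummy_predictor)
print(out.shape)  # (1, 250, 1024)
```

In practice `fn` would wrap a call to the predictor session. Because the encoder attends across frames, hard chunk boundaries can leave audible seams, so a real engine would likely overlap chunks and cross-fade or trim the edges.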

Usage

import numpy as np
import onnxruntime as ort

pred = ort.InferenceSession("fp16/sidon-predictor.onnx")
voc = ort.InferenceSession("fp16/sidon-vocoder.onnx")
# input_features: array of shape [1, T, 160] from the w2v-BERT 2.0 feature extractor (16 kHz audio)
feats = pred.run(None, {"input_features": input_features})[0]  # cleansed features [1, T, 1024]
audio = voc.run(None, {"features": feats})[0]  # [1, M] @ 48 kHz
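To listen to the result, the [1, M] output can be written as a 48 kHz WAV file. The sketch below uses only NumPy and the standard-library `wave` module. It assumes the vocoder output is a mono float waveform roughly in [-1, 1], which the card does not state explicitly, and it clips anything outside that range. The helper name `write_wav_48k` is hypothetical, and a synthetic tone stands in for the model output.

```python
import io
import wave

import numpy as np


def write_wav_48k(dest, audio: np.ndarray, sample_rate: int = 48000) -> None:
    """Write a mono float waveform of shape [1, M] or [M] as 16-bit PCM WAV."""
    mono = np.asarray(audio, dtype=np.float32).reshape(-1)
    # Assumption: samples are floats in [-1, 1]; clip before scaling to int16.
    pcm = (np.clip(mono, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())


# Stand-in for the vocoder output: 0.1 s of a 440 Hz tone, shaped [1, M].
t = np.arange(4800) / 48000.0
audio = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)[None, :]

buffer = io.BytesIO()  # pass a path such as "restored.wav" to save to disk
write_wav_48k(buffer, audio)
```

The cast to float32 also covers the case where an FP16 graph returns float16 samples.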

Source

Exported from Sidon (sarulab-speech), checkpoint sidon-v0.1; paper arXiv:2509.17052. Base SSL encoder: facebook/w2v-bert-2.0; vocoder: DAC (descript-audio-codec). All components are MIT-licensed.
