Sidon: ONNX (speech restoration / dereverberation)
Cross-platform speech restoration (denoise + dereverberation + bandwidth restoration) exported to ONNX, for ONNX Runtime on Linux / Windows / Android / macOS. Turns a noisy/reverberant clip into studio-quality 48 kHz speech, ideal for cleaning a voice-cloning reference before TTS (it preserves speaker identity).
Two-stage pipeline:
16 kHz audio → [w2v-BERT log-mel front-end] → predictor (w2v-BERT 2.0, 8 layers)
→ cleansed features [1, T, 1024] → DAC decoder → 48 kHz audio
Variants
| variant | precision | size | parity vs FP32 | RTF (CPU) |
|---|---|---|---|---|
| int8 | weight-only int8, per-channel (activations FP) | 286 MB | cos 0.96, speaker 0.791 ≈ FP32 0.795 | ~4× |
| fp16 | FP16 (attention kept FP32) | 470 MB | cos 0.999, SI-SDR 28 dB (near-lossless) | ~4.2× |
| fp32 | FP32 | 939 MB | bit-exact | ~4.6× |
Total 246 M params (predictor 193.6 M + DAC vocoder 52.4 M). Output 48 kHz. FP16 uses ORT's transformer-aware converter (attention stays FP32).
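The file sizes in the table follow closely from these parameter counts. A rough check is sketched below. It assumes the listed sizes are in MiB (2^20 bytes) and ignores graph metadata, quantization scales, and any tensors kept at higher precision; both are assumptions, not something the export states.

```python
# Back-of-the-envelope size check from the parameter counts quoted above.
# Assumption: listed sizes are MiB, and weights dominate the file size.
MIB = 2 ** 20

PREDICTOR_PARAMS = 193.6e6
VOCODER_PARAMS = 52.4e6


def size_mib(params, bytes_per_param):
    """Approximate weight storage in MiB for a given element width."""
    return params * bytes_per_param / MIB


estimates = {
    "fp32": size_mib(PREDICTOR_PARAMS, 4) + size_mib(VOCODER_PARAMS, 4),
    "fp16": size_mib(PREDICTOR_PARAMS, 2) + size_mib(VOCODER_PARAMS, 2),
    # The int8 variant pairs an int8 predictor with the FP16 vocoder.
    "int8": size_mib(PREDICTOR_PARAMS, 1) + size_mib(VOCODER_PARAMS, 2),
}

for name, mib in estimates.items():
    print(f"{name}: ~{mib:.0f} MiB")
```

Under those assumptions the estimates land within about 2 MiB of the listed 939 / 470 / 286 MB.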
int8 is weight-only and per-channel: weights are stored as int8 with DequantizeLinear, and
activations stay in floating point. The variant pairs a 186 MB int8 predictor with the FP16
vocoder. This preserves quality (waveform cos 0.96, speaker 0.791 ≈ FP32's 0.795). The key
is to never quantize activations on this SSL encoder: activation-quantized int8
(dynamic / static QDQ), uniform per-tensor weight quantization, and 4-bit weight
quantization all corrupt the output (cos 0.26–0.59) because the CNN/attention activations
have outliers that the DAC vocoder amplifies. (ORT has no native palette/LUT op; a
Gather-codebook can reproduce CoreML's exact k-means weights, but ORT constant-folds it to
dense at load, so the saving is disk-only. Per-tensor k-means also scored 0.76 here,
worse than per-channel int8, so it isn't worth it.)
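To illustrate why per-channel scales matter for weight-only int8, here is a toy NumPy sketch. It is not the export script: the weight matrix, its row scales, and the helper names are made up for the example, and it uses simple symmetric quantization. Activations are never touched, only the weights.

```python
import numpy as np


def quantize_per_tensor(w):
    """Symmetric int8 with one scale for the whole weight matrix."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def quantize_per_channel(w):
    """Symmetric int8 with one scale per output channel (row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """What DequantizeLinear does conceptually: int8 back to float."""
    return q.astype(np.float32) * scale


def row_cosine(a, b, eps=1e-12):
    """Cosine similarity of each row of a with the matching row of b."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den


# Hypothetical weights whose rows span three orders of magnitude.
rng = np.random.default_rng(0)
row_scales = np.logspace(-3, 0, 8).reshape(8, 1)
w = (rng.standard_normal((8, 64)) * row_scales).astype(np.float32)

cos_pt = row_cosine(w, dequantize(*quantize_per_tensor(w)))
cos_pc = row_cosine(w, dequantize(*quantize_per_channel(w)))

print("per-tensor  min row cosine:", cos_pt.min())
print("per-channel min row cosine:", cos_pc.min())
```

With a single per-tensor scale, the small-magnitude rows round to zero, while per-channel scales keep every row close to its float original.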
Files
| path | description |
|---|---|
| int8/sidon-predictor.onnx | predictor, weight-only int8 per-channel (DequantizeLinear) |
| int8/sidon-vocoder.onnx | DAC decoder (FP16) |
| fp16/sidon-predictor.onnx | w2v-BERT 2.0 (8L) + merged LoRA → features (FP16) |
| fp16/sidon-vocoder.onnx | DAC decoder → 48 kHz audio (FP16) |
| fp32/sidon-predictor.onnx | predictor (FP32, bit-exact) |
| fp32/sidon-vocoder.onnx | DAC decoder (FP32) |
Quality (no-reference MOS)
FP32 is bit-exact to the PyTorch reference; FP16 is near-lossless (waveform cos 0.999).
| audio | DNSMOS OVRL | UTMOS | speaker cos |
|---|---|---|---|
| input (reverberant) | 2.90 | 2.99 | n/a |
| restored (FP32 ≈ FP16) | 3.29 | 3.40 | 0.795 |
Restoration lifts OVRL 2.90 → 3.29, driven by the DNSMOS background score (BAK) rising 3.40 → 4.11 (reverb removed).
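The parity figures quoted for the variants (waveform cos, SI-SDR) can be reproduced in spirit with the standard definitions sketched below. This card does not say exactly how its numbers were computed, so treat the function names and the zero-mean step as assumptions.

```python
import numpy as np


def waveform_cosine(ref, est):
    """Cosine similarity between two waveforms of equal length."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    est = np.asarray(est, dtype=np.float64).ravel()
    return float(ref @ est / (np.linalg.norm(ref) * np.linalg.norm(est)))


def si_sdr_db(ref, est):
    """Scale-invariant SDR in dB (signals are zero-meaned first)."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    est = np.asarray(est, dtype=np.float64).ravel()
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to remove any gain difference.
    alpha = (est @ ref) / (ref @ ref)
    target = alpha * ref
    residual = est - target
    return float(10.0 * np.log10((target @ target) / (residual @ residual)))


# Tiny synthetic check: the error is orthogonal to the reference and has
# 1/100 of its energy, so SI-SDR should be 20 dB regardless of gain.
ref = np.array([1.0, -1.0, 1.0, -1.0])
est = ref + 0.1 * np.array([1.0, 1.0, -1.0, -1.0])

print(si_sdr_db(ref, est), si_sdr_db(ref, 3.0 * est))
print(waveform_cosine(ref, est))
```

In practice `ref` would be the FP32 output and `est` the FP16 or int8 output for the same input clip.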
Front-end
The graphs take input_features [1, T, 160] produced by the w2v-BERT 2.0 (SeamlessM4T)
feature extractor from 16 kHz audio. The ONNX graphs use a dynamic time axis, so long
inputs can be chunked at runtime. The consuming engine (speech-core) handles both the
front-end and the chunking.
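Because the time axis is dynamic, a caller outside speech-core could split long feature sequences itself. The sketch below shows the simplest possible scheme, non-overlapping chunks along T. How speech-core actually chunks is not described here; the chunk length, the helper names, and the stand-in predictor are all hypothetical, and cutting attention context at chunk edges may cause audible seams that a real implementation would smooth with overlap.

```python
import numpy as np


def iter_time_chunks(features, chunk_frames):
    """Yield consecutive [1, t, D] slices of a [1, T, D] array."""
    if chunk_frames <= 0:
        raise ValueError("chunk_frames must be positive")
    total = features.shape[1]
    for start in range(0, total, chunk_frames):
        yield features[:, start:start + chunk_frames, :]


def run_chunked(run_fn, features, chunk_frames):
    """Apply run_fn to each chunk and stitch the results along time."""
    outputs = [run_fn(chunk) for chunk in iter_time_chunks(features, chunk_frames)]
    return np.concatenate(outputs, axis=1)


def fake_predictor(chunk):
    """Stand-in for the predictor session: [1, t, 160] -> [1, t, 1024]."""
    return np.zeros((1, chunk.shape[1], 1024), dtype=np.float32)


features = np.zeros((1, 250, 160), dtype=np.float32)
out = run_chunked(fake_predictor, features, chunk_frames=100)
print(out.shape)  # (1, 250, 1024)
```

With a real session, `run_fn` would wrap the predictor's `run` call and return its first output.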
Usage
import onnxruntime as ort, numpy as np
pred = ort.InferenceSession("fp16/sidon-predictor.onnx")
voc = ort.InferenceSession("fp16/sidon-vocoder.onnx")
# input_features [1, T, 160] from the w2v-BERT 2.0 feature extractor (16 kHz)
feats = pred.run(None, {"input_features": input_features})[0]
audio = voc.run(None, {"features": feats})[0] # [1, M] @ 48 kHz
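To listen to the result, the vocoder output can be written as a 48 kHz WAV file with the standard library. This is a minimal sketch that assumes the output is float audio in roughly [-1, 1] with shape [1, M]; the helper name, the output path, and the synthetic test tone are illustrative only.

```python
import os
import tempfile
import wave

import numpy as np


def save_wav_48k(path, audio, sample_rate=48000):
    """Write mono float audio (assumed in [-1, 1]) as 16-bit PCM WAV."""
    samples = np.asarray(audio, dtype=np.float32).reshape(-1)
    # Clip before scaling so out-of-range samples do not wrap around.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return len(pcm)


# Stand-in for the vocoder output: 0.1 s of a 440 Hz tone, shape [1, M].
t = np.arange(4800) / 48000.0
demo = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)[np.newaxis, :].astype(np.float32)

out_path = os.path.join(tempfile.gettempdir(), "sidon_demo.wav")
n_written = save_wav_48k(out_path, demo)
print(n_written, "samples written to", out_path)
```

In real use, pass the `audio` array from the vocoder instead of the synthetic tone.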
Source
Exported from Sidon (sarulab-speech), checkpoint sidon-v0.1; paper arXiv:2509.17052. Base SSL encoder: facebook/w2v-bert-2.0; vocoder: DAC (descript-audio-codec). All components are MIT-licensed.
Links
- speech-core β C++ engine (ONNX backend)
- speech-android β Android SDK
- soniqo.audio β website
- blog