Sidon — ONNX (speech restoration / dereverberation)

Cross-platform speech restoration (denoise + dereverberation + bandwidth restoration) exported to ONNX, for ONNX Runtime on Linux / Windows / Android / macOS. Turns a noisy/reverberant clip into studio-quality 48 kHz speech — ideal for cleaning a voice-cloning reference before TTS (it preserves speaker identity).

Two-stage pipeline:

16 kHz audio → [w2v-BERT log-mel front-end] → predictor (w2v-BERT 2.0, 8 layers)
            → cleansed features [1, T, 1024] → DAC decoder → 48 kHz audio

Variants

variant	precision	size	parity vs FP32	RTF (CPU)
int8	weight-only int8, per-channel (activations FP)	286 MB	waveform cos 0.96, speaker cos 0.791 (FP32: 0.795)	~4×
fp16	FP16 (attention kept FP32)	470 MB	waveform cos 0.999, SI-SDR 28 dB (near-lossless)	~4.2×
fp32	FP32	939 MB	reference (bit-exact to PyTorch)	~4.6×

Total 246 M params (predictor 193.6 M + DAC vocoder 52.4 M). Output 48 kHz. FP16 uses ORT's transformer-aware converter (attention stays FP32).
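The parity figures in the table (waveform cosine similarity and SI-SDR) are standard signal-comparison metrics. The sketch below shows one common way to compute them with NumPy. The function names are illustrative, and the evaluation script behind the reported numbers is not given here, so treat the definitions as an assumption rather than the actual procedure.

```python
import numpy as np

def _aligned(ref, est):
    """Flatten both signals to float64 and trim them to a common length."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    est = np.asarray(est, dtype=np.float64).ravel()
    n = min(ref.size, est.size)
    return ref[:n], est[:n]

def waveform_cosine(ref, est, eps=1e-12):
    """Cosine similarity between two waveforms (1.0 means same shape)."""
    ref, est = _aligned(ref, est)
    return float(np.dot(ref, est) / (np.linalg.norm(ref) * np.linalg.norm(est) + eps))

def si_sdr_db(ref, est, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB."""
    ref, est = _aligned(ref, est)
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to remove any gain difference.
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    distortion = est - target
    return float(10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(distortion**2) + eps)))

# Toy check: a 100 Hz sine plus an orthogonal component at 10% amplitude.
t = np.arange(16000) / 16000.0
ref = np.sin(2 * np.pi * 100 * t)
est = ref + 0.1 * np.cos(2 * np.pi * 100 * t)
print(waveform_cosine(ref, est), si_sdr_db(ref, est))
```

To compare variants, `ref` would be the FP32 output waveform and `est` the FP16 or int8 output for the same input.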

int8 is weight-only and per-channel: weights are stored as int8 and expanded through DequantizeLinear, while activations stay in floating point. The variant pairs a 186 MB predictor with the FP16 vocoder. This preserves quality (waveform cos 0.96; speaker cos 0.791 vs. 0.795 for FP32).

The key is to never quantize activations on this SSL encoder. Activation-quantizing int8 (dynamic or static QDQ), uniform per-tensor weight quantization, and 4-bit weight quantization all corrupt the output (cos 0.26–0.59), because the CNN/attention activations have outliers that the DAC vocoder amplifies.

ORT has no native palette/LUT op. A Gather-based codebook can reproduce CoreML's exact k-means weights, but ORT constant-folds it to dense weights at load time, so the saving is disk-only. Per-tensor k-means also scored 0.76 here, worse than per-channel int8, so it isn't worth it.
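To illustrate why per-channel scales matter for weight-only int8, here is a minimal NumPy sketch that quantizes a synthetic weight matrix both per-channel and per-tensor and compares the reconstruction error. It is not the export pipeline used for these files: the helper names and toy weights are assumptions, and it does not reproduce the cos figures quoted above.

```python
import numpy as np

def quantize_per_channel(w, axis=0):
    """Symmetric int8 quantization with one scale per slice along `axis`."""
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_per_tensor(w):
    """Symmetric int8 quantization with a single scale for the whole tensor."""
    max_abs = float(np.max(np.abs(w)))
    scale = np.float32(max_abs / 127.0 if max_abs > 0 else 1.0)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """What a DequantizeLinear-style step does: int8 back to float."""
    return q.astype(np.float32) * scale

# Synthetic weights: 64 output channels, one of them with much larger values.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 256)).astype(np.float32)
w[3] *= 50.0

q_pc, s_pc = quantize_per_channel(w, axis=0)
q_pt, s_pt = quantize_per_tensor(w)

err_pc = float(np.abs(dequantize(q_pc, s_pc) - w).mean())
err_pt = float(np.abs(dequantize(q_pt, s_pt) - w).mean())
print(f"mean abs error  per-channel: {err_pc:.6f}  per-tensor: {err_pt:.6f}")
```

With a single per-tensor scale, the one large channel sets the step size for every other channel, so the small-valued channels lose most of their resolution. Per-channel scales avoid that.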

Files

path	description
`int8/sidon-predictor.onnx`	predictor, weight-only int8 per-channel (DequantizeLinear)
`int8/sidon-vocoder.onnx`	DAC decoder (FP16)
`fp16/sidon-predictor.onnx`	w2v-BERT 2.0 (8L) + merged LoRA → features (FP16)
`fp16/sidon-vocoder.onnx`	DAC decoder → 48 kHz audio (FP16)
`fp32/sidon-predictor.onnx`	predictor (FP32, bit-exact)
`fp32/sidon-vocoder.onnx`	DAC decoder (FP32)

Quality (no-reference MOS)

FP32 is bit-exact to the PyTorch reference; FP16 is near-lossless (waveform cos 0.999).

audio	DNSMOS OVRL	UTMOS	speaker cos
input (reverberant)	2.90	2.99	—
restored (FP32 ≈ FP16)	3.29	3.40	0.795

Restoration lifts DNSMOS OVRL from 2.90 to 3.29, driven by the background-noise score (BAK) rising from 3.40 to 4.11 as the reverb is removed.

Front-end

The graphs take input_features [1, T, 160] produced by the w2v-BERT 2.0 SeamlessM4T feature extractor from 16 kHz audio. The time axis T is dynamic, so long inputs are chunked at runtime rather than in the graph. The front-end and the chunking are handled by the consuming engine (speech-core).
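Because the time axis is dynamic, a caller that does not use speech-core has to split long feature sequences itself. The sketch below shows one simple way to walk a [1, T, 160] array in fixed-size, optionally overlapping windows. The function name, chunk length, and overlap are hypothetical; how speech-core actually chunks and stitches the output is not described here.

```python
import numpy as np

def split_time_chunks(features, chunk_frames, overlap_frames=0):
    """Yield (start_frame, chunk) pairs over axis 1 of a [1, T, D] array."""
    if chunk_frames <= 0:
        raise ValueError("chunk_frames must be positive")
    if not 0 <= overlap_frames < chunk_frames:
        raise ValueError("overlap_frames must be in [0, chunk_frames)")
    total = features.shape[1]
    step = chunk_frames - overlap_frames
    start = 0
    while start < total:
        end = min(start + chunk_frames, total)
        yield start, features[:, start:end, :]
        if end == total:
            break  # the last window reached the end of the sequence
        start += step

# Example with placeholder features (zeros stand in for real log-mel frames).
feats = np.zeros((1, 250, 160), dtype=np.float32)
for start, chunk in split_time_chunks(feats, chunk_frames=100, overlap_frames=20):
    print(start, chunk.shape)
```

Each chunk can be passed through the predictor and vocoder separately. Blending the overlapping audio regions (for example with a cross-fade) is left out of this sketch.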

Usage

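import numpy as np
import onnxruntime as ort

# Load both stages (FP16 shown; the int8/ and fp32/ folders use the same file names).
pred = ort.InferenceSession("fp16/sidon-predictor.onnx")
voc = ort.InferenceSession("fp16/sidon-vocoder.onnx")

# input_features: [1, T, 160] array from the w2v-BERT 2.0 feature extractor (16 kHz);
# it must be computed by the front-end before this point.
feats = pred.run(None, {"input_features": input_features})[0]  # cleansed features [1, T, 1024]
audio = voc.run(None, {"features": feats})[0]  # [1, M] @ 48 kHz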
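The snippet above does not say which dtype each graph expects, and the answer may differ between the fp32, fp16, and int8 folders. One defensive option is to read the declared input types from the session and cast the arrays to match. The helper below is a hypothetical sketch: `cast_feeds` is not part of ONNX Runtime, and it relies only on `InferenceSession.get_inputs()`, whose entries expose `name` and `type` (a string such as `tensor(float16)`).

```python
import numpy as np

# Subset of the type strings ONNX Runtime reports for session inputs.
_ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(double)": np.float64,
}

def cast_feeds(session, feeds):
    """Return a copy of `feeds` with each array cast to the declared input dtype."""
    declared = {arg.name: arg.type for arg in session.get_inputs()}
    out = {}
    for name, value in feeds.items():
        if name not in declared:
            raise KeyError(f"session has no input named {name!r}")
        arr = np.asarray(value)
        dtype = _ORT_TO_NUMPY.get(declared[name])
        # Unknown type strings are passed through unchanged.
        out[name] = arr if dtype is None else arr.astype(dtype, copy=False)
    return out
```

With the sessions from the usage snippet this would be called as `pred.run(None, cast_feeds(pred, {"input_features": input_features}))[0]`, and likewise for the vocoder.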

Source

Exported from Sidon (sarulab-speech), checkpoint sidon-v0.1; paper arXiv:2509.17052. Base SSL encoder: facebook/w2v-bert-2.0; vocoder: DAC (descript-audio-codec). All components are MIT-licensed.