Sidon — Core ML (speech restoration / dereverberation)

On-device speech restoration (denoise + dereverberation + bandwidth restoration) for Apple Silicon, exported to Core ML (runs on the Neural Engine). Turns a noisy/reverberant clip into studio-quality 48 kHz speech — ideal for cleaning a voice-cloning reference before TTS, since it preserves speaker identity.

Two-stage pipeline:

16 kHz audio → [w2v-BERT log-mel front-end] → predictor (w2v-BERT 2.0, 8 layers)
            → cleansed features [1, T, 1024] → DAC decoder → 48 kHz audio
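The tensor shapes in the diagram can be traced with a little arithmetic. The sketch below assumes the usual SeamlessM4T front-end settings (25 ms window, 10 ms hop, 80 mel bins, pairs of frames stacked into 160-dimensional vectors). Those constants and the helper names are assumptions for illustration, not values read from the exported graphs.

```python
SAMPLE_RATE_IN = 16_000   # front-end input rate (from the pipeline diagram)
WIN = 400                 # assumed 25 ms analysis window at 16 kHz
HOP = 160                 # assumed 10 ms hop at 16 kHz
N_MELS = 80               # assumed mel bins per frame
STRIDE = 2                # assumed frame stacking: 2 x 80 mel bins -> 160 dims
FEATURE_DIM = 1024        # cleansed feature width (from the pipeline diagram)


def num_feature_frames(num_samples: int) -> int:
    """Frames T seen by the predictor for a clip of `num_samples` samples."""
    if num_samples < WIN:
        return 0
    fbank_frames = 1 + (num_samples - WIN) // HOP   # Kaldi-style, no edge padding
    return fbank_frames // STRIDE                   # an odd trailing frame is dropped


def trace_shapes(seconds: float) -> dict:
    """Shapes entering and leaving the predictor for a clip of `seconds` seconds."""
    t = num_feature_frames(int(seconds * SAMPLE_RATE_IN))
    return {
        "input_features": (1, t, N_MELS * STRIDE),
        "cleansed_features": (1, t, FEATURE_DIM),
    }


print(trace_shapes(10.0))
```

Under these assumptions a 10 s clip yields T = 499, which matches the fixed sequence length of the exported graphs.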

Variants

variant	predictor	vocoder	bundle size	peak RAM	RTF (ANE)
fp16	FP16	FP16	713 MB	1711 MB	~120×
int8	INT8 (k-means palettized)	FP16	407 MB	1321 MB	~110×

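Total: 246 M parameters (predictor 193.6 M + DAC vocoder 52.4 M). Output sample rate is 48 kHz. The int8 variant keeps the vocoder at FP16 to protect audio quality; only the predictor is palettized. The RTF column is given as a multiple of real time (~120× means roughly 120 s of audio processed per second), so higher is faster.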
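To turn the speed column into a latency budget, divide the audio duration by the factor. This is a rough sketch that assumes the "~120×" and "~110×" figures mean audio duration divided by processing time; the helper name is hypothetical.

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Estimated wall-clock time, assuming speedup = audio duration / processing time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Approximate factors copied from the Variants table.
SPEEDUP = {"fp16": 120.0, "int8": 110.0}

for variant, factor in SPEEDUP.items():
    ms = processing_seconds(10.0, factor) * 1000
    print(f"{variant}: ~{ms:.0f} ms per 10 s chunk")
```

Both variants come out under 100 ms per 10 s chunk, so the choice between them is mostly about bundle size and peak RAM rather than speed.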

Files

path	description
`fp16/Sidon-Predictor.mlpackage`	w2v-BERT 2.0 (8L) + merged LoRA → features (FP16)
`fp16/Sidon-Vocoder.mlpackage`	DAC decoder → 48 kHz audio (FP16)
`int8/Sidon-Predictor.mlpackage`	predictor, 8-bit palettized
`int8/Sidon-Vocoder.mlpackage`	DAC decoder (FP16)

Quality (no-reference MOS, 10 s clip)

DNSMOS P.835 (SIG/BAK/OVRL, higher = better) and UTMOS (naturalness, 1–5):

audio	SIG	BAK	OVRL	UTMOS	speaker cos
input (reverberant)	3.46	3.40	2.90	2.99	—
fp16	3.53	4.09	3.28	3.32	0.797
int8	3.54	4.11	3.29	3.23	0.796

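Restoration lifts OVRL from 2.90 to 3.29 (int8), driven mainly by BAK rising from 3.40 to 4.11 as reverberation is removed. Quantization is near-lossless on DNSMOS and speaker similarity (0.797 vs 0.796). UTMOS shows a small naturalness cost (fp16 −0.09, int8 −0.17); the baseline for these two deltas is not listed in the table, where int8 trails fp16 by 0.09 and both variants score above the input (2.99). These numbers come from a single clip, so average over a test set for a definitive figure.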
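The per-metric changes can be recomputed directly from the table. The snippet below is only a convenience sketch: the scores are copied from the rows above and the `delta` helper is a hypothetical name.

```python
# Scores copied from the quality table (single 10 s clip).
SCORES = {
    "input": {"SIG": 3.46, "BAK": 3.40, "OVRL": 2.90, "UTMOS": 2.99},
    "fp16":  {"SIG": 3.53, "BAK": 4.09, "OVRL": 3.28, "UTMOS": 3.32},
    "int8":  {"SIG": 3.54, "BAK": 4.11, "OVRL": 3.29, "UTMOS": 3.23},
}


def delta(variant: str, baseline: str) -> dict:
    """Per-metric score change of `variant` relative to `baseline`, rounded to 2 dp."""
    return {
        metric: round(SCORES[variant][metric] - SCORES[baseline][metric], 2)
        for metric in SCORES[baseline]
    }


print("fp16 vs input:", delta("fp16", "input"))
print("int8 vs input:", delta("int8", "input"))
print("int8 vs fp16: ", delta("int8", "fp16"))
```

The same function applies unchanged to scores averaged over a larger evaluation set.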

Front-end

The predictor graphs take `input_features` of shape [1, T, 160] from the w2v-BERT 2.0 (SeamlessM4T) feature extractor, computed on 16 kHz input. The sequence length is fixed at T = 499 frames (≈ 10 s), so longer audio must be split into chunks by the runtime. speech-swift handles both the front-end and the chunking.
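For readers writing their own runtime, fixed-length chunking can be sketched as follows. It assumes non-overlapping chunks of 160,000 samples (10 s at 16 kHz) with the final chunk zero-padded. This is an illustrative strategy, not a description of how speech-swift chunks audio, and `plan_chunks` is a hypothetical helper.

```python
CHUNK_SAMPLES = 160_000   # assumed: 10 s at 16 kHz, matching the fixed T = 499


def plan_chunks(num_samples: int, chunk_samples: int = CHUNK_SAMPLES):
    """Return (start, end, pad) triples covering a clip with fixed-size chunks.

    `pad` is the number of zero samples needed to fill the chunk to full length.
    """
    if num_samples <= 0:
        return []
    chunks = []
    for start in range(0, num_samples, chunk_samples):
        end = min(start + chunk_samples, num_samples)
        chunks.append((start, end, chunk_samples - (end - start)))
    return chunks


# A 25 s clip at 16 kHz needs three chunks; half of the last one is padding.
for start, end, pad in plan_chunks(25 * 16_000):
    print(start, end, pad)
```

After inference, the output for a padded chunk would need trimming back to the real duration. Hard chunk boundaries may also cause audible seams, which an overlapping scheme with crossfading could reduce.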

Usage

Use via the speech-swift Apple SDK, e.g.:

speech enhance noisy-reference.wav -o clean.wav   # restore / dereverb on-device

See speech-swift for the full API: it loads the predictor and vocoder, runs the log-mel front-end, chunks the audio, and writes 48 kHz output.
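To clean a whole folder of reference clips, the command above can be wrapped in a small script. The sketch below assumes the `speech` binary is on PATH and uses only the `enhance <input> -o <output>` form shown above. The function names and the `dry_run` switch are hypothetical and not part of speech-swift.

```python
import subprocess
from pathlib import Path


def enhance_command(src: Path, dst: Path) -> list[str]:
    """Argument list mirroring `speech enhance <in> -o <out>` from the usage line."""
    return ["speech", "enhance", str(src), "-o", str(dst)]


def enhance_folder(in_dir: Path, out_dir: Path, dry_run: bool = True) -> list[list[str]]:
    """Build one enhance command per .wav file in `in_dir`; run them unless dry_run."""
    commands = [
        enhance_command(wav, out_dir / wav.name)
        for wav in sorted(in_dir.glob("*.wav"))
    ]
    if not dry_run:
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            subprocess.run(cmd, check=True)   # raises if the CLI exits non-zero
    return commands
```

Calling `enhance_folder(Path("refs"), Path("refs-clean"))` with the default `dry_run=True` only returns the command lists, which makes it easy to inspect them before running anything.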

Source

Exported from Sidon (sarulab-speech), checkpoint sidon-v0.1; paper arXiv:2509.17052. Base SSL encoder: facebook/w2v-bert-2.0; vocoder: DAC (descript-audio-codec). All components are MIT-licensed.