Sidon: Core ML (speech restoration / dereverberation)
On-device speech restoration (denoising, dereverberation, and bandwidth restoration) for Apple Silicon, exported to Core ML so that it runs on the Neural Engine. It turns a noisy or reverberant clip into studio-quality 48 kHz speech. Because it preserves speaker identity, it is well suited to cleaning a voice-cloning reference before TTS.
Two-stage pipeline:
16 kHz audio → [w2v-BERT log-mel front-end] → predictor (w2v-BERT 2.0, 8 layers)
→ cleansed features [1, T, 1024] → DAC decoder → 48 kHz audio
Variants
| variant | predictor | vocoder | bundle size | peak RAM | RTF (ANE) |
|---|---|---|---|---|---|
| fp16 | FP16 | FP16 | 713 MB | 1711 MB | ~120× |
| int8 | INT8 (k-means palettized) | FP16 | 407 MB | 1321 MB | ~110× |
The model has 246 M parameters in total (predictor 193.6 M + DAC vocoder 52.4 M) and outputs 48 kHz audio. The int8 variant keeps the vocoder at FP16 to protect audio quality; only the predictor is palettized.
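The table figures can be turned into rough expectations for a given clip. This is a back-of-the-envelope sketch, assuming "RTF ~120×" means audio is processed about 120 times faster than real time; the helper names are illustrative and not part of any SDK.

```python
# Numbers copied from the Variants table; helper names are hypothetical.
VARIANTS = {
    "fp16": {"bundle_mb": 713, "peak_ram_mb": 1711, "speedup": 120},
    "int8": {"bundle_mb": 407, "peak_ram_mb": 1321, "speedup": 110},
}


def processing_seconds(audio_seconds, variant):
    """Approximate wall-clock time, assuming speedup = audio time / compute time."""
    return audio_seconds / VARIANTS[variant]["speedup"]


def percent_saved(metric):
    """Relative reduction of int8 versus fp16 for one table column, in percent."""
    fp16 = VARIANTS["fp16"][metric]
    int8 = VARIANTS["int8"][metric]
    return 100.0 * (fp16 - int8) / fp16


if __name__ == "__main__":
    for name in VARIANTS:
        ms = 1000.0 * processing_seconds(10.0, name)
        print(f"{name}: ~{ms:.0f} ms for a 10 s clip")
    print(f"bundle size saved by int8: {percent_saved('bundle_mb'):.1f}%")
    print(f"peak RAM saved by int8: {percent_saved('peak_ram_mb'):.1f}%")
```

Under that reading, both variants process a 10 s clip in well under a second, so the choice between them is mostly about download size and memory rather than speed.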
Files
| path | description |
|---|---|
| fp16/Sidon-Predictor.mlpackage | w2v-BERT 2.0 (8L) + merged LoRA → features (FP16) |
| fp16/Sidon-Vocoder.mlpackage | DAC decoder → 48 kHz audio (FP16) |
| int8/Sidon-Predictor.mlpackage | predictor, 8-bit palettized |
| int8/Sidon-Vocoder.mlpackage | DAC decoder (FP16) |
Quality (no-reference MOS, 10 s clip)
DNSMOS P.835 (SIG/BAK/OVRL, higher = better) and UTMOS (naturalness, 1 to 5):
| audio | SIG | BAK | OVRL | UTMOS | speaker cos |
|---|---|---|---|---|---|
| input (reverberant) | 3.46 | 3.40 | 2.90 | 2.99 | n/a |
| fp16 | 3.53 | 4.09 | 3.28 | 3.32 | 0.797 |
| int8 | 3.54 | 4.11 | 3.29 | 3.23 | 0.796 |
Restoration lifts OVRL 2.90 → 3.29, driven by BAK 3.40 → 4.11 (reverb removed). Quantization is near-lossless on DNSMOS and speaker similarity; UTMOS shows a small naturalness cost (fp16 -0.09, int8 -0.17). These numbers come from a single clip; average over a set for a definitive figure.
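The gains relative to the unprocessed input can be read straight off the table. The sketch below computes them, and adds a small helper for the averaging over a clip set that the note above recommends. The function names are illustrative, and the deltas are measured against the input row only, not against any other baseline.

```python
from statistics import mean

# Scores copied from the quality table (single 10 s clip).
SCORES = {
    "input": {"SIG": 3.46, "BAK": 3.40, "OVRL": 2.90, "UTMOS": 2.99},
    "fp16": {"SIG": 3.53, "BAK": 4.09, "OVRL": 3.28, "UTMOS": 3.32},
    "int8": {"SIG": 3.54, "BAK": 4.11, "OVRL": 3.29, "UTMOS": 3.23},
}


def delta_vs_input(variant):
    """Per-metric change of a restored variant relative to the unprocessed input."""
    base = SCORES["input"]
    return {m: round(SCORES[variant][m] - base[m], 2) for m in base}


def average_scores(per_clip):
    """Average a list of per-clip score dicts, metric by metric."""
    metrics = per_clip[0].keys()
    return {m: round(mean(clip[m] for clip in per_clip), 2) for m in metrics}


if __name__ == "__main__":
    for name in ("fp16", "int8"):
        print(name, delta_vs_input(name))
```

Feeding `average_scores` one dict per evaluated clip gives set-level numbers that are less sensitive to the content of any single recording.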
Front-end
The graphs take input_features [1, T, 160] from the w2v-BERT 2.0 SeamlessM4T
feature extractor (16 kHz input). The sequence length is fixed (T = 499 ≈ 10 s), so
longer audio must be chunked in the runtime. The front-end and chunking are handled by
speech-swift.
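To illustrate where T = 499 comes from and what fixed-length chunking involves, here is a minimal sketch. It assumes the front-end uses a 25 ms window, a 10 ms hop, and stacks two 80-bin mel frames into each 160-dim frame; these values are not stated in this document, but they reproduce T = 499 for a 10 s clip. The chunking shown (non-overlapping windows, zero-padded tail) is the simplest possible scheme and is not necessarily what speech-swift does.

```python
SAMPLE_RATE = 16_000   # front-end input rate, from the text above
CHUNK_SECONDS = 10     # fixed window length, from the text above

# Assumed front-end geometry (not stated in this document):
WIN = 400              # 25 ms analysis window at 16 kHz
HOP = 160              # 10 ms hop at 16 kHz
STRIDE = 2             # two mel frames stacked into one 160-dim frame


def num_feature_frames(num_samples):
    """Stacked frame count T for a clip, under the assumed geometry."""
    if num_samples < WIN:
        return 0
    mel_frames = 1 + (num_samples - WIN) // HOP
    return mel_frames // STRIDE  # a leftover odd frame is dropped


def chunk_audio(samples, chunk_len=SAMPLE_RATE * CHUNK_SECONDS):
    """Split samples into fixed-length chunks, zero-padding the last one.

    Returns (chunks, valid_lengths) so padded output can be trimmed later.
    """
    chunks, valid = [], []
    for start in range(0, len(samples), chunk_len):
        piece = list(samples[start:start + chunk_len])
        valid.append(len(piece))
        piece.extend([0.0] * (chunk_len - len(piece)))
        chunks.append(piece)
    return chunks, valid


if __name__ == "__main__":
    print("T for 10 s:", num_feature_frames(SAMPLE_RATE * CHUNK_SECONDS))
    chunks, valid = chunk_audio([0.0] * (25 * SAMPLE_RATE))
    print(len(chunks), "chunks, valid lengths:", valid)
```

Since the output is 48 kHz and the input 16 kHz, each valid input sample corresponds to three output samples, which is how the padded tail of the last chunk could be trimmed after decoding.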
Usage
Use via the speech-swift Apple SDK, e.g.:
speech enhance noisy-reference.wav -o clean.wav # restore / dereverb on-device
// See speech-swift for the full API (loads the predictor + vocoder, runs the
// log-mel front-end, chunks, and writes 48 kHz audio).
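For cleaning a whole folder of reference clips, the CLI call above can be scripted. The following is a sketch only: it assumes the `speech` binary is on the PATH and uses exactly the `speech enhance <input> -o <output>` form shown above, with no other flags. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def enhance_command(src, dst):
    """Build the argv for one file, mirroring the CLI call shown above."""
    return ["speech", "enhance", str(src), "-o", str(dst)]


def enhance_folder(in_dir, out_dir, run=subprocess.run):
    """Run the CLI on every .wav in in_dir; returns the output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(Path(in_dir).glob("*.wav")):
        dst = out_dir / src.name
        # check=True raises CalledProcessError if the CLI exits non-zero.
        run(enhance_command(src, dst), check=True)
        outputs.append(dst)
    return outputs
```

Passing a different `run` callable makes it possible to dry-run the loop and inspect the commands without invoking the CLI.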
Source
Exported from Sidon (sarulab-speech), checkpoint sidon-v0.1; paper arXiv:2509.17052. Base SSL encoder: facebook/w2v-bert-2.0; vocoder: DAC (descript-audio-codec). All components are MIT-licensed.
Links
- speech-swift: Apple (Swift) SDK
- soniqo.audio: website
- blog