Sidon: Core ML (speech restoration / dereverberation)

On-device speech restoration (denoising, dereverberation, and bandwidth restoration) for Apple Silicon, exported to Core ML and run on the Neural Engine. It turns a noisy or reverberant clip into studio-quality 48 kHz speech. Because it preserves speaker identity, it is well suited to cleaning a voice-cloning reference before TTS.

Two-stage pipeline:

16 kHz audio → [w2v-BERT log-mel front-end] → predictor (w2v-BERT 2.0, 8 layers)
            → cleansed features [1, T, 1024] → DAC decoder → 48 kHz audio

Variants

| variant | predictor | vocoder | bundle size | peak RAM | RTF (ANE) |
|---|---|---|---|---|---|
| fp16 | FP16 | FP16 | 713 MB | 1711 MB | ~120× |
| int8 | INT8 (k-means palettized) | FP16 | 407 MB | 1321 MB | ~110× |

Total 246 M params (predictor 193.6 M + DAC vocoder 52.4 M). Output sample rate 48 kHz. int8 keeps the vocoder at FP16 (audio quality); only the predictor is palettized.
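
To put the RTF column in concrete terms, the sketch below converts a speed-up factor into an approximate processing time per clip. It assumes that "~120×" means roughly 120 times faster than real time; the helper name `processing_seconds` is illustrative and not part of any SDK.

```python
def processing_seconds(audio_seconds: float, rtf_speedup: float) -> float:
    """Approximate wall-clock time to restore a clip.

    Assumes the table's RTF figure is a speed-up over real time
    (e.g. 120 means 120 s of audio per second of compute).
    """
    if rtf_speedup <= 0:
        raise ValueError("rtf_speedup must be positive")
    return audio_seconds / rtf_speedup


# Speed-up figures taken from the Variants table (ANE, approximate).
for variant, speedup in {"fp16": 120.0, "int8": 110.0}.items():
    ms = processing_seconds(10.0, speedup) * 1000
    print(f"{variant}: ~{ms:.0f} ms per 10 s chunk")
```

Under that reading, one 10 s chunk takes on the order of 83 ms (fp16) or 91 ms (int8).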

Files

| path | description |
|---|---|
| fp16/Sidon-Predictor.mlpackage | w2v-BERT 2.0 (8L) + merged LoRA → features (FP16) |
| fp16/Sidon-Vocoder.mlpackage | DAC decoder → 48 kHz audio (FP16) |
| int8/Sidon-Predictor.mlpackage | predictor, 8-bit palettized |
| int8/Sidon-Vocoder.mlpackage | DAC decoder (FP16) |

Quality (no-reference MOS, 10 s clip)

DNSMOS P.835 (SIG/BAK/OVRL, higher = better) and UTMOS (naturalness, 1–5):

| audio | SIG | BAK | OVRL | UTMOS | speaker cos |
|---|---|---|---|---|---|
| input (reverberant) | 3.46 | 3.40 | 2.90 | 2.99 | n/a |
| fp16 | 3.53 | 4.09 | 3.28 | 3.32 | 0.797 |
| int8 | 3.54 | 4.11 | 3.29 | 3.23 | 0.796 |

Restoration lifts OVRL from 2.90 to 3.29 (int8), driven mainly by BAK rising from 3.40 to 4.11 as the reverb is removed. Quantization is near-lossless on DNSMOS and speaker similarity; UTMOS shows a small naturalness cost (reported as fp16 −0.09, int8 −0.17; in the table, int8 sits 0.09 below fp16). These numbers come from a single clip, so average over a set of clips for a definitive figure.
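
The per-metric changes can be recomputed directly from the table. The following is a minimal sketch; `SCORES` and `delta` are illustrative names, and the values are copied from the single-clip table.

```python
# Scores copied from the Quality table (single 10 s clip).
SCORES = {
    "input": {"SIG": 3.46, "BAK": 3.40, "OVRL": 2.90, "UTMOS": 2.99},
    "fp16":  {"SIG": 3.53, "BAK": 4.09, "OVRL": 3.28, "UTMOS": 3.32},
    "int8":  {"SIG": 3.54, "BAK": 4.11, "OVRL": 3.29, "UTMOS": 3.23},
}


def delta(a: str, b: str) -> dict[str, float]:
    """Per-metric difference b - a, rounded to two decimals."""
    return {m: round(SCORES[b][m] - SCORES[a][m], 2) for m in SCORES[a]}


for variant in ("fp16", "int8"):
    print(f"{variant} vs input:", delta("input", variant))
print("int8 vs fp16:", delta("fp16", "int8"))
```

Comparing int8 against fp16 isolates the effect of palettizing the predictor, since both variants share the same FP16 vocoder.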

Front-end

The graphs take input_features [1, T, 160] from the w2v-BERT 2.0 SeamlessM4T feature extractor (16 kHz input). The sequence length is fixed (T = 499 ≈ 10 s), so longer audio must be chunked in the runtime. The front-end and chunking are handled by speech-swift.
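
The fixed sequence length can be related to sample counts with a little arithmetic. The sketch below assumes the extractor's usual settings (25 ms window and 10 ms hop at 16 kHz, 80 mel bins, adjacent frames stacked in pairs to give 160 features). Those settings and the helper names are assumptions for illustration, and the actual chunking in speech-swift may differ, for example by overlapping chunks.

```python
SAMPLE_RATE = 16_000
WIN = 400        # 25 ms analysis window at 16 kHz (assumed)
HOP = 160        # 10 ms hop at 16 kHz (assumed)
STRIDE = 2       # adjacent frames stacked: 2 x 80 mel bins = 160 features
T_FIXED = 499    # fixed sequence length of the exported graphs


def stacked_frames(num_samples: int) -> int:
    """Number of stacked feature frames produced from num_samples of audio."""
    if num_samples < WIN:
        return 0
    mel_frames = 1 + (num_samples - WIN) // HOP
    return mel_frames // STRIDE


def chunk_spans(total_samples: int, chunk_samples: int = 10 * SAMPLE_RATE):
    """Split audio into non-overlapping (start, end, pad) spans.

    `pad` is the number of zero samples needed to fill the final chunk.
    """
    spans = []
    for start in range(0, total_samples, chunk_samples):
        end = min(start + chunk_samples, total_samples)
        spans.append((start, end, chunk_samples - (end - start)))
    return spans


print(stacked_frames(10 * SAMPLE_RATE))  # frames in one 10 s chunk
print(chunk_spans(25 * SAMPLE_RATE))     # spans for a 25 s clip
```

With these assumed settings a 10 s chunk (160,000 samples) yields exactly 499 stacked frames, which matches the fixed T; a 25 s clip splits into three chunks, the last padded by 5 s.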

Usage

Use via the speech-swift Apple SDK, e.g.:

speech enhance noisy-reference.wav -o clean.wav   # restore / dereverb on-device
// See speech-swift for the full API (loads the predictor + vocoder, runs the
// log-mel front-end, chunks, and writes 48 kHz audio).
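
For batch clean-up of several reference clips, the command above can be driven from a script. This is a sketch that assumes the `speech` CLI is installed and on PATH and uses only the `enhance` subcommand and `-o` flag shown above; the function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def enhance_command(src: Path, dst: Path) -> list[str]:
    """Build the `speech enhance` invocation shown above for one file."""
    return ["speech", "enhance", str(src), "-o", str(dst)]


def enhance_all(in_dir: Path, out_dir: Path) -> list[Path]:
    """Restore every .wav in in_dir, writing results to out_dir."""
    if shutil.which("speech") is None:
        raise FileNotFoundError("`speech` CLI not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.wav")):
        dst = out_dir / src.name
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(enhance_command(src, dst), check=True)
        written.append(dst)
    return written


print(enhance_command(Path("noisy-reference.wav"), Path("clean.wav")))
```

Keeping the command construction in its own function makes it easy to inspect or log the exact invocation before anything is run.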

Source

Exported from Sidon (sarulab-speech), checkpoint sidon-v0.1; paper arXiv:2509.17052. Base SSL encoder: facebook/w2v-bert-2.0; vocoder: DAC (descript-audio-codec). All components are MIT-licensed.

