ONNX artifacts for the FACodec (NaturalSpeech 3) voice-conversion engine, part of vconnx.
Upstream weights: Apache-2.0 — amphion/naturalspeech3_facodec.
Code: Amphion (open-mmlab/Amphion) — Apache-2.0.
ONNX exports: Apache-2.0 (same upstream license, stated here per vconnx publish-all policy).
FACodec (Ju et al., NaturalSpeech 3, ICML 2024) disentangles speech into content, prosody, timbre, and acoustic detail subspaces. Voice conversion is zero-shot: encode source → quantize → swap reference timbre embedding → decode.
| Component | Input → Output | Description |
|---|---|---|
| encoder | wav(1,1,N) → enc_feats(1,256,T) | Convolutional encoder (hop=200) |
| timbre | enc_feats(1,256,T) → spk_embs(1,256) | TransformerEncoder timbre extractor |
| quantize | (enc_feats, mel_20(1,20,T)) → vq_ids(6,1,T) | Factorised VQ: prosody(1) + content(2) + residual(3) |
| decoder | (vq_ids, spk_embs) → wav(1,1,N) | vq2emb + AdaIN + conv decoder |
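
The data flow in the table above can be sketched as a small wiring function. This is a minimal illustration, not the vconnx API: the function name `convert_voice` and its parameters are hypothetical, the four stages are passed in as plain callables (named after the components in the parity table), and the 20-band mel input is taken as a precomputed argument because its extraction settings are not given here.

```python
from typing import Any, Callable

Array = Any  # stand-in for numpy.ndarray, so the sketch has no dependencies


def convert_voice(
    source_wav: Array,
    reference_wav: Array,
    source_mel: Array,
    encoder: Callable[[Array], Array],
    timbre: Callable[[Array], Array],
    quantize: Callable[[Array, Array], Array],
    decoder: Callable[[Array, Array], Array],
) -> Array:
    """Zero-shot conversion: keep the source codes, swap in the reference timbre."""
    # Source path: wav(1,1,N) -> enc_feats(1,256,T) -> vq_ids(6,1,T)
    source_feats = encoder(source_wav)
    vq_ids = quantize(source_feats, source_mel)

    # Reference path: wav -> enc_feats -> spk_embs(1,256)
    reference_feats = encoder(reference_wav)
    reference_spk = timbre(reference_feats)

    # Decode the source codes conditioned on the reference speaker embedding.
    return decoder(vq_ids, reference_spk)
```

With ONNX Runtime, each callable could wrap one `InferenceSession.run` call on the corresponding export. The file names and graph input names are not listed in this card, so they are left out of the sketch.
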
| Component | max_abs Δ | mean_abs Δ | Verdict |
|---|---|---|---|
| encoder | 1.62e-05 | 2.36e-06 | PASS |
| timbre | 1.43e-06 | 6.40e-08 | PASS |
| quantize | exact int64 match | — | PASS |
| decoder | 7.50e-09 | 1.46e-09 | PASS |
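
For readers who want to reproduce figures of this kind, the two metrics in the table can be computed as below. This is a sketch assuming NumPy and two already-computed outputs for the same input; the helper name `parity` is hypothetical, and the tolerance used for the PASS verdict is not stated in this card, so none is applied.

```python
import numpy as np


def parity(reference, candidate):
    """Compare two outputs of the same component on the same input."""
    ref = np.asarray(reference)
    cand = np.asarray(candidate)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")

    # Integer outputs (e.g. int64 codebook indices) are checked for exact equality.
    if np.issubdtype(ref.dtype, np.integer):
        return {"exact": bool(np.array_equal(ref, cand))}

    # Float outputs: element-wise absolute difference, accumulated in float64.
    diff = np.abs(ref.astype(np.float64) - cand.astype(np.float64))
    return {"max_abs": float(diff.max()), "mean_abs": float(diff.mean())}
```
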
| File | Size |
|---|---|
| (fp32) | 16.5 MB |
| (INT8) | 4.7 MB |
| (fp32) | 33.0 MB |
| (INT8) | 12.1 MB |
| (fp32) | 33.4 MB |
| (INT8) | 12.5 MB |
| (fp32) | 66.2 MB |
| (INT8) | 36.8 MB |
Tested with an edge-tts source voice (en-US-GuyNeural) converted to two reference voices (en-US-AriaNeural, en-GB-SoniaNeural). Word error rate (WER) is measured on Whisper base.en transcriptions.
| Reference voice | WER | Gate |
|---|---|---|
| en-US-AriaNeural | 0% | ✓ PASS |
| en-GB-SoniaNeural | 0% | ✓ PASS |
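
WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation script used for the table: the function name is hypothetical, and the only text normalisation is lowercasing and whitespace splitting, which is an assumption (a real gate would likely also strip punctuation from the Whisper output).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalised by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A result of 0.0 corresponds to the 0% rows above: the transcript of the converted audio matches the reference word for word.
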