TigreGotico/vconnx-facodec

ONNX artifacts for the FACodec (NaturalSpeech 3) voice-conversion engine, part of vconnx.

License

Upstream weights: amphion/naturalspeech3_facodec (Apache-2.0).
Code: Amphion (open-mmlab/Amphion) — Apache-2.0.
ONNX exports: Apache-2.0 (same upstream license, stated here per vconnx publish-all policy).

Architecture

FACodec (Ju et al., NaturalSpeech 3, ICML 2024) disentangles speech into content, prosody, timbre, and acoustic detail subspaces. Voice conversion is zero-shot: encode source → quantize → swap reference timbre embedding → decode.

| Component | Input → Output | Description |
|---|---|---|
| encoder | wav(1,1,N) → enc_feats(1,256,T) | Convolutional encoder (hop=200) |
| timbre | enc_feats(1,256,T) → spk_embs(1,256) | TransformerEncoder timbre extractor |
| quantize | (enc_feats, mel_20(1,20,T)) → vq_ids(6,1,T) | Factorised VQ: 6 codebooks = prosody(1) + content(2) + residual(3) |
| decoder | (vq_ids, spk_embs) → wav(1,1,N) | vq2emb + AdaIN + conv decoder |
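The data flow between the four components can be sketched as a plain function. This is a minimal illustration, not the vconnx API: `convert_voice` and its parameter names are hypothetical, each stage is passed in as a callable (in practice probably a thin wrapper around an ONNX Runtime session), and the 20-bin mel input is taken as given because its computation is not described here.

```python
from typing import Any, Callable

Array = Any  # stand-in for the runtime's array type (e.g. numpy.ndarray)


def convert_voice(
    source_wav: Array,
    reference_wav: Array,
    source_mel_20: Array,
    encoder: Callable[[Array], Array],
    timbre: Callable[[Array], Array],
    quantize: Callable[[Array, Array], Array],
    decoder: Callable[[Array, Array], Array],
) -> Array:
    """Zero-shot conversion: keep source content/prosody, swap in reference timbre.

    All names here are hypothetical; only the tensor shapes in the comments
    come from the component table.
    """
    # The same encoder is applied to both utterances.
    src_feats = encoder(source_wav)       # (1, 256, T_src)
    ref_feats = encoder(reference_wav)    # (1, 256, T_ref)

    # Source side: discrete codes for prosody, content and residual detail.
    vq_ids = quantize(src_feats, source_mel_20)  # (6, 1, T_src)

    # Reference side: a single timbre embedding for the target speaker.
    spk_embs = timbre(ref_feats)          # (1, 256)

    # Decode source codes conditioned on the reference timbre.
    return decoder(vq_ids, spk_embs)      # (1, 1, N)
```

Passing the stages as callables keeps the sketch independent of file names and tensor input names, neither of which is specified in this card.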

Parity (fp32 torch vs ORT)

| Component | max_abs Δ | mean_abs Δ | Verdict |
|---|---|---|---|
| encoder | 1.62e-05 | 2.36e-06 | PASS |
| timbre | 1.43e-06 | 6.40e-08 | PASS |
| quantize | exact int64 match | | PASS |
| decoder | 7.50e-09 | 1.46e-09 | PASS |
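For readers who want to reproduce this kind of check, the two statistics can be computed as below. This is a sketch under assumptions: the function names are hypothetical, inputs are flattened to plain Python sequences, and the default tolerance of `1e-4` is illustrative only, since the pass threshold actually used is not stated here.

```python
from typing import Sequence, Tuple


def parity_deltas(
    reference: Sequence[float], candidate: Sequence[float]
) -> Tuple[float, float]:
    """Return (max_abs, mean_abs) of the element-wise difference."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return max(diffs), sum(diffs) / len(diffs)


def parity_verdict(max_abs: float, tolerance: float = 1e-4) -> str:
    """PASS when the worst-case difference is within tolerance.

    The tolerance is an assumed value, not the one used for the table.
    """
    return "PASS" if max_abs <= tolerance else "FAIL"
```

For integer outputs such as `vq_ids`, a plain equality check is the natural counterpart, which is what "exact int64 match" reports.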

Model sizes

| File | Size |
|---|---|
| (fp32) | 16.5 MB |
| (INT8) | 4.7 MB |
| (fp32) | 33.0 MB |
| (INT8) | 12.1 MB |
| (fp32) | 33.4 MB |
| (INT8) | 12.5 MB |
| (fp32) | 66.2 MB |
| (INT8) | 36.8 MB |

Intelligibility (WER gate ≤ 25%)

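Source speech was synthesized with edge-tts (en-US-GuyNeural) and converted to two reference voices (en-US-AriaNeural, en-GB-SoniaNeural). The converted audio was then transcribed with Whisper base.en, and the transcript was scored against the source text.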

| Reference voice | WER | Gate |
|---|---|---|
| en-US-AriaNeural | 0% | ✓ PASS |
| en-GB-SoniaNeural | 0% | ✓ PASS |
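As a reminder of what the gate measures, word error rate is the word-level edit distance between the reference text and the transcript, divided by the number of reference words. The sketch below is a standard Levenshtein implementation with hypothetical function names. It only lowercases and splits on whitespace, which may differ from the text normalization used for the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the ref words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def passes_gate(wer: float, gate: float = 0.25) -> bool:
    """Apply the WER gate of 25% stated in the section heading."""
    return wer <= gate
```

A transcript identical to the reference yields a WER of 0, which corresponds to the 0% rows in the table.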
