TigreGotico/vconnx-chatterbox

ONNX artifacts for the Chatterbox VC (voice-conversion) engine used by vconnx.

Provenance

Derived from onnx-community/chatterbox-onnx (Apache-2.0) via the vconnx export pipeline. Only the VC path is present — the LLM / tokenizer components are not included because the VC pipeline does not require them.

Original model: Chatterbox by Resemble AI (resemble-ai/chatterbox).

Files

File	Size	Variant	Notes
`speech_encoder.onnx` + `.onnx_data`	~565 MB	fp32	From onnx-community upstream
`speech_encoder_q8.onnx`	~216 MB	INT8	Dynamic quantization (MatMul only)
`conditional_decoder.onnx` + `.onnx_data`	~516 MB	fp32	From onnx-community upstream
`conditional_decoder_q8.onnx`	~252 MB	INT8	Dynamic quantization (MatMul only, If-subgraph excluded)

Total fp32: ~1081 MB → Total INT8: ~468 MB (57% reduction)
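
The totals and the reduction figure follow directly from the per-file sizes in the table. A minimal check, using the approximate sizes listed above (the dictionary names are illustrative only):

```python
# Approximate sizes in MB, copied from the Files table above.
FP32_MB = {"speech_encoder": 565, "conditional_decoder": 516}
INT8_MB = {"speech_encoder_q8": 216, "conditional_decoder_q8": 252}

total_fp32 = sum(FP32_MB.values())
total_int8 = sum(INT8_MB.values())

# Relative size reduction when switching from fp32 to INT8, in percent.
reduction_pct = 100 * (1 - total_int8 / total_fp32)

print(f"fp32: {total_fp32} MB, INT8: {total_int8} MB, "
      f"reduction: {reduction_pct:.0f}%")
```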

Quantization notes

speech_encoder_q8.onnx:

Quantized via onnxruntime.quantization.quantize_dynamic (MatMul ops only).
The two Gemm nodes in the S3 VQ codebook (project_down) are kept away from the quantizer's own Gemm handling, because quantize_dynamic decomposes Gemm(transB=1) without transposing the weight, a known ORT preprocessing bug that corrupts the resulting MatMul node. Instead, a pre-quantization patch transposes the weight and rewrites those nodes as MatMul+Add before the quantizer runs.
Discrete token outputs match fp32 at runtime (token-selection is robust to the remaining float error in the continuous embeddings).
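
Why the transpose matters can be shown numerically. The sketch below is not the vconnx patch itself; it is a small pure-Python illustration, assuming alpha = beta = 1 and transA = 0, of the identity the patch relies on: Gemm(transB=1) computes A · Wᵀ + C, so the equivalent MatMul+Add must use the transposed weight. All helper names and values are hypothetical. A square weight is used so that the untransposed product is still shape-valid and the error is silent; with a non-square weight the shapes would not match at all.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain (n x k) @ (k x m) matrix product."""
    assert len(a[0]) == len(b), "inner dimensions must agree"
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_bias(m, bias):
    """Add a 1-D bias to every row of m."""
    return [[v + c for v, c in zip(row, bias)] for row in m]

def gemm_trans_b(a, w, bias):
    """Reference semantics of Gemm(transB=1) with alpha = beta = 1."""
    return add_bias(matmul(a, transpose(w)), bias)

a = [[1.0, 2.0], [3.0, 4.0]]      # activations, shape (batch, in)
w = [[1.0, 2.0], [0.0, 1.0]]      # weight stored as (out, in), as Gemm(transB=1) expects
bias = [0.5, -0.5]

expected = gemm_trans_b(a, w, bias)

# Correct rewrite: transpose the weight once, then MatMul + Add.
patched = add_bias(matmul(a, transpose(w)), bias)

# Faulty rewrite: MatMul + Add on the untransposed weight.
broken = add_bias(matmul(a, w), bias)

print("Gemm reference:  ", expected)
print("MatMul+Add (W^T):", patched)
print("MatMul+Add (W):  ", broken)
```

The first two results agree, while the third differs without raising any error, which is the kind of silent corruption described above.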

conditional_decoder_q8.onnx:

Quantized via onnxruntime.quantization.quantize_dynamic (MatMul ops only).
The decoder contains 20 If nodes, each with its own subgraph. The nodes inside those subgraphs are excluded from quantization to avoid a hang in ORT's session initializer.
WER is identical to fp32 on the vconnx reference demo clip (8% vs 8%).

WER results

Measured with faster-whisper base.en on the vconnx reference demo clip (10.7 s, source.wav):

Variant	WER	Size (MB)
fp32	8%	1081
INT8	8%	468
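
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. The sketch below is a generic implementation, not the vconnx scoring script: the function name is hypothetical, and the normalization (lowercasing and whitespace splitting) is an assumption that may differ from what was used for the numbers above. In the setup described here, the hypothesis text would come from faster-whisper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution and one deletion against a six-word reference.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```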

Gate: INT8 is flagged ⚠ only when its WER is above 25% and also more than 15 percentage points worse than fp32. INT8 passes the gate here (8% vs 8%) and is the recommended variant.
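
The gate can be written as a single predicate. This is a minimal sketch of the rule as stated, with a hypothetical function name and parameter names; the thresholds are the ones quoted above.

```python
def int8_flagged(wer_int8_pct: float, wer_fp32_pct: float,
                 abs_limit_pct: float = 25.0,
                 gap_limit_pp: float = 15.0) -> bool:
    """Return True when the INT8 variant should be flagged.

    Both conditions must hold: the INT8 WER exceeds the absolute limit,
    and it is worse than fp32 by more than the allowed gap
    (in percentage points).
    """
    too_high = wer_int8_pct > abs_limit_pct
    much_worse = (wer_int8_pct - wer_fp32_pct) > gap_limit_pp
    return too_high and much_worse

# Values from the WER table above: 8% for both variants.
print(int8_flagged(8.0, 8.0))
```

Because the two conditions are combined with AND, a model whose fp32 WER is already high is not flagged merely for being above 25%, and a small absolute WER is never flagged however large the relative gap.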

Usage

Install vconnx and use the chatterbox engine with quantized=True:

from vconnx import VoiceCloner

cloner = VoiceCloner(engine="chatterbox", quantized=True)
result = cloner.clone_voice("source.wav", "reference.wav", "output.wav")

License

Apache-2.0 (inherited from upstream onnx-community/chatterbox-onnx).

