TigreGotico/vconnx-chatterbox

ONNX artifacts for the Chatterbox VC (voice-conversion) engine used by vconnx.

Provenance

Derived from onnx-community/chatterbox-onnx (Apache-2.0) via the vconnx export pipeline. Only the VC path is present; the LLM and tokenizer components are omitted because the VC pipeline does not require them.

Original model: Chatterbox by Resemble AI (resemble-ai/chatterbox).

Files

| File | Size | Variant | Notes |
|---|---|---|---|
| speech_encoder.onnx + .onnx_data | ~565 MB | fp32 | From onnx-community upstream |
| speech_encoder_q8.onnx | ~216 MB | INT8 | Dynamic quantization (MatMul only) |
| conditional_decoder.onnx + .onnx_data | ~516 MB | fp32 | From onnx-community upstream |
| conditional_decoder_q8.onnx | ~252 MB | INT8 | Dynamic quantization (MatMul only, If-subgraph excluded) |

Total fp32: ~1 081 MB → Total INT8: ~468 MB (57 % reduction)
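
The totals and the reduction figure can be re-derived from the per-file sizes. This is only a sanity-check sketch; the dictionary names are illustrative and the sizes are the approximate values listed above.

```python
# Approximate file sizes in MB, copied from the Files table.
FP32_MB = {"speech_encoder": 565, "conditional_decoder": 516}
INT8_MB = {"speech_encoder": 216, "conditional_decoder": 252}

total_fp32 = sum(FP32_MB.values())
total_int8 = sum(INT8_MB.values())

# Relative size reduction of the INT8 set compared with the fp32 set.
reduction_pct = 100 * (total_fp32 - total_int8) / total_fp32

print(f"fp32: {total_fp32} MB, INT8: {total_int8} MB, "
      f"reduction: {reduction_pct:.0f} %")
```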

Quantization notes

speech_encoder_q8.onnx:

  • Quantized via onnxruntime.quantization.quantize_dynamic (MatMul ops only).
  • The two Gemm nodes in the S3 VQ codebook (project_down) are excluded from quantization. quantize_dynamic decomposes Gemm(transB=1) without transposing the weight, a known ORT preprocessing bug that corrupts the resulting MatMul node. To avoid this, a pre-quantization patch transposes the weight and rewrites those nodes as MatMul+Add before the quantizer runs.
  • Discrete token outputs match fp32 at runtime (token-selection is robust to the remaining float error in the continuous embeddings).
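
Why the weight must be transposed can be seen in a small numeric sketch. This is plain Python for illustration, not the vconnx patch itself. It assumes the ONNX Gemm definition with alpha = beta = 1, so that Gemm(transB=1) computes Y = A·Bᵀ + C; the helper names and toy matrices are made up for the example.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Row-by-column product of two list-of-lists matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_bias(m, bias):
    return [[v + b for v, b in zip(row, bias)] for row in m]

def gemm_trans_b(a, b, c):
    """ONNX Gemm with transB=1 (alpha = beta = 1): Y = A @ B^T + C."""
    return add_bias(matmul(a, transpose(b)), c)

A = [[1.0, 2.0], [3.0, 4.0]]   # activations, shape (batch, in)
B = [[1.0, 2.0], [0.0, 1.0]]   # Gemm weight, stored as (out, in)
C = [0.5, -1.0]                # bias, shape (out,)

reference = gemm_trans_b(A, B, C)

# Correct rewrite: transpose the weight once, then MatMul + Add.
patched = add_bias(matmul(A, transpose(B)), C)

# Faulty rewrite: MatMul + Add on the untransposed weight.
broken = add_bias(matmul(A, B), C)

assert patched == reference
assert broken != reference
```

The weight here is square, so the faulty rewrite produces wrong values silently instead of failing on a shape mismatch.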

conditional_decoder_q8.onnx:

  • Quantized via onnxruntime.quantization.quantize_dynamic (MatMul ops only).
  • The decoder contains 20 If nodes; the nodes inside their subgraphs are excluded from quantization to avoid a hang in ORT's session initializer.
  • WER is identical to fp32 on the vconnx reference clip (8% vs 8%).
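
A MatMul-only dynamic quantization that skips everything inside If subgraphs could be set up roughly as follows. This is a hedged sketch, not the vconnx export pipeline: `subgraph_node_names` and `quantize_matmul_only` are hypothetical helpers, the exclusion works by node name (so unnamed nodes are not covered), and how `nodes_to_exclude` is applied to subgraph nodes may vary between ORT versions.

```python
GRAPH, GRAPHS = 5, 10  # onnx.AttributeProto.GRAPH / GRAPHS enum values

def subgraph_node_names(graph, inside_if=False):
    """Collect the names of all nodes nested inside any If-node subgraph."""
    names = []
    for node in graph.node:
        if inside_if and node.name:
            names.append(node.name)
        nested = inside_if or node.op_type == "If"
        for attr in node.attribute:
            if attr.type == GRAPH:
                names.extend(subgraph_node_names(attr.g, nested))
            elif attr.type == GRAPHS:
                for sub in attr.graphs:
                    names.extend(subgraph_node_names(sub, nested))
    return names

def quantize_matmul_only(src, dst):
    # Imported lazily so the helper above works without onnx/onnxruntime.
    import onnx
    from onnxruntime.quantization import quantize_dynamic

    # Only the graph structure is needed here, not the external weights.
    model = onnx.load(str(src), load_external_data=False)
    quantize_dynamic(
        model_input=str(src),
        model_output=str(dst),
        op_types_to_quantize=["MatMul"],
        nodes_to_exclude=subgraph_node_names(model.graph),
    )
```

A call would then look like `quantize_matmul_only("conditional_decoder.onnx", "conditional_decoder_q8.onnx")`, with the fp32 `.onnx_data` file sitting next to the input model.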

WER results

Measured with faster-whisper base.en on the vconnx reference demo clip (10.7 s, source.wav):

| Variant | WER | Size (MB) |
|---|---|---|
| fp32 | 8% | 1081 |
| INT8 | 8% | 468 |

Gate: the INT8 variant is flagged ⚠ only when its WER is above 25% and more than 15 percentage points worse than fp32. INT8 passes the gate (8% vs 8%) and is the recommended variant.
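
The WER metric and the gate can be sketched in a few lines of Python. This is an illustration under assumptions: `wer` is a generic word-level edit-distance implementation with lowercase and whitespace tokenization (the text normalization actually used by vconnx is not stated here), and `int8_flagged` is a hypothetical name for the gate rule described above.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction: word-level edit distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def int8_flagged(wer_int8_pct: float, wer_fp32_pct: float,
                 abs_limit: float = 25.0, delta_limit: float = 15.0) -> bool:
    """Flag INT8 only if BOTH conditions hold (WER values in percent)."""
    return (wer_int8_pct > abs_limit
            and (wer_int8_pct - wer_fp32_pct) > delta_limit)

# Values from the table above: both variants score 8 %, so INT8 is not flagged.
print(int8_flagged(8.0, 8.0))
```

Because the two conditions are joined with AND, a quantized model with a high absolute WER is not flagged when fp32 is nearly as bad, and a large relative drop is not flagged while the absolute WER stays at or below 25%.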

Usage

Install vconnx and use the chatterbox engine with quantized=True:

from vconnx import VoiceCloner

cloner = VoiceCloner(engine="chatterbox", quantized=True)
result = cloner.clone_voice("source.wav", "reference.wav", "output.wav")

License

Apache-2.0 (inherited from upstream onnx-community/chatterbox-onnx).
