Chatterbox Multilingual LiteRT

LiteRT / TensorFlow Lite port of ResembleAI/chatterbox, a multilingual zero-shot TTS model with 24 kHz output and voice cloning across 23 languages.

This bundle ships the model as five separate LiteRT graphs plus a manifest and the default-voice conditioning tensor. The on-device worker orchestrates the loop:

t3_prefill ─► t3_step (×N, AR) ─► flow_encoder ─► flow_estimator (×10 Euler) ─► hift

The T3 graphs take token IDs (the text/speech embedding tables + learned positional embeddings are folded into the graphs), so the worker does not need any embedding weights, only the tokenizer and the bundled cond_emb.bin. This bundle is deployment-complete for zero-shot (default-voice) TTS; arbitrary-speaker voice cloning is a follow-up (it additionally needs the voice-encoder + S3 speech-tokenizer + cond_enc graphs).

The T3 speech-token language model (a Llama backbone) owns its K/V cache as explicit graph inputs/outputs (not mutated in place), so the runtime owns retry / cancellation / deterministic-replay semantics. Generation is autoregressive over speech tokens; the flow-matching decoder then turns tokens into a mel-spectrogram, and the HiFT vocoder renders 24 kHz PCM.
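The stage order described above can be sketched as a small host-side driver. This is a minimal illustration, not the bundle's actual worker: every callable (`prefill`, `step`, `flow_encoder`, `flow_estimator`, `hift`) is a hypothetical wrapper around the corresponding graph, `step` is assumed to sample and return the next token itself, the stop token is passed in because this card does not state it, and the Euler loop uses uniform time steps without CFG as a simplifying assumption.

```python
from typing import Callable, List, Sequence, Tuple

BOS_SPEECH_TOKEN = 6561  # from this card: BOS is fed at step 0
EULER_STEPS = 10         # flow_inference_steps default from this card


def run_pipeline(
    text_ids: Sequence[int],
    prefill: Callable[[Sequence[int]], Tuple[object, int]],
    step: Callable[[int, int, int, object], Tuple[int, object]],
    flow_encoder: Callable[[List[int]], object],
    flow_estimator: Callable[[List[float], float, object], List[float]],
    hift: Callable[[List[float]], List[float]],
    initial_noise: Sequence[float],
    stop_token: int,
    max_generated_tokens: int = 1000,
    should_cancel: Callable[[], bool] = lambda: False,
) -> List[float]:
    """Run the five stages in order; all callables are hypothetical wrappers."""
    # 1) Prefill builds the initial K/V cache and reports the prefix length.
    cache, prefix_len = prefill(text_ids)

    # 2) Autoregressive loop. The cache is passed in and a new one comes back,
    #    so the host can stop between steps without corrupting any state.
    speech_tokens: List[int] = []
    token = BOS_SPEECH_TOKEN
    for speech_pos_idx in range(max_generated_tokens):
        if should_cancel():
            break
        cache_position = prefix_len + speech_pos_idx
        token, cache = step(token, speech_pos_idx, cache_position, cache)
        if token == stop_token:
            break
        speech_tokens.append(token)

    # 3) The flow encoder runs once per utterance.
    mu = flow_encoder(speech_tokens)

    # 4) Fixed-step Euler integration of the estimated velocity field
    #    (uniform time grid is an assumption of this sketch).
    x = list(initial_noise)
    dt = 1.0 / EULER_STEPS
    for k in range(EULER_STEPS):
        velocity = flow_estimator(x, k * dt, mu)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]

    # 5) The vocoder turns the mel features into PCM samples.
    return hift(x)
```

Because `step` returns a fresh cache instead of mutating its input, the loop can break on cancellation at any step and the last accepted cache remains valid for replay.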

Part of soniqo.audio, an on-device speech toolkit. Chatterbox is supported today in the Apple/MLX runtime; this LiteRT bundle is the Android/C++ export target and still needs runtime wrapper wiring before it can be used through speech-core or speech-android.

Status: experimental. The t3_step / t3_prefill graphs depend on ai_edge_torch static K/V-cache lowering; integrators should validate numerical parity end-to-end on the target device before relying on this bundle in production.

Capabilities

  • 23 languages including Arabic, English, Spanish, French, German, Chinese, Japanese, Korean, Hindi, Russian
  • 24 kHz output
  • Zero-shot synthesis: generate speech from text alone
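  • Voice cloning: clone a target speaker from a single reference clip (an upstream capability; this bundle does not yet ship the voice-encoder, S3 speech-tokenizer and cond_enc graphs it requires)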
  • Emotion / exaggeration control: the upstream exaggeration knob

Pipeline graphs

| File | Variant | Role |
| --- | --- | --- |
| chatterbox-t3-prefill.tflite | INT8 weights / FP32 compute | (text_token_ids[1,478], cond_emb[1,34,1024]) → embeds text + pos internally, prepends cond_emb → logits + initial K/V cache. Fixed total length 512. |
| chatterbox-t3-step.tflite | INT8 weights / FP32 compute | One AR step: (speech_token_id, speech_pos_idx, cache_position, K/V cache) → embeds the speech token + pos internally → next-token logits + updated K/V cache. BOS (6561) is step 0. Fixed-max cache 1536. |
| chatterbox-flow-encoder.tflite | FP32 | Upsampling conformer encoder: speech tokens → conditioning features mu + mask. Runs once per utterance. Fixed-max 512 tokens. |
| chatterbox-flow-estimator.tflite | FP32 | Conditional-flow-matching DiT estimator; run 10× inside the Euler ODE solver with CFG. Kept FP32: quantising the estimator turns the noise-like texture of consonants into tonal structure. |
| chatterbox-hift.tflite | FP32 | HiFT vocoder (iSTFT head): mel → 24 kHz PCM. Kept FP32: vocoder phase reconstruction is precision-sensitive. |
| cond_emb.bin | float32 | Default-voice conditioning, [1, 34, 1024]. Passed to t3_prefill for zero-shot TTS. |
| config.json | n/a | Manifest: tensor signatures, sample rates, cache contract, default decoding params, file mapping. |
| grapheme_mtl_merged_expanded_v1.json / Cangjie5_TC.json | n/a | Multilingual BPE tokenizer + Chinese Cangjie map. |

Only the two large T3 language-model graphs are quantised (INT8 weight-only / FP32 compute): that is where the size lives (2 GB → ~0.5 GB each), and they still run on a plain CPU interpreter. The flow encoder, the conditional-flow-matching estimator and the HiFT vocoder are kept FP32: the audio path is precision-sensitive (the same lesson as the VoxCPM2 port: sub-FP32 in the flow/vocoder produces sandy/metallic artefacts), and these graphs are cheap (the encoder runs once; the vocoder once).
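As an illustration of how a worker might read the default-voice conditioning listed in the table, the sketch below loads cond_emb.bin into a flat float32 buffer and checks its size against the documented [1, 34, 1024] shape. The on-disk layout is an assumption (raw little-endian float32, no header); the manifest in config.json is the authority on the real format, and `load_cond_emb` is a hypothetical helper name.

```python
import array
import sys

COND_SHAPE = (1, 34, 1024)  # shape documented for cond_emb.bin


def load_cond_emb(path: str) -> array.array:
    """Load cond_emb.bin as a flat float32 buffer.

    Assumes raw little-endian float32 with no header (not confirmed by the card).
    """
    expected_values = 1
    for dim in COND_SHAPE:
        expected_values *= dim

    with open(path, "rb") as fh:
        raw = fh.read()

    # 1 * 34 * 1024 float32 values = 34816 values = 139264 bytes.
    expected_bytes = expected_values * 4
    if len(raw) != expected_bytes:
        raise ValueError(
            f"expected {expected_bytes} bytes for shape {COND_SHAPE}, got {len(raw)}"
        )

    values = array.array("f")
    values.frombytes(raw)
    if sys.byteorder == "big":
        values.byteswap()  # file is assumed little-endian
    return values
```

The size check is a cheap guard against picking up a truncated download or a file exported with a different shape.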

K/V-cache contract

t3_prefill and t3_step take/return the Llama K/V cache as explicit tensors. Per-graph layout, one tensor per layer (30 layers), keys then values:

key[i]   : [1, 16, cache_len, 64]      # 16 heads (no GQA), head_dim 64
value[i] : [1, 16, cache_len, 64]
  • t3_prefill writes positions 0 … prefix_len-1 of a length-512 cache, where prefix_len = 34 (cond) + true_text_len. Text is end-padded to 478, so positions past prefix_len are padding and are not used.
  • The worker copies the first prefix_len positions into the length-1536 t3_step buffer, then runs the AR loop from cache_position = prefix_len (step 0 = BOS token 6561, speech_pos_idx = 0).
  • Graphs never mutate the cache buffers in place; they emit updated tensors that the worker swaps in. This is what makes mid-generation cancellation and deterministic replay possible from C++.
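The prefix copy in the contract above can be sketched for a single key or value tensor as follows. It assumes a row-major flat buffer with layout [1, heads, cache_len, head_dim] and float32 cache elements; `copy_prefix` and `cache_bytes` are hypothetical helper names, and a real worker would do this with a tensor library or memcpy rather than Python lists.

```python
from typing import List, Sequence

LAYERS, HEADS, HEAD_DIM = 30, 16, 64   # from the contract above
PREFILL_LEN, STEP_LEN = 512, 1536      # prefill cache and step cache lengths


def copy_prefix(
    prefill_cache: Sequence[float],
    prefix_len: int,
    heads: int = HEADS,
    head_dim: int = HEAD_DIM,
    prefill_len: int = PREFILL_LEN,
    step_len: int = STEP_LEN,
) -> List[float]:
    """Copy positions 0..prefix_len-1 of one K or V tensor into a zeroed step buffer."""
    if not 0 <= prefix_len <= min(prefill_len, step_len):
        raise ValueError("prefix_len out of range")
    if len(prefill_cache) != heads * prefill_len * head_dim:
        raise ValueError("prefill_cache has the wrong number of elements")

    step_cache = [0.0] * (heads * step_len * head_dim)
    run = prefix_len * head_dim  # positions are contiguous within one head
    for h in range(heads):
        src = h * prefill_len * head_dim
        dst = h * step_len * head_dim
        step_cache[dst:dst + run] = prefill_cache[src:src + run]
    return step_cache


def cache_bytes(cache_len: int = STEP_LEN, bytes_per_value: int = 4) -> int:
    """Total size of one full cache: keys and values for every layer (float32 assumed)."""
    return LAYERS * 2 * HEADS * cache_len * HEAD_DIM * bytes_per_value
```

Under the float32 assumption, `cache_bytes()` evaluates to 377,487,360 bytes (360 MiB) for one length-1536 cache, which is worth budgeting for on memory-constrained devices.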

CFG (classifier-free guidance)

Chatterbox uses CFG with cfg_weight (default 0.5): logits = cond + w·(cond − uncond). The graphs are exported at batch = 1; the worker runs the conditional and unconditional streams as two calls (two caches) and combines the logits host-side.
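The host-side combination is a single element-wise expression. A minimal sketch, assuming the two logit vectors have already been read out of the conditional and unconditional t3_step calls as plain float sequences (`cfg_combine` is a hypothetical helper name):

```python
from typing import List, Sequence


def cfg_combine(
    cond: Sequence[float], uncond: Sequence[float], cfg_weight: float = 0.5
) -> List[float]:
    """Apply logits = cond + w * (cond - uncond) element-wise."""
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond logits must have the same length")
    return [c + cfg_weight * (c - u) for c, u in zip(cond, uncond)]
```

With cfg_weight = 0 the result is just the conditional logits, so a runtime that sets the weight to zero could skip the unconditional call and its second cache entirely.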

Default decoding parameters

| Parameter | Default |
| --- | --- |
| prefill total (cond 34 + text-pad 478) | 512 |
| max_cache_tokens (step) | 1536 |
| max_generated_tokens | 1000 |
| cfg_weight | 0.5 |
| temperature | 0.8 |
| flow_inference_steps (Euler) | 10 |
| Sample rate (output) | 24 000 Hz |

These mirror the host-side defaults in config.json; runtimes may override them.
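When a runtime overrides these values, it is worth checking that the worst-case prefix plus the generation budget still fits the step cache (512 + 1000 = 1512 ≤ 1536 with the defaults). The sketch below is one way to do that; `DecodingParams` and its field names are hypothetical and do not claim to match the keys in config.json.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DecodingParams:
    """Defaults copied from the table above; field names are illustrative."""
    cond_len: int = 34
    text_pad_len: int = 478
    max_cache_tokens: int = 1536
    max_generated_tokens: int = 1000
    cfg_weight: float = 0.5
    temperature: float = 0.8
    flow_inference_steps: int = 10
    sample_rate_hz: int = 24000

    @property
    def prefill_total(self) -> int:
        return self.cond_len + self.text_pad_len

    def validate(self) -> None:
        # Each AR step writes one cache position starting at prefix_len,
        # so the longest prefix plus the token budget must fit the cache.
        worst_case = self.prefill_total + self.max_generated_tokens
        if worst_case > self.max_cache_tokens:
            raise ValueError(
                f"{worst_case} cache positions needed, only {self.max_cache_tokens} available"
            )
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")


defaults = DecodingParams()
defaults.validate()

# Overriding a subset of fields keeps the rest at their defaults.
shorter = replace(defaults, max_generated_tokens=400, temperature=0.7)
shorter.validate()
```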

Validation

  • Per-component round-trip vs PyTorch (FP32): t3_prefill logits MAE ~5e-6 (argmax 100%), t3_step MAE ~2e-6, flow encoder MAE ~0, flow estimator MAE ~3e-6, HiFT vocoder MAE ~4e-6 (corr 1.0000).
  • INT8 vs FP32 logits (T3 graphs): step ~0 relative, prefill ~3.8% relative (the folded embedding tables are quantised). These graphs are size-optimised; validate parity on the target device before production.
  • End-to-end audio path (flow encoder → 10× estimator Euler → HiFT, all LiteRT, inside the real generate loop): synthesis WER identical to PyTorch for English (0.0 == 0.0) and Arabic (0.0 == 0.0).
  • Performance (dev-box CPU, 4 threads; not a phone): INT8 real-time factor ≈ 0.95 for a 5 s utterance (≈ realtime); FP32 ≈ 2.3. The T3 autoregressive loop dominates (85% of wall-clock). INT8 weight-only gives ~4× size reduction and ~2.4× speed-up over FP32 on the T3 graphs.
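To make the performance figures concrete, the arithmetic below converts the quoted real-time factors into wall-clock time. It assumes the usual definition RTF = processing time / audio duration (lower is faster), which the card does not spell out but which matches "≈ realtime" at 0.95.

```python
def processing_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time implied by a real-time factor (RTF = processing / audio)."""
    return rtf * audio_seconds


int8_s = processing_seconds(0.95, 5.0)   # about 4.75 s for a 5 s utterance
fp32_s = processing_seconds(2.3, 5.0)    # about 11.5 s for the same utterance
speedup = fp32_s / int8_s                # about 2.4, in line with the quoted speed-up
```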

Source

Converted from the upstream PyTorch weights at ResembleAI/chatterbox using ai_edge_torch (torch.export / dynamo) with ai_edge_quantizer weight-only INT8 (dynamic_legacy_wi8_afp32) on the two T3 LM graphs, and the flow encoder + estimator + HiFT vocoder held at FP32. The text/speech embedding tables and learned positional embeddings are folded into the T3 graphs so they take token IDs. STFT/iSTFT in the vocoder were re-expressed as fixed-basis Conv1d / ConvTranspose1d (DFT) for export; the K/V cache is lowered as graph IO via a functional static-cache update. The default-voice cond_emb.bin is prepare_conditioning(conds.t3) of the upstream default conditioning.

License

MIT (inherited from upstream ResembleAI/chatterbox).

Responsible use

Chatterbox supports voice cloning. Users are responsible for obtaining consent for any voice that is cloned and for not using the model to impersonate individuals without their permission, generate disinformation, or commit fraud.
