Chatterbox Multilingual LiteRT

LiteRT / TensorFlow Lite port of ResembleAI/chatterbox — a multilingual zero-shot TTS model with 24 kHz output and voice cloning across 23 languages.

This bundle ships the model as five separate LiteRT graphs plus a manifest and the default-voice conditioning tensor. The on-device worker orchestrates the loop:

t3_prefill ─► t3_step (×N, AR) ─► flow_encoder ─► flow_estimator (×10 Euler) ─► hift

The T3 graphs take token IDs (the text/speech embedding tables + learned positional embeddings are folded into the graphs), so the worker does not need any embedding weights — only the tokenizer and the bundled cond_emb.bin. This bundle is deployment-complete for zero-shot (default-voice) TTS; arbitrary-speaker voice cloning is a follow-up (it additionally needs the voice-encoder + S3 speech-tokenizer + cond_enc graphs).

The T3 speech-token language model (a Llama backbone) owns its K/V cache as explicit graph inputs/outputs (not mutated in place), so the runtime owns retry / cancellation / deterministic-replay semantics. Generation is autoregressive over speech tokens; the flow-matching decoder then turns tokens into a mel-spectrogram, and the HiFT vocoder renders 24 kHz PCM.

Part of soniqo.audio — an on-device speech toolkit. Chatterbox is supported today in the Apple/MLX runtime; this LiteRT bundle is the Android/C++ export target and still needs runtime wrapper wiring before it can be used through speech-core or speech-android.

Status: experimental. The t3_step / t3_prefill graphs depend on ai_edge_torch static K/V-cache lowering; integrators should validate numerical parity end-to-end on the target device before relying on this bundle in production.

Capabilities

23 languages including Arabic, English, Spanish, French, German, Chinese, Japanese, Korean, Hindi, Russian
24 kHz output
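Zero-shot synthesis: generate speech from text alone
Voice cloning: clone a target speaker from a single reference clip (upstream capability; this bundle ships the default voice only, and cloning needs the follow-up voice-encoder, S3 speech-tokenizer and cond_enc graphs)
Emotion / exaggeration control: upstream exaggeration knob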

Pipeline graphs

File	Variant	Role
`chatterbox-t3-prefill.tflite`	INT8 weights / FP32 compute	`(text_token_ids[1,478], cond_emb[1,34,1024])` → embeds text + pos internally, prepends `cond_emb` → logits + initial K/V cache. Fixed total length 512.
`chatterbox-t3-step.tflite`	INT8 weights / FP32 compute	One AR step: `(speech_token_id, speech_pos_idx, cache_position, K/V cache)` → embeds the speech token + pos internally → next-token logits + updated K/V cache. BOS (6561) is step 0. Fixed-max cache 1536.
`chatterbox-flow-encoder.tflite`	FP32	Upsampling conformer encoder: speech tokens → conditioning features `mu` + mask. Runs once per utterance. Fixed-max 512 tokens.
`chatterbox-flow-estimator.tflite`	FP32	Conditional-flow-matching DiT estimator; run 10× inside the Euler ODE solver with CFG. Kept FP32 — quantising the estimator turns the noise-like texture of consonants into tonal structure.
`chatterbox-hift.tflite`	FP32	HiFT vocoder (iSTFT head): mel → 24 kHz PCM. Kept FP32 — vocoder phase reconstruction is precision-sensitive.
`cond_emb.bin`	float32	Default-voice conditioning, `[1, 34, 1024]`. Passed to `t3_prefill` for zero-shot TTS.
`config.json`	—	Manifest: tensor signatures, sample rates, cache contract, default decoding params, file mapping.
`grapheme_mtl_merged_expanded_v1.json` / `Cangjie5_TC.json`	—	Multilingual BPE tokenizer + Chinese Cangjie map.
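As an illustration of the `cond_emb.bin` row above, the following is a minimal sketch of loading the default-voice conditioning on the host. It assumes the file is a headerless little-endian float32 dump of the `[1, 34, 1024]` tensor; that layout is an assumption, and `config.json` should be treated as authoritative. The helper name `load_cond_emb` is hypothetical.

```python
import numpy as np

# Shape taken from the table above.
COND_SHAPE = (1, 34, 1024)


def load_cond_emb(path):
    """Read cond_emb.bin as raw float32 and reshape it to [1, 34, 1024].

    Assumption: the file has no header and is little-endian float32.
    """
    data = np.fromfile(path, dtype="<f4")
    expected = int(np.prod(COND_SHAPE))
    if data.size != expected:
        raise ValueError(
            f"expected {expected} float32 values, found {data.size}"
        )
    return data.reshape(COND_SHAPE)
```

The size check is there so that a truncated download or a file with an unexpected header fails loudly instead of producing a silently misaligned tensor.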

Only the two large T3 language-model graphs are quantised (INT8 weight-only / FP32 compute). That is where the size lives (2 GB → ~0.5 GB each), and the quantised graphs still run on a plain CPU interpreter. The flow encoder, the conditional-flow-matching estimator and the HiFT vocoder are kept FP32: the audio path is precision-sensitive (the same lesson as the VoxCPM2 port: sub-FP32 in the flow/vocoder produces sandy/metallic artefacts), and these graphs are cheap (the encoder and the vocoder each run once per utterance).
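The 10-step Euler loop that drives `chatterbox-flow-estimator.tflite` can be sketched as below. This is an illustration only: the `estimator(x, mu, t)` callable is a hypothetical stand-in for one invocation of the estimator graph, the time grid is assumed to be uniform on [0, 1] (the upstream schedule may differ), and the CFG combination inside the solver is omitted.

```python
import numpy as np


def euler_solve(estimator, x0, mu, steps=10):
    """Integrate dx/dt = estimator(x, mu, t) from t=0 to t=1 with Euler steps.

    estimator: hypothetical callable wrapping one flow-estimator invocation.
    x0:        initial noise, same shape as the mel to be produced.
    mu:        conditioning features from the flow encoder.
    """
    t_span = np.linspace(0.0, 1.0, steps + 1)  # assumption: uniform grid
    x = x0
    for t0, t1 in zip(t_span[:-1], t_span[1:]):
        velocity = estimator(x, mu, t0)  # one graph call per Euler step
        x = x + (t1 - t0) * velocity
    return x
```

With `steps=10` the estimator is called ten times per utterance, which matches the "×10 Euler" stage in the pipeline diagram.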

K/V-cache contract

t3_prefill and t3_step take/return the Llama K/V cache as explicit tensors. Per-graph layout, one tensor per layer (30 layers), keys then values:

key[i]   : [1, 16, cache_len, 64]      # 16 heads (no GQA), head_dim 64
value[i] : [1, 16, cache_len, 64]

t3_prefill writes positions 0 … prefix_len-1 of a length-512 cache, where prefix_len = 34 (cond) + true_text_len. Text is end-padded to 478, so positions past prefix_len are padding and are not used.
The worker copies the first prefix_len positions into the length-1536 t3_step buffer, then runs the AR loop from cache_position = prefix_len (step 0 = BOS token 6561, speech_pos_idx = 0).
Graphs never mutate the cache buffers in place — they emit updated tensors the worker swaps. This is what makes mid-generation cancellation and deterministic replay possible from C++.
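The hand-off from the prefill cache to the step cache described above can be sketched as follows. The function name `seed_step_cache` is hypothetical, and zero-filling the unused tail of the step buffer is an assumption (the sketch assumes the step graph ignores positions at or beyond `cache_position`). The copy is the same for key and value tensors, so the sketch simply iterates over whatever list of per-layer tensors the prefill graph returned.

```python
import numpy as np

NUM_HEADS, HEAD_DIM = 16, 64
COND_LEN = 34            # conditioning positions prepended by t3_prefill
PREFILL_LEN = 512        # fixed prefill cache length
STEP_CACHE_LEN = 1536    # fixed-max step cache length


def seed_step_cache(prefill_cache, true_text_len):
    """Copy the valid prefix of each prefill K/V tensor into a step-sized buffer.

    prefill_cache: list of arrays shaped [1, 16, 512, 64].
    Returns (step_cache, prefix_len); prefix_len is the first cache_position
    for the autoregressive loop.
    """
    if not 0 < true_text_len <= PREFILL_LEN - COND_LEN:
        raise ValueError("true_text_len must be in 1..478")
    prefix_len = COND_LEN + true_text_len
    step_cache = []
    for tensor in prefill_cache:
        if tensor.shape != (1, NUM_HEADS, PREFILL_LEN, HEAD_DIM):
            raise ValueError(f"unexpected cache shape {tensor.shape}")
        # Fresh buffer per tensor: nothing is mutated in place.
        buf = np.zeros((1, NUM_HEADS, STEP_CACHE_LEN, HEAD_DIM), dtype=np.float32)
        buf[:, :, :prefix_len, :] = tensor[:, :, :prefix_len, :]
        step_cache.append(buf)
    return step_cache, prefix_len
```

Because every call returns new buffers, a worker could keep the seeded cache around and restart generation from `prefix_len` without re-running prefill.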

CFG (classifier-free guidance)

Chatterbox uses CFG with cfg_weight (default 0.5): logits = cond + w·(cond − uncond). The graphs are exported at batch = 1; the worker runs the conditional and unconditional streams as two calls (two caches) and combines the logits host-side.
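A minimal sketch of the host-side combination, assuming the worker already holds the logits from the conditional and unconditional calls as arrays of the same shape. The function name `combine_cfg` is hypothetical; the formula is the one stated above.

```python
import numpy as np


def combine_cfg(cond_logits, uncond_logits, cfg_weight=0.5):
    """Apply logits = cond + w * (cond - uncond) on the host."""
    cond = np.asarray(cond_logits, dtype=np.float32)
    uncond = np.asarray(uncond_logits, dtype=np.float32)
    if cond.shape != uncond.shape:
        raise ValueError("conditional and unconditional logits must match in shape")
    return cond + cfg_weight * (cond - uncond)
```

With `cfg_weight = 0` this reduces to the conditional logits alone, so a runtime could skip the unconditional call entirely in that case.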

Default decoding parameters

Parameter	Default
prefill total (cond 34 + text-pad 478)	512
`max_cache_tokens` (step)	1536
`max_generated_tokens`	1000
`cfg_weight`	0.5
`temperature`	0.8
`flow_inference_steps` (Euler)	10
Sample rate (output)	24 000 Hz

These mirror the host-side defaults in config.json; runtimes may override them.
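The defaults above fit together as follows: the prefill prefix (512) plus the generation budget (1000) stays within the step cache (1536). The sketch below checks that arithmetic and shows plain temperature sampling at the default 0.8. It is an illustration only: `sample_token` is a hypothetical helper, and any further sampling filters a runtime applies are out of scope here.

```python
import numpy as np

PREFILL_TOTAL = 34 + 478        # cond + padded text = 512
MAX_CACHE_TOKENS = 1536
MAX_GENERATED_TOKENS = 1000

# 512 + 1000 = 1512, which leaves 24 spare positions in the step cache.
assert PREFILL_TOTAL + MAX_GENERATED_TOKENS <= MAX_CACHE_TOKENS


def sample_token(logits, temperature=0.8, rng=None):
    """Draw one token id from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(probs.size, p=probs))
```

Passing a seeded `np.random.default_rng(seed)` makes the draw repeatable, which pairs with the deterministic-replay property of the explicit K/V cache.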

Validation

Per-component round-trip vs PyTorch (FP32): t3_prefill logits MAE ~5e-6 (argmax 100%), t3_step MAE ~2e-6, flow encoder MAE ~0, flow estimator MAE ~3e-6, HiFT vocoder MAE ~4e-6 (corr 1.0000).
INT8 vs FP32 logits (T3 graphs): step ~0 relative, prefill ~3.8% relative (the folded embedding tables are quantised) — size-optimized; validate parity on the target device before production.
End-to-end audio path (flow encoder → 10× estimator Euler → HiFT, all LiteRT, inside the real generate loop): synthesis WER identical to PyTorch for English (0.0 == 0.0) and Arabic (0.0 == 0.0).
Performance (dev-box CPU, 4 threads; not a phone): INT8 real-time factor ≈ 0.95 for a ~5 s utterance (≈ realtime); FP32 ≈ 2.3. The T3 autoregressive loop dominates (~85% of wall-clock). INT8 weight-only gives ~4× size reduction and ~2.4× speed-up over FP32 on the T3 graphs.
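For reproducing the performance figure on a target device, the real-time factor can be computed as below. This sketch assumes the usual definition, wall-clock synthesis time divided by the duration of the audio produced, which is consistent with the numbers above (values below 1 are faster than real time). The helper name is hypothetical.

```python
def real_time_factor(wall_seconds, num_samples, sample_rate=24000):
    """Return synthesis time divided by audio duration (lower is faster)."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return wall_seconds / audio_seconds
```

For example, 120 000 output samples at 24 kHz is 5 s of audio; producing them in 4.75 s of wall-clock time gives a factor of 0.95.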

Source

Converted from the upstream PyTorch weights at ResembleAI/chatterbox using ai_edge_torch (torch.export / dynamo) with ai_edge_quantizer weight-only INT8 (dynamic_legacy_wi8_afp32) on the two T3 LM graphs, and the flow encoder + estimator + HiFT vocoder held at FP32. The text/speech embedding tables and learned positional embeddings are folded into the T3 graphs so they take token IDs. STFT/iSTFT in the vocoder were re-expressed as fixed-basis Conv1d / ConvTranspose1d (DFT) for export; the K/V cache is lowered as graph IO via a functional static-cache update. The default-voice cond_emb.bin is prepare_conditioning(conds.t3) of the upstream default conditioning.

License

MIT (inherited from upstream ResembleAI/chatterbox).

Responsible use

Upstream Chatterbox supports voice cloning, although this bundle currently ships only the default voice. Anyone who clones a voice is responsible for obtaining the speaker's consent and for not using the model to impersonate individuals without their permission, generate disinformation, or commit fraud.
