How to use soniqo/Chatterbox-LiteRT with Chatterbox:

```python
# pip install chatterbox-tts
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-1.wav", wav, model.sr)

# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)
```
Chatterbox Multilingual LiteRT
LiteRT / TensorFlow Lite port of ResembleAI/chatterbox, a multilingual zero-shot TTS model with 24 kHz output and voice cloning across 23 languages.
This bundle ships the model as five separate LiteRT graphs plus a manifest and the default-voice conditioning tensor. The on-device worker orchestrates the loop:
t3_prefill → t3_step (×N, AR) → flow_encoder → flow_estimator (×10 Euler) → hift
The T3 graphs take token IDs (the text/speech embedding tables + learned positional embeddings are folded into the graphs), so the worker does not need any embedding weights, only the tokenizer and the bundled cond_emb.bin. This bundle is deployment-complete for zero-shot (default-voice) TTS; arbitrary-speaker voice cloning is a follow-up (it additionally needs the voice-encoder + S3 speech-tokenizer + cond_enc graphs).
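The prefill inputs described above could be assembled along these lines. This is a minimal sketch assuming NumPy; the helper names, the pad ID of 0, the int32 dtype and the headerless little-endian layout of cond_emb.bin are all assumptions made for illustration, and config.json remains the authority for the real tensor signatures.

```python
import numpy as np

TEXT_PAD_LEN = 478            # fixed text length of t3_prefill (from this card)
COND_SHAPE = (1, 34, 1024)    # shape of the bundled cond_emb.bin (from this card)


def load_cond_emb(raw: bytes) -> np.ndarray:
    """Interpret the raw bytes of cond_emb.bin as a float32 [1, 34, 1024] tensor."""
    # Assumption: the file is a headerless little-endian float32 dump.
    cond = np.frombuffer(raw, dtype="<f4")
    expected = int(np.prod(COND_SHAPE))
    if cond.size != expected:
        raise ValueError(f"expected {expected} floats, got {cond.size}")
    return cond.reshape(COND_SHAPE)


def pad_text_ids(token_ids, pad_id=0):
    """End-pad tokenizer output to [1, 478]; returns (padded ids, true length)."""
    # Assumption: pad_id and int32 are placeholders; check config.json.
    n = len(token_ids)
    if n > TEXT_PAD_LEN:
        raise ValueError(f"text has {n} tokens, limit is {TEXT_PAD_LEN}")
    padded = np.full((1, TEXT_PAD_LEN), pad_id, dtype=np.int32)
    padded[0, :n] = token_ids
    return padded, n
```

The true length returned by pad_text_ids is worth keeping, because the cache contract further down derives prefix_len from it.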
The T3 speech-token language model (a Llama backbone) owns its K/V cache as explicit graph inputs/outputs (not mutated in place), so the runtime owns retry / cancellation / deterministic-replay semantics. Generation is autoregressive over speech tokens; the flow-matching decoder then turns tokens into a mel-spectrogram, and the HiFT vocoder renders 24 kHz PCM.
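The flow-matching stage runs the estimator inside a fixed-step Euler solver with CFG. The sketch below illustrates that idea in NumPy. The estimator call signature, the uniform time grid, the guidance rate of 0.7 and the choice to model the unconditional branch by zeroing mu are assumptions for illustration, not values read from this bundle.

```python
import numpy as np


def euler_cfg_solve(estimator, x0, mu, n_steps=10, cfg_rate=0.7):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler and CFG.

    estimator(x, t, mu) is a hypothetical callable wrapping the flow-estimator
    graph; the real graph takes more inputs (see config.json).
    """
    x = np.array(x0, dtype=np.float32, copy=True)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v_cond = estimator(x, t, mu)                    # conditioned on mu
        v_uncond = estimator(x, t, np.zeros_like(mu))   # conditioning dropped (assumption)
        # Guided velocity: push the conditional estimate away from the unconditional one.
        v = (1.0 + cfg_rate) * v_cond - cfg_rate * v_uncond
        x = x + dt * v
    return x
```

Each Euler step calls the estimator twice, so ten steps cost twenty estimator invocations unless the two branches are batched.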
Part of soniqo.audio, an on-device speech toolkit.
Chatterbox is supported today in the Apple/MLX runtime; this LiteRT bundle is
the Android/C++ export target and still needs runtime wrapper wiring before it
can be used through speech-core or speech-android.
Status: experimental. The t3_step / t3_prefill graphs depend on ai_edge_torch static K/V-cache lowering; integrators should validate numerical parity end-to-end on the target device before relying on this bundle in production.
Capabilities
- 23 languages including Arabic, English, Spanish, French, German, Chinese, Japanese, Korean, Hindi, Russian
- 24 kHz output
- Zero-shot synthesis: generate speech from text alone
- Voice cloning: clone a target speaker from a single reference clip
- Emotion / exaggeration control: upstream exaggeration knob
Pipeline graphs
| File | Variant | Role |
|---|---|---|
| chatterbox-t3-prefill.tflite | INT8 weights / FP32 compute | (text_token_ids[1,478], cond_emb[1,34,1024]) → embeds text + pos internally, prepends cond_emb → logits + initial K/V cache. Fixed total length 512. |
| chatterbox-t3-step.tflite | INT8 weights / FP32 compute | One AR step: (speech_token_id, speech_pos_idx, cache_position, K/V cache) → embeds the speech token + pos internally → next-token logits + updated K/V cache. BOS (6561) is step 0. Fixed-max cache 1536. |
| chatterbox-flow-encoder.tflite | FP32 | Upsampling conformer encoder: speech tokens → conditioning features mu + mask. Runs once per utterance. Fixed-max 512 tokens. |
| chatterbox-flow-estimator.tflite | FP32 | Conditional-flow-matching DiT estimator; run 10× inside the Euler ODE solver with CFG. Kept FP32: quantising the estimator turns the noise-like texture of consonants into tonal structure. |
| chatterbox-hift.tflite | FP32 | HiFT vocoder (iSTFT head): mel → 24 kHz PCM. Kept FP32: vocoder phase reconstruction is precision-sensitive. |
| cond_emb.bin | float32 | Default-voice conditioning, [1, 34, 1024]. Passed to t3_prefill for zero-shot TTS. |
| config.json | n/a | Manifest: tensor signatures, sample rates, cache contract, default decoding params, file mapping. |
| grapheme_mtl_merged_expanded_v1.json / Cangjie5_TC.json | n/a | Multilingual BPE tokenizer + Chinese Cangjie map. |
Only the two large T3 language-model graphs are quantised (INT8 weight-only / FP32 compute): that is where the size lives (2 GB → ~0.5 GB each), and the quantised graphs still run on a plain CPU interpreter. The flow encoder, the conditional-flow-matching estimator and the HiFT vocoder are kept FP32: the audio path is precision-sensitive (the same lesson as the VoxCPM2 port: sub-FP32 in the flow/vocoder produces sandy/metallic artefacts), and these graphs are cheap (the encoder runs once; the vocoder once).
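To make "INT8 weights / FP32 compute" concrete, here is a generic weight-only quantization sketch in NumPy. It uses symmetric per-output-channel scales, which is an assumption chosen for illustration; it is not a reproduction of the ai_edge_quantizer recipe used for this bundle, and the function names are hypothetical.

```python
import numpy as np


def quantize_weight_int8(w):
    """Symmetric per-output-channel INT8 quantization of a [out, in] weight matrix."""
    w = np.asarray(w, dtype=np.float32)
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    # One FP32 scale per output row; guard against all-zero rows.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def linear_wi8_afp32(x, q, scale):
    """FP32 compute: dequantize the INT8 weights, then do an ordinary FP32 matmul."""
    w_fp32 = q.astype(np.float32) * scale
    return x @ w_fp32.T
```

Storing one byte per weight instead of four is where the roughly 4× size reduction comes from; activations and accumulation stay in FP32, so no calibration data is involved in this scheme.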
K/V-cache contract
t3_prefill and t3_step take/return the Llama K/V cache as explicit tensors. Per-graph layout, one tensor per layer (30 layers), keys then values:
```
key[i]   : [1, 16, cache_len, 64]   # 16 heads (no GQA), head_dim 64
value[i] : [1, 16, cache_len, 64]
```
- t3_prefill writes positions 0 … prefix_len-1 of a length-512 cache, where prefix_len = 34 (cond) + true_text_len. Text is end-padded to 478, so positions past prefix_len are padding and are not used.
- The worker copies the first prefix_len positions into the length-1536 t3_step buffer, then runs the AR loop from cache_position = prefix_len (step 0 = BOS token 6561, speech_pos_idx = 0).
- Graphs never mutate the cache buffers in place; they emit updated tensors the worker swaps. This is what makes mid-generation cancellation and deterministic replay possible from C++.
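The hand-off from the prefill cache to the step cache can be sketched as follows. This assumes NumPy, a float32 cache and a list of (key, value) pairs per layer; the container layout and function names are hypothetical, and only the shapes and lengths come from the contract above.

```python
import numpy as np

N_LAYERS, N_HEADS, HEAD_DIM = 30, 16, 64
PREFILL_LEN, STEP_LEN, COND_LEN = 512, 1536, 34


def seed_step_cache(prefill_cache, true_text_len):
    """Copy the valid prefix of a length-512 prefill cache into fresh length-1536 buffers.

    prefill_cache is a list of (key, value) arrays, each [1, 16, 512, 64].
    """
    prefix_len = COND_LEN + true_text_len
    step_cache = []
    for key, value in prefill_cache:
        k = np.zeros((1, N_HEADS, STEP_LEN, HEAD_DIM), dtype=key.dtype)
        v = np.zeros_like(k)
        # Positions past prefix_len in the prefill cache are padding; skip them.
        k[:, :, :prefix_len] = key[:, :, :prefix_len]
        v[:, :, :prefix_len] = value[:, :, :prefix_len]
        step_cache.append((k, v))
    return step_cache, prefix_len  # prefix_len is the first cache_position


def cache_bytes(cache_len, itemsize=4):
    """Bytes for all K and V tensors of one stream at a given cache length."""
    return 2 * N_LAYERS * N_HEADS * cache_len * HEAD_DIM * itemsize
```

If the cache really is float32, cache_bytes(1536) comes to 377,487,360 bytes (360 MiB) per stream, so the two CFG streams together need twice that for the step cache alone.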
CFG (classifier-free guidance)
Chatterbox uses CFG with cfg_weight (default 0.5): logits = cond + w·(cond − uncond). The graphs are exported at batch = 1; the worker runs the conditional and unconditional streams as two calls (two caches) and combines the logits host-side.
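The host-side combination is a one-liner. A minimal NumPy sketch follows; the function name is hypothetical and the inputs stand in for the logits returned by the two t3_step calls.

```python
import numpy as np


def combine_cfg_logits(cond_logits, uncond_logits, cfg_weight=0.5):
    """Apply logits = cond + w * (cond - uncond) to the two streams' outputs."""
    cond = np.asarray(cond_logits, dtype=np.float32)
    uncond = np.asarray(uncond_logits, dtype=np.float32)
    return cond + cfg_weight * (cond - uncond)
```

For example, combine_cfg_logits([2.0, 0.0], [1.0, 1.0]) yields [2.5, -0.5], and cfg_weight=0 returns the conditional logits unchanged.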
Default decoding parameters
| Parameter | Default |
|---|---|
| prefill total (cond 34 + text-pad 478) | 512 |
| max_cache_tokens (step) | 1536 |
| max_generated_tokens | 1000 |
| cfg_weight | 0.5 |
| temperature | 0.8 |
| flow_inference_steps (Euler) | 10 |
| Sample rate (output) | 24 000 Hz |
These mirror the host-side defaults in config.json; runtimes may override them.
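As an illustration of how the temperature default enters decoding, here is a plain temperature-sampling sketch in NumPy. It covers only the temperature parameter listed above; any further filtering a runtime applies (for example top-p or repetition penalties) is outside this sketch, and the function name is hypothetical.

```python
import numpy as np


def sample_token(logits, temperature=0.8, rng=None):
    """Draw one token id from softmax(logits / temperature) over a 1-D logits vector."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()      # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(probs.size, p=probs))
```

Passing a seeded generator (np.random.default_rng(seed)) makes the draw reproducible, which pairs naturally with the deterministic-replay property of the explicit K/V cache.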
Validation
- Per-component round-trip vs PyTorch (FP32): t3_prefill logits MAE ~5e-6 (argmax 100%), t3_step MAE ~2e-6, flow encoder MAE ~0, flow estimator MAE ~3e-6, HiFT vocoder MAE ~4e-6 (corr 1.0000).
- INT8 vs FP32 logits (T3 graphs): step ~0 relative, prefill ~3.8% relative (the folded embedding tables are quantised). The graphs are size-optimized; validate parity on the target device before production.
- End-to-end audio path (flow encoder → 10× estimator Euler → HiFT, all LiteRT, inside the real generate loop): synthesis WER identical to PyTorch for English (0.0 == 0.0) and Arabic (0.0 == 0.0).
- Performance (dev-box CPU, 4 threads; not a phone): INT8 real-time factor ≈ 0.95 for a 5 s utterance (≈ realtime); FP32 ≈ 2.3. The T3 autoregressive loop dominates (85% of wall-clock). INT8 weight-only gives ~4× size reduction and ~2.4× speed-up over FP32 on the T3 graphs.
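Integrators repeating these checks on a target device might use helpers like the ones below. This is a sketch assuming NumPy; the function names are hypothetical, and the real-time factor is taken here as wall-clock time divided by audio duration, which matches the way the figures above are reported (below 1 is faster than realtime).

```python
import numpy as np


def mean_abs_error(reference, candidate):
    """MAE between a reference tensor (e.g. PyTorch) and a candidate (e.g. LiteRT)."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    return float(np.mean(np.abs(ref - cand)))


def argmax_agreement(reference_logits, candidate_logits):
    """Fraction of positions whose argmax over the last axis matches."""
    ref = np.argmax(np.asarray(reference_logits), axis=-1)
    cand = np.argmax(np.asarray(candidate_logits), axis=-1)
    return float(np.mean(ref == cand))


def real_time_factor(wall_seconds, n_samples, sample_rate=24_000):
    """Wall-clock synthesis time divided by the duration of the audio produced."""
    return wall_seconds / (n_samples / sample_rate)
```

For instance, producing 120,000 samples (5 s at 24 kHz) in 4.75 s of wall-clock time gives a real-time factor of 0.95.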
Source
Converted from the upstream PyTorch weights at ResembleAI/chatterbox using ai_edge_torch (torch.export / dynamo) with ai_edge_quantizer weight-only INT8 (dynamic_legacy_wi8_afp32) on the two T3 LM graphs, and the flow encoder + estimator + HiFT vocoder held at FP32. The text/speech embedding tables and learned positional embeddings are folded into the T3 graphs so they take token IDs. STFT/iSTFT in the vocoder were re-expressed as fixed-basis Conv1d / ConvTranspose1d (DFT) for export; the K/V cache is lowered as graph IO via a functional static-cache update. The default-voice cond_emb.bin is prepare_conditioning(conds.t3) of the upstream default conditioning.
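The idea of re-expressing an STFT as a convolution with a fixed DFT basis can be checked numerically. The sketch below uses NumPy, with a strided matrix product standing in for a Conv1d whose kernels are the cosine and negative-sine rows; it omits the analysis window and padding, covers only the forward transform, and its names and frame parameters are illustrative rather than taken from the exported vocoder.

```python
import numpy as np


def dft_basis(n_fft):
    """Real and imaginary DFT kernels, each of shape [n_fft // 2 + 1, n_fft]."""
    k = np.arange(n_fft // 2 + 1)[:, None]
    n = np.arange(n_fft)[None, :]
    angle = 2.0 * np.pi * k * n / n_fft
    return np.cos(angle), -np.sin(angle)


def stft_via_basis(signal, n_fft, hop):
    """Correlate hop-strided frames with the fixed kernels (what a Conv1d with stride=hop computes)."""
    real_k, imag_k = dft_basis(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    # Each output has shape [n_frames, n_fft // 2 + 1].
    return frames @ real_k.T, frames @ imag_k.T
```

Because the kernels are constants, the transform becomes ordinary convolution weights in the exported graph; the inverse direction can be built the same way with a transposed convolution.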
Links
- speech-swift: Apple SDK with the MLX Chatterbox runtime
- Speech Studio: local speech generation and voice cloning app
- Docs: install and CLI docs
- soniqo.audio: website
- blog
License
MIT (inherited from upstream ResembleAI/chatterbox).
Responsible use
Voice cloning capability is included. Users are responsible for obtaining consent for any voice that is cloned and for not using the model to impersonate individuals without their permission, generate disinformation, or commit fraud.