Chatterbox-Multilingual T3 GGUF

End-to-end GGUF conversion of ResembleAI's Chatterbox multilingual T3 (t3_mtl23ls_v3.safetensors — 23-language coverage).

This release adopts the native codec_lm split used by CSM / Qwen3-TTS / MOSS-TTSD:

Backbone (chatterbox-mtl-t3-<quant>.gguf): stock llama-arch GGUF of Chatterbox T3's Llama-520M tfmr.* weights (30 layers, hidden size 1024, 16 heads, head_dim=64, MLP size 4096, llama3 RoPE scaling, rope_theta=500000, placeholder vocab with tokenizer.ggml.model = "none"). Runs in stock llama.cpp with embeddings=true.
Codec + codec_lm (chatterbox-mtl-codec-<quant>.gguf): Chatterbox S3G (flow-matching decoder + HiFi-GAN vocoder) bundled with the T3 LM-adaptor side (audio embed table, speech head, text embed/head, learned positional embeddings, cond encoder weights). Runs in codec.cpp as a parallel_heads_delay codec_lm with n_cb=1.
S3T tokenizer (chatterbox-mtl-s3t.gguf): speech tokenizer, needed only when registering a custom voice from a reference WAV.

Compared to the earlier release in this repo, the old t3-*.gguf (custom shape) and t3-extras.gguf are gone — everything the host runtime needs to drive the LM adaptor is now bundled into chatterbox-mtl-codec-*.gguf and exposed through codec.cpp's codec_lm API.

Files

Backbone (`chatterbox-mtl-t3-<quant>.gguf`)

File	Size
`chatterbox-mtl-t3-f32.gguf`	1.9 GB
`chatterbox-mtl-t3-f16.gguf`	961 MB
`chatterbox-mtl-t3-bf16.gguf`	961 MB
`chatterbox-mtl-t3-q8_0.gguf`	511 MB
`chatterbox-mtl-t3-q6_k.gguf`	395 MB
`chatterbox-mtl-t3-q5_1.gguf`	361 MB
`chatterbox-mtl-t3-q5_k_m.gguf`	340 MB
`chatterbox-mtl-t3-q5_k_s.gguf`	331 MB
`chatterbox-mtl-t3-q5_0.gguf`	331 MB
`chatterbox-mtl-t3-q4_1.gguf`	301 MB
`chatterbox-mtl-t3-q4_k_m.gguf`	289 MB
`chatterbox-mtl-t3-q4_k_s.gguf`	273 MB
`chatterbox-mtl-t3-q4_0.gguf`	271 MB
`chatterbox-mtl-t3-q3_k_l.gguf`	254 MB
`chatterbox-mtl-t3-q3_k_m.gguf`	232 MB
`chatterbox-mtl-t3-q3_k_s.gguf`	207 MB
`chatterbox-mtl-t3-q2_k.gguf`	177 MB
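
The unquantized sizes can be sanity-checked against the backbone shape listed earlier. The sketch below is a back-of-the-envelope estimate, not something produced by the conversion tooling. It assumes 16 key/value heads (no grouped-query attention), bias-free projections, and no token embedding or output head in the backbone file; none of these are stated in this README.

```python
# Backbone shape as listed in this README.
N_LAYERS = 30
HIDDEN = 1024
N_HEADS = 16
HEAD_DIM = 64
MLP = 4096


def backbone_params() -> int:
    """Count transformer-block weights only (assumed: no embedding table, no LM head)."""
    attn = 4 * HIDDEN * (N_HEADS * HEAD_DIM)  # q, k, v, o projections (assumes 16 KV heads)
    mlp = 3 * HIDDEN * MLP                    # gate, up, down projections
    norms = 2 * HIDDEN                        # two RMSNorm weight vectors per layer
    return N_LAYERS * (attn + mlp + norms) + HIDDEN  # plus the final norm


def size_mib(n_params: int, bytes_per_weight: float) -> float:
    """Raw tensor payload in MiB, ignoring GGUF metadata and alignment padding."""
    return n_params * bytes_per_weight / 2**20


params = backbone_params()
print(f"{params:,} parameters")
print(f"f16 ~ {size_mib(params, 2):.0f} MiB, f32 ~ {size_mib(params, 4) / 1024:.2f} GiB")
```

Under these assumptions the estimate comes out near 960 MiB for f16 and just under 1.9 GiB for f32, close to the 961 MB and 1.9 GB listed above, which suggests the table reports binary units.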

Codec + codec_lm (`chatterbox-mtl-codec-<quant>.gguf`)

File	Size
`chatterbox-mtl-codec-f32.gguf`	572 MB
`chatterbox-mtl-codec-f16.gguf`	317 MB
`chatterbox-mtl-codec-q8_0.gguf`	226 MB
`chatterbox-mtl-codec-q5_k_m.gguf`	190 MB
`chatterbox-mtl-codec-q4_k_m.gguf`	178 MB

S3T speech tokenizer

chatterbox-mtl-s3t.gguf (F16, 237 MB): needed for voice cloning. It encodes a reference WAV into the speech token IDs that the codec_lm consumes when registering a custom speaker. The weights are the same as those used by English Chatterbox.
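
The naming scheme above can be turned into a small helper that lists the files needed for a given setup. This is a minimal sketch: `release_files` and its parameters are hypothetical names introduced here for illustration, and the quant sets are copied from the tables above.

```python
from typing import List

# Quantizations listed in the tables above.
BACKBONE_QUANTS = {
    "f32", "f16", "bf16", "q8_0", "q6_k", "q5_1", "q5_k_m", "q5_k_s", "q5_0",
    "q4_1", "q4_k_m", "q4_k_s", "q4_0", "q3_k_l", "q3_k_m", "q3_k_s", "q2_k",
}
CODEC_QUANTS = {"f32", "f16", "q8_0", "q5_k_m", "q4_k_m"}
S3T_FILE = "chatterbox-mtl-s3t.gguf"


def release_files(backbone_quant: str, codec_quant: str, voice_clone: bool = False) -> List[str]:
    """Return the GGUF file names for one backbone/codec pairing (hypothetical helper)."""
    if backbone_quant not in BACKBONE_QUANTS:
        raise ValueError(f"no backbone file for quant {backbone_quant!r}")
    if codec_quant not in CODEC_QUANTS:
        raise ValueError(f"no codec file for quant {codec_quant!r}")
    files = [
        f"chatterbox-mtl-t3-{backbone_quant}.gguf",
        f"chatterbox-mtl-codec-{codec_quant}.gguf",
    ]
    if voice_clone:
        # The S3T tokenizer is only needed to register a voice from a reference WAV.
        files.append(S3T_FILE)
    return files


print(release_files("q4_k_m", "q8_0", voice_clone=True))
```

The backbone and codec quantizations are chosen independently, so a small backbone quant can be paired with a higher-precision codec file.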

Inference shape

Per-frame AR loop, single-codebook parallel_heads_delay:

backbone (Llama-520M, embeddings=true) hidden h
    → codec_lm_step_begin(state, h)
    → codec_lm_step_logits(0) → sample speech-token → codec_lm_step_push_code
    → codec_lm_step_finish → codes[1]
    → codec_lm_compose_audio_embd(codes) + speech_pos_emb[step] → next-step embedding
    → feed via b.embd; loop until stop_speech_token (6562)

Prompt prefix assembly (text + cond) lives inside codec.cpp's chatterbox path; the host application doesn't directly touch lm.chatterbox.* tensors.
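
The control flow of the per-frame loop can be sketched in Python as follows. This is an illustration of the loop's shape only, not the codec.cpp or llama.cpp API: `ToyCodecLM`, `generate`, `greedy`, and `backbone_step` are hypothetical stand-ins whose method names merely mirror the codec_lm_* calls, and all signatures, the greedy sampler, and the position indexing are assumptions.

```python
from typing import Callable, List, Sequence

STOP_SPEECH_TOKEN = 6562  # stop token named in the loop description
Vector = List[float]


class ToyCodecLM:
    """Hypothetical stand-in for the codec_lm side; it replays a scripted token sequence."""

    def __init__(self, script: Sequence[int], dim: int) -> None:
        self._script = list(script)
        self._i = 0
        self._dim = dim
        self._pending: List[int] = []

    def step_begin(self, h: Vector) -> None:
        self._pending = []  # a real adaptor would consume the hidden state h here

    def step_logits(self, codebook: int) -> List[float]:
        # Toy logits, just wide enough to include the stop token.
        logits = [0.0] * (STOP_SPEECH_TOKEN + 1)
        logits[self._script[self._i]] = 1.0
        return logits

    def step_push_code(self, code: int) -> None:
        self._pending.append(code)

    def step_finish(self) -> List[int]:
        self._i += 1
        return list(self._pending)  # one code per frame, since n_cb=1

    def compose_audio_embd(self, codes: Sequence[int]) -> Vector:
        return [float(codes[0])] * self._dim  # toy embedding lookup


def greedy(logits: Sequence[float]) -> int:
    """Pick the highest-scoring token (a real runtime would likely sample)."""
    return max(range(len(logits)), key=logits.__getitem__)


def generate(backbone_step: Callable[[Vector], Vector], codec_lm: ToyCodecLM,
             first_hidden: Vector, speech_pos_emb: Sequence[Vector],
             max_steps: int) -> List[int]:
    h = first_hidden  # hidden state after the prompt prefix
    tokens: List[int] = []
    for step in range(max_steps):
        codec_lm.step_begin(h)
        token = greedy(codec_lm.step_logits(0))  # single codebook, index 0
        codec_lm.step_push_code(token)
        codes = codec_lm.step_finish()
        if codes[0] == STOP_SPEECH_TOKEN:
            break
        tokens.append(codes[0])
        audio = codec_lm.compose_audio_embd(codes)
        # Position offset is assumed; the real runtime may index differently.
        embd = [a + p for a, p in zip(audio, speech_pos_emb[step])]
        h = backbone_step(embd)  # next-step embedding fed back to the backbone
    return tokens


dim = 4
positions = [[0.0] * dim for _ in range(8)]
lm = ToyCodecLM([11, 22, 33, STOP_SPEECH_TOKEN], dim)
print(generate(lambda e: e, lm, [0.0] * dim, positions, max_steps=8))
```

The `max_steps` cap is a design choice of this sketch: it bounds generation if the stop token is never produced.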

Sources

Upstream model: ResembleAI/chatterbox (t3_mtl23ls_v3.safetensors)
Conversion tooling: mybigday/codec.cpp (prep_chatterbox_t3 + lm_adaptor/chatterbox.py + ChatterboxS3GConverter with lm_source)
Inference runtime: mybigday/llama.rn

Supported languages

Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, Chinese (23 total).

Model tree

Repository: hans00/Chatterbox-Multilingual-TTS-GGUF
Base model: ResembleAI/chatterbox (this repository is a quantized derivative)