CSM-1B GGUF
GGUF conversion of Sesame's CSM-1B, a zero-shot conversational TTS model: Llama-3.2-1B backbone + Mimi 32-codebook audio codec + a 4-layer residual-depth-AR decoder that emits 32 RVQ codebook codes per AR step.
The stack splits across two runtimes, each owning the part it's good at:
- Backbone (LLM part): Llama-3.2-1B with CSM's `embed_text_tokens` mapped onto the standard `model.embed_tokens.weight` slot, so plain llama.cpp tokenizes + embeds text natively. Runs in llama.cpp (`embeddings=true` mode: the hidden state is read, not logits).
- Codec + codec_lm (audio part): Mimi codec (24 kHz mono, 12.5 Hz frame rate) bundled with the `residual_depth_ar` codec_lm adaptor (32 audio embed tables + c0 head + 4-layer depth decoder + 31 codebooks_head slices). Runs in codec.cpp.
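As a quick sanity check on the figures above, the sketch below derives the samples per frame and the code rate from the stated 24 kHz sample rate, 12.5 Hz frame rate, and 32 codebooks. The constant and function names are illustrative only; they are not part of codec.cpp or llama.rn.

```typescript
// Figures taken from the description above; names are illustrative only.
const SAMPLE_RATE_HZ = 24000 // Mimi output sample rate
const FRAME_RATE_HZ = 12.5 // codec frames per second
const CODES_PER_FRAME = 32 // one code per RVQ codebook

// PCM samples covered by one codec frame.
const samplesPerFrame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

// RVQ codes the model has to produce for one second of audio.
const codesPerSecond = FRAME_RATE_HZ * CODES_PER_FRAME

// Audio duration in seconds for a given number of generated frames.
function framesToSeconds(frames: number): number {
  return frames / FRAME_RATE_HZ
}

console.log(samplesPerFrame, codesPerSecond, framesToSeconds(500)) // 1920 400 40
```

A cap of 500 frames therefore corresponds to 40 seconds of audio.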
Inference shape: backbone hidden state → `codec_lm_step_*` state machine (1 c0 head + 31 depth-AR steps per frame → 32 codes) → `codec_lm_compose_audio_embd` → fed back into the backbone as the next position's input embedding. Stop on `codes[0] == 0` at step > 0 (training-time audio-EOS marker).
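The loop described above can be sketched roughly as follows. The `Backbone` and `CodecLm` interfaces are hypothetical stand-ins for the two runtimes; they do not reflect the real llama.cpp or codec.cpp signatures. The sketch also assumes that the EOS frame itself is dropped from the output, which the description does not state.

```typescript
// Hypothetical interfaces: they mirror the roles described above, not the
// actual llama.cpp or codec.cpp APIs.
interface Backbone {
  // Consumes one input embedding and returns the resulting hidden state.
  forward(inputEmbd: Float32Array): Float32Array
}
interface CodecLm {
  sampleC0(hidden: Float32Array): number // codebook 0, from the c0 head
  sampleDepth(hidden: Float32Array, prev: number[]): number // codebooks 1..31
  composeAudioEmbd(codes: number[]): Float32Array // next backbone input
}

const NUM_CODEBOOKS = 32
const AUDIO_EOS = 0

// promptHidden is the backbone hidden state after the text prompt is processed.
function generateFrames(
  backbone: Backbone,
  lm: CodecLm,
  promptHidden: Float32Array,
  maxFrames: number,
): number[][] {
  const frames: number[][] = []
  let hidden = promptHidden
  for (let step = 0; step < maxFrames; step++) {
    const codes = [lm.sampleC0(hidden)]
    // Assumption: the EOS frame is not emitted as audio.
    if (step > 0 && codes[0] === AUDIO_EOS) break
    // 31 depth-AR steps, each conditioned on the codes sampled so far.
    while (codes.length < NUM_CODEBOOKS) {
      codes.push(lm.sampleDepth(hidden, codes))
    }
    frames.push(codes)
    // Feed the composed audio embedding back in as the next input.
    hidden = backbone.forward(lm.composeAudioEmbd(codes))
  }
  return frames
}
```

Each returned frame holds 32 codes, one per codebook, in the order they were sampled.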
Files
Backbone (llama arch, vocab 128256, hidden 2048, 16 layers)
`csm-1b-<quant>.gguf`

| File | Size |
|---|---|
| `csm-1b-f32.gguf` | 4.61 GB |
| `csm-1b-f16.gguf` | 2.31 GB |
| `csm-1b-bf16.gguf` | 2.31 GB |
| `csm-1b-q8_0.gguf` | 1.23 GB |
| `csm-1b-q6_k.gguf` | 974 MB |
| `csm-1b-q5_1.gguf` | 909 MB |
| `csm-1b-q5_k_m.gguf` | 869 MB |
| `csm-1b-q5_k_s.gguf` | 851 MB |
| `csm-1b-q5_0.gguf` | 851 MB |
| `csm-1b-q4_1.gguf` | 793 MB |
| `csm-1b-q4_k_m.gguf` | 770 MB |
| `csm-1b-q4_k_s.gguf` | 739 MB |
| `csm-1b-q4_0.gguf` | 735 MB |
| `csm-1b-q3_k_l.gguf` | 698 MB |
| `csm-1b-q3_k_m.gguf` | 659 MB |
| `csm-1b-q3_k_s.gguf` | 612 MB |
| `csm-1b-q2_k.gguf` | 554 MB |
Codec + codec_lm (Mimi, 24 kHz mono, 32 RVQ codebooks × 2051; residual_depth_ar codec_lm)

`codec[-<quant>].gguf`

| File | Size |
|---|---|
| `codec-f32.gguf` | 1.11 GB |
| `codec-f16.gguf` | 871 MB |
| `codec-q8_0.gguf` | 803 MB |
| `codec-q5_k_m.gguf` | 776 MB |
| `codec-q4_k_m.gguf` | 767 MB |
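To put the codec sizes above in perspective, the sketch below computes how much each codec quant saves relative to `codec-q8_0.gguf`, and the combined size when paired with the Q4_K_M backbone. The numbers are copied from the two tables; the helper names are illustrative.

```typescript
// Sizes in MB, copied from the tables above.
const codecSizesMB = { f16: 871, q8_0: 803, q5_k_m: 776, q4_k_m: 767 }
const backboneQ4KmMB = 770 // csm-1b-q4_k_m.gguf

type CodecQuant = keyof typeof codecSizesMB

// Saving of a codec quant relative to codec-q8_0.gguf.
function savingVsQ8(quant: CodecQuant): { mb: number; percent: number } {
  const mb = codecSizesMB.q8_0 - codecSizesMB[quant]
  return { mb, percent: (100 * mb) / codecSizesMB.q8_0 }
}

// Combined size of the Q4_K_M backbone plus the given codec file.
function pairTotalMB(quant: CodecQuant): number {
  return backboneQ4KmMB + codecSizesMB[quant]
}

console.log(savingVsQ8('q4_k_m').mb, savingVsQ8('q5_k_m').mb) // 36 27
console.log(pairTotalMB('q8_0'), pairTotalMB('q4_k_m')) // 1573 1537
```

Going from Q8_0 to Q4_K_M on the codec saves 36 MB, under 5% of the codec file.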
Mimi is mostly small conv kernels whose row sizes don't meet the K-quant block-size requirements, so `Q4_K_M`/`Q5_K_M` save little over `Q8_0` (767 MB and 776 MB vs. 803 MB). For low disk + RAM use, pair the backbone quants with `codec-q8_0.gguf`.
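The block-size constraint can be illustrated with a small check. It assumes ggml's usual layout, where K-quants pack weights in super-blocks of 256 values and `Q8_0` uses blocks of 32. The row sizes below are made up for illustration and were not read from `codec.gguf`. How the converter handles tensors that fail the check is not covered here.

```typescript
// Assumed ggml block sizes: 256-value super-blocks for K-quants, 32 for Q8_0.
const K_QUANT_BLOCK = 256
const Q8_0_BLOCK = 32

// A tensor row can be stored in a quant type only if its length is a
// whole number of blocks.
function fitsBlock(rowSize: number, blockSize: number): boolean {
  return rowSize % blockSize === 0
}

// Hypothetical row sizes, for illustration only.
const exampleRows = [64, 160, 512, 2048]
for (const n of exampleRows) {
  console.log(n, 'K-quant:', fitsBlock(n, K_QUANT_BLOCK), 'Q8_0:', fitsBlock(n, Q8_0_BLOCK))
}
```

Short rows such as 64 or 160 pass the `Q8_0` check but not the K-quant one, which is consistent with the small gap between the K-quant and `Q8_0` file sizes.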
Usage with llama.rn
llama.rn's TTS layer auto-detects this model via the codec.gguf's codec.lm.* metadata and routes through the codec_lm AR path:
import { initLlama, loadLlamaModelInfo } from 'llama.rn'

const ctx = await initLlama({
  model: 'csm-1b-q4_k_m.gguf',
  vocoder: { path: 'codec-q4_k_m.gguf' },
  n_ctx: 4096,
})

const fmt = await ctx.getFormattedAudioCompletion({
  prompt: 'Hello, world!',
  // CSM is zero-shot: `speaker: { id: 0 }` or `{ id: 1 }` picks one of the
  // two trained speakers. Omit to default to speaker 0.
})
// fmt.flow === 'codec_lm_ar' for CSM.

const { codes } = await ctx.generateAudioCodes({
  prompt: fmt.prompt,
  maxFrames: 500,
  temperature: 0.9,
  topP: 0.95,
  topK: 50,
})

const pcm = await ctx.decodeAudioTokens(codes)
// pcm is Float32-PCM at 24 kHz; feed it into your audio player of choice.
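If your audio player expects a file rather than raw samples, one option is to wrap the PCM in a WAV container. The helper below is a minimal sketch, not part of llama.rn: `encodeWav` is a hypothetical name, and it assumes the decoded `pcm` is available as a mono `Float32Array` with samples in the range -1 to 1.

```typescript
// Encode mono Float32 PCM (range -1..1) as an in-memory 16-bit PCM WAV file.
// Assumption: the input is a Float32Array; convert first if it is not.
function encodeWav(pcm: Float32Array, sampleRate = 24000): Uint8Array {
  const bytesPerSample = 2
  const dataSize = pcm.length * bytesPerSample
  const buf = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buf)
  const writeTag = (offset: number, tag: string) => {
    for (let i = 0; i < tag.length; i++) view.setUint8(offset + i, tag.charCodeAt(i))
  }
  writeTag(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true) // file size minus the first 8 bytes
  writeTag(8, 'WAVE')
  writeTag(12, 'fmt ')
  view.setUint32(16, 16, true) // fmt chunk size
  view.setUint16(20, 1, true) // format 1 = integer PCM
  view.setUint16(22, 1, true) // channels: mono
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * bytesPerSample, true) // byte rate
  view.setUint16(32, bytesPerSample, true) // block align
  view.setUint16(34, 16, true) // bits per sample
  writeTag(36, 'data')
  view.setUint32(40, dataSize, true)
  for (let i = 0; i < pcm.length; i++) {
    const s = Math.max(-1, Math.min(1, pcm[i])) // clamp before scaling
    view.setInt16(44 + i * 2, Math.round(s * 32767), true)
  }
  return new Uint8Array(buf)
}
```

The returned bytes can be written to disk or handed to any player that accepts WAV data. The 24 kHz default matches the codec's output rate.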
For users running CLI / parity tests against the HF reference, see examples/tts.py --model csm in codec.cpp.
Notes
- Zero-shot: no speaker config / reference audio is needed. The model was trained on two speakers (IDs `0` and `1`); the prompt format is `<|begin_of_text|>[<speaker>]<text><|end_of_text|>`.
- Voice control: speaker timbre comes from the speaker tag (`[0]` vs `[1]`); finer control isn't exposed by CSM.
- License: CSM is released under Apache-2.0 by Sesame AI. See the upstream model card for full terms.
- Tokenizer hash patch: CSM's bundled Llama-3 tokenizer hits an unrecognised BPE pre-tokenizer hash in older `convert_hf_to_gguf.py` versions; codec.cpp's `convert-backbone-to-gguf.py prep_csm` injects a runtime patch mapping unknown hashes to `llama-bpe` (the regex family is identical, and the tokenizer isn't used at codec_lm-driven inference anyway).
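For reference, the prompt format from the notes above can be written out as a small helper. `buildCsmPrompt` is a hypothetical name, and llama.rn presumably assembles this string itself. The sketch also assumes the speaker ID is written as a bare integer inside the brackets.

```typescript
// Build a prompt in the format <|begin_of_text|>[<speaker>]<text><|end_of_text|>.
// Assumption: the speaker tag is the bare integer ID in square brackets.
function buildCsmPrompt(text: string, speaker: 0 | 1 = 0): string {
  return `<|begin_of_text|>[${speaker}]${text}<|end_of_text|>`
}

console.log(buildCsmPrompt('Hello, world!', 1))
// <|begin_of_text|>[1]Hello, world!<|end_of_text|>
```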
Sources
- Original model: `sesame/csm-1b`
- Conversion tooling: `mybigday/codec.cpp` (scripts: `convert-backbone-to-gguf.py prep_csm` + `convert-to-gguf.py` with the auto-dispatched `CsmConverter`)
- Inference runtime: `mybigday/llama.rn` (codec_lm AR path lands in `cpp/rn-tts.cpp::generateAudioCodes`)
Model tree for hans00/CSM-1B-GGUF
Base model
sesame/csm-1b