CSM-1B GGUF

GGUF conversion of Sesame's CSM-1B — a zero-shot conversational TTS model: Llama-3.2-1B backbone + Mimi 32-codebook audio codec + a 4-layer residual-depth-AR decoder that emits 32 RVQ codebook codes per AR step.

The stack splits across two runtimes, each owning the part it's good at:

Backbone (LLM-part) — Llama-3.2-1B with CSM's embed_text_tokens mapped onto the standard model.embed_tokens.weight slot, so plain llama.cpp tokenizes + embeds text natively. Runs in llama.cpp (embeddings=true mode — hidden state is read, not logits).
Codec + codec_lm (Audio-part) — Mimi codec (24 kHz mono, 12.5 Hz frame rate) bundled with the residual_depth_ar codec_lm adaptor (32 audio embed tables + c0 head + 4-layer depth decoder + 31 codebooks_head slices). Runs in codec.cpp.
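
The codec figures above imply some fixed per-frame arithmetic. A minimal sketch follows; the constant and function names are illustrative only and are not part of codec.cpp or llama.rn.

```typescript
// Figures taken from the description above.
const SAMPLE_RATE_HZ = 24000 // Mimi output sample rate
const FRAME_RATE_HZ = 12.5   // Mimi frame rate
const NUM_CODEBOOKS = 32     // RVQ codes emitted per frame

// PCM samples covered by one codec frame: 24000 / 12.5 = 1920.
const samplesPerFrame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

// Codes the codec_lm has to produce per second of audio: 12.5 * 32 = 400.
const codesPerSecond = FRAME_RATE_HZ * NUM_CODEBOOKS

// Audio duration, in seconds, for a given number of generated frames.
function framesToSeconds(frames: number): number {
  return frames / FRAME_RATE_HZ
}

console.log(samplesPerFrame, codesPerSecond, framesToSeconds(500))
```

A cap of 500 frames therefore corresponds to at most 40 seconds of audio.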

Inference shape: backbone hidden state → codec_lm_step_* state machine (1 c0 head + 31 depth-AR steps per frame → 32 codes) → codec_lm_compose_audio_embd → fed back into backbone as the next position's input embedding. Stop on codes[0] == 0 at step > 0 (training-time audio-EOS marker).
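
The loop described above can be sketched as follows. The `Backbone` and `CodecLm` interfaces are hypothetical stand-ins for the roles named in the text (they are not the real llama.cpp or codec.cpp APIs), prompt prefill is collapsed into a single input embedding for brevity, and checking the stop condition before the depth steps is an assumption about ordering.

```typescript
// Hypothetical interfaces: they mirror the roles named above, not real APIs.
interface Backbone {
  // Consumes one input embedding and returns the last hidden state.
  forward(inputEmbd: Float32Array): Float32Array
}
interface CodecLm {
  sampleC0(hidden: Float32Array): number                    // c0 head
  sampleDepth(hidden: Float32Array, prev: number[]): number // one depth-AR step
  composeAudioEmbd(codes: number[]): Float32Array           // 32 codes -> next input
}

const NUM_CODEBOOKS = 32
const AUDIO_EOS = 0

function generateFrames(
  backbone: Backbone,
  codecLm: CodecLm,
  promptEmbd: Float32Array,
  maxFrames: number,
): number[][] {
  const frames: number[][] = []
  let input = promptEmbd
  for (let step = 0; step < maxFrames; step++) {
    const hidden = backbone.forward(input)
    const codes = [codecLm.sampleC0(hidden)]
    // Stop rule from the text: codes[0] == 0 at step > 0.
    if (step > 0 && codes[0] === AUDIO_EOS) break
    // 31 depth-AR steps fill in the remaining codebooks.
    while (codes.length < NUM_CODEBOOKS) {
      codes.push(codecLm.sampleDepth(hidden, codes))
    }
    frames.push(codes)
    // The composed audio embedding becomes the next position's input.
    input = codecLm.composeAudioEmbd(codes)
  }
  return frames
}

// Stub models so the sketch runs on its own: c0 becomes 0 on the fourth call.
let calls = 0
const stubBackbone: Backbone = { forward: (e) => e }
const stubCodecLm: CodecLm = {
  sampleC0: () => (calls++ < 3 ? 7 : AUDIO_EOS),
  sampleDepth: (_h, prev) => prev.length,
  composeAudioEmbd: (codes) => Float32Array.from(codes),
}
const frames = generateFrames(stubBackbone, stubCodecLm, new Float32Array(4), 500)
console.log(frames.length, frames[0].length)
```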

Files

Backbone (llama arch, vocab 128256, hidden 2048, 28 layers)

csm-1b-<quant>.gguf

| File | Size |
| --- | --- |
| `csm-1b-f32.gguf` | 4.61 GB |
| `csm-1b-f16.gguf` | 2.31 GB |
| `csm-1b-bf16.gguf` | 2.31 GB |
| `csm-1b-q8_0.gguf` | 1.23 GB |
| `csm-1b-q6_k.gguf` | 974 MB |
| `csm-1b-q5_1.gguf` | 909 MB |
| `csm-1b-q5_k_m.gguf` | 869 MB |
| `csm-1b-q5_k_s.gguf` | 851 MB |
| `csm-1b-q5_0.gguf` | 851 MB |
| `csm-1b-q4_1.gguf` | 793 MB |
| `csm-1b-q4_k_m.gguf` | 770 MB |
| `csm-1b-q4_k_s.gguf` | 739 MB |
| `csm-1b-q4_0.gguf` | 735 MB |
| `csm-1b-q3_k_l.gguf` | 698 MB |
| `csm-1b-q3_k_m.gguf` | 659 MB |
| `csm-1b-q3_k_s.gguf` | 612 MB |
| `csm-1b-q2_k.gguf` | 554 MB |

Codec + codec_lm (Mimi, 24 kHz mono, 32 RVQ codebooks × 2051; `residual_depth_ar` codec_lm)

codec[-<quant>].gguf

| File | Size |
| --- | --- |
| `codec-f32.gguf` | 1.11 GB |
| `codec-f16.gguf` | 871 MB |
| `codec-q8_0.gguf` | 803 MB |
| `codec-q5_k_m.gguf` | 776 MB |
| `codec-q4_k_m.gguf` | 767 MB |
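
Since every deployment needs one backbone file plus one codec file, the combined download size is a simple sum. The sketch below copies a subset of sizes from the two tables; the helper name `pairMB` is illustrative, GB entries are converted assuming 1 GB = 1000 MB, and on-disk size is only a rough proxy for RAM use.

```typescript
// Sizes in MB, copied from the tables above (GB entries converted at 1 GB = 1000 MB).
const backboneMB: Record<string, number> = {
  q8_0: 1230, q6_k: 974, q5_k_m: 869, q4_k_m: 770, q3_k_m: 659, q2_k: 554,
}
const codecMB: Record<string, number> = {
  f32: 1110, f16: 871, q8_0: 803, q5_k_m: 776, q4_k_m: 767,
}

// Combined download size for one backbone + codec pairing.
function pairMB(backbone: string, codec: string): number {
  const b = backboneMB[backbone]
  const c = codecMB[codec]
  if (b === undefined || c === undefined) throw new Error('unknown quant')
  return b + c
}

console.log(pairMB('q4_k_m', 'q8_0'))   // 1573
console.log(pairMB('q4_k_m', 'q4_k_m')) // 1537
```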

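Mimi consists mostly of small convolution kernels whose row sizes are not multiples of the 256-element K-quant block size, so Q4_K_M (767 MB) and Q5_K_M (776 MB) save little over Q8_0 (803 MB). Because the extra savings are small, codec-q8_0.gguf is a reasonable pairing for any backbone quant, even when disk and RAM are tight.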
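
The block-size constraint can be illustrated with a small check. In ggml, K-quants pack 256-element super-blocks and Q8_0 packs 32-element blocks; the selection policy and the row lengths below are hypothetical and only demonstrate the divisibility rule, not codec.cpp's actual fallback behavior.

```typescript
// Block sizes used by ggml quant formats.
const K_QUANT_BLOCK = 256 // K-quant super-block
const Q8_0_BLOCK = 32     // Q8_0 block

type Choice = 'k-quant' | 'q8_0' | 'unquantized'

// Illustrative policy only: pick the most compact format whose block size
// divides the row length. The real converter's rules may differ.
function pickFormat(rowLength: number): Choice {
  if (rowLength % K_QUANT_BLOCK === 0) return 'k-quant'
  if (rowLength % Q8_0_BLOCK === 0) return 'q8_0'
  return 'unquantized'
}

// Hypothetical row lengths, not read from codec.gguf.
for (const row of [2048, 96, 7]) {
  console.log(row, pickFormat(row))
}
```

Under this rule only tensors with wide rows benefit from a K-quant, which is consistent with the small gap between the Q8_0 and K-quant codec files.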

Usage with llama.rn

llama.rn's TTS layer auto-detects this model via the codec.gguf's codec.lm.* metadata and routes through the codec_lm AR path:

import { initLlama } from 'llama.rn'

const ctx = await initLlama({
  model: 'csm-1b-q4_k_m.gguf',
  vocoder: { path: 'codec-q4_k_m.gguf' },
  n_ctx: 4096,
})

const fmt = await ctx.getFormattedAudioCompletion({
  prompt: 'Hello, world!',
  // CSM is zero-shot — `speaker: { id: 0 }` or `{ id: 1 }` picks one of the
  // two trained speakers.  Omit to default to speaker 0.
})

// fmt.flow === 'codec_lm_ar' for CSM.
const { codes } = await ctx.generateAudioCodes({
  prompt: fmt.prompt,
  maxFrames: 500,
  temperature: 0.9,
  topP: 0.95,
  topK: 50,
})

const pcm = await ctx.decodeAudioTokens(codes)
// pcm is Float32-PCM at 24 kHz; feed it into your audio player of choice.
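
Many players expect a container rather than raw samples. The following is a minimal sketch of wrapping 24 kHz mono float PCM in a 16-bit WAV file; `encodeWav` is a hypothetical helper, not part of llama.rn, and it assumes the samples are available as a `Float32Array` (a plain number array can be converted with `Float32Array.from`).

```typescript
// Encode mono Float32 PCM (range -1..1) as a 16-bit PCM WAV file.
function encodeWav(pcm: Float32Array, sampleRate = 24000): Uint8Array {
  const bytesPerSample = 2
  const dataSize = pcm.length * bytesPerSample
  const buf = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buf)
  const ascii = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i))
  }
  ascii(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true)   // RIFF chunk size
  ascii(8, 'WAVE')
  ascii(12, 'fmt ')
  view.setUint32(16, 16, true)             // fmt chunk size
  view.setUint16(20, 1, true)              // format 1 = integer PCM
  view.setUint16(22, 1, true)              // channels: mono
  view.setUint32(24, sampleRate, true)     // sample rate
  view.setUint32(28, sampleRate * bytesPerSample, true) // byte rate
  view.setUint16(32, bytesPerSample, true) // block align
  view.setUint16(34, 16, true)             // bits per sample
  ascii(36, 'data')
  view.setUint32(40, dataSize, true)       // data chunk size
  for (let i = 0; i < pcm.length; i++) {
    const s = Math.max(-1, Math.min(1, pcm[i])) // clamp before scaling
    view.setInt16(44 + i * 2, Math.round(s * 32767), true)
  }
  return new Uint8Array(buf)
}

const wav = encodeWav(new Float32Array([0, 0.5, -1, 2]))
console.log(wav.length) // 52
```

The resulting bytes can be written to a file or handed to any player that accepts WAV data.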

To run CLI or parity tests against the HF reference, see examples/tts.py --model csm in codec.cpp.

Notes

Zero-shot: no speaker config / reference audio is needed. The model was trained on two speakers (IDs 0 and 1); the prompt format is <|begin_of_text|>[<speaker>]<text><|end_of_text|>.
Voice control: speaker timbre comes from the speaker tag ([0] vs [1]); finer control isn't exposed by CSM.
License: CSM is released under Apache-2.0 by Sesame AI. See the upstream model card for full terms.
Tokenizer hash patch: CSM's bundled Llama-3 tokenizer hits an unrecognised BPE pre-tokenizer hash in older convert_hf_to_gguf.py versions; codec.cpp's convert-backbone-to-gguf.py prep_csm injects a runtime patch that maps unknown hashes to llama-bpe. This is safe because the regex family is identical, and the tokenizer isn't used at codec_lm-driven inference anyway.
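
The prompt format given in the notes above can be assembled with a one-line helper. This is only a sketch of the string layout: `buildCsmPrompt` is a hypothetical name, and llama.rn presumably builds the prompt itself inside getFormattedAudioCompletion, so a helper like this would only matter when driving the backbone directly.

```typescript
// The two trained speakers.
type SpeakerId = 0 | 1

// Build the raw prompt string: <|begin_of_text|>[<speaker>]<text><|end_of_text|>
function buildCsmPrompt(text: string, speaker: SpeakerId = 0): string {
  return `<|begin_of_text|>[${speaker}]${text}<|end_of_text|>`
}

console.log(buildCsmPrompt('Hello, world!'))
console.log(buildCsmPrompt('Hi there.', 1))
```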

Sources

Original model: sesame/csm-1b
Conversion tooling: mybigday/codec.cpp (scripts: convert-backbone-to-gguf.py prep_csm + convert-to-gguf.py with the auto-dispatched CsmConverter)
Inference runtime: mybigday/llama.rn (codec_lm AR path lands in cpp/rn-tts.cpp::generateAudioCodes)
