Pocket TTS — GGUF (ggml-quantised)

GGUF / ggml conversion of kyutai/pocket-tts for use with CrispStrobe/CrispASR.

Pocket TTS is a lightweight (~100M parameters) continuous-latent autoregressive TTS model from Kyutai, based on the CALM paper (arXiv:2509.06926). Unlike codebook-based TTS models, it emits continuous float vectors, with no discrete tokens and no softmax sampling. Its components are:

FlowLM backbone — causal transformer (1024D, 16 heads, 6 layers, RoPE, GELU) operating at 12.5 Hz
Consistency head — SimpleMLPAdaLN (512D, 6 ResBlocks) with timestep embedding → one-step LSD decode → 32-dim continuous latent vectors
Mimi VAE decoder — SEANet upsample convolutions + 2-layer transformer → 24 kHz PCM
Text tokenizer — SentencePiece BPE (4000 vocab, embedded in GGUF)
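
The two rates quoted above fix how many PCM samples each autoregressive step has to produce. A minimal sketch of that arithmetic, using only the 12.5 Hz, 24 kHz and 32-dim figures from the list; the constant and function names are illustrative and not part of CrispASR:

```python
import math

FRAME_RATE_HZ = 12.5      # FlowLM backbone step rate (from the list above)
SAMPLE_RATE_HZ = 24_000   # Mimi decoder output rate (from the list above)
LATENT_DIM = 32           # size of each continuous latent vector

# Each latent frame must be expanded to this many PCM samples by the decoder.
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def frames_for_duration(seconds: float) -> int:
    """Number of latent frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * FRAME_RATE_HZ)


def latent_floats_for_duration(seconds: float) -> int:
    """Total float values emitted as latents for `seconds` of audio."""
    return frames_for_duration(seconds) * LATENT_DIM


if __name__ == "__main__":
    print(SAMPLES_PER_FRAME)          # 1920
    print(frames_for_duration(2.0))   # 25
```

So one backbone step corresponds to 80 ms of audio, which is why a short sentence needs only a few dozen steps.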

Everything ships in a single GGUF file: the Mimi decoder weights are embedded, so no separate codec companion file is needed.

Released under the CC-BY-4.0 license.

Files

File	Quant	Size	Notes
`pocket-tts-english-novc-f16.gguf`	F16	381 MB	Unquantised 16-bit floats; reference quality
`pocket-tts-english-novc-q4_k.gguf`	Q4_K	60 MB	Smallest file; fast inference
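
Either file can also be fetched from Python rather than with `huggingface-cli`. The sketch below assumes the `huggingface_hub` package is installed and reuses the `cstr/pocket-tts-GGUF` repo id from the Quick start command; `download_model` and the `FILES` mapping are hypothetical helpers, not part of CrispASR:

```python
REPO_ID = "cstr/pocket-tts-GGUF"  # repo id as used in the Quick start command

# Short quant label -> file name, copied from the table above.
FILES = {
    "f16": "pocket-tts-english-novc-f16.gguf",
    "q4_k": "pocket-tts-english-novc-q4_k.gguf",
}


def download_model(quant: str = "q4_k", local_dir: str = ".") -> str:
    """Fetch one of the listed files and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=FILES[quant],
        local_dir=local_dir,
    )
```

Calling `download_model("f16")` should place the F16 file in the current directory, matching what the CLI download step does.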

Quick start

# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli

# 2. Download model
huggingface-cli download cstr/pocket-tts-GGUF pocket-tts-english-novc-f16.gguf --local-dir .

# 3. Synthesize
./build/bin/crispasr --backend pocket-tts -m pocket-tts-english-novc-f16.gguf \
    --tts "Hello, how are you today?" \
    --tts-output hello.wav --seed 42

Or with auto-download:

./build/bin/crispasr -m pocket-tts --auto-download \
    --tts "The quick brown fox jumps over the lazy dog." \
    --tts-output fox.wav
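
For batch jobs, the explicit-backend invocation from step 3 can be driven from a script. This is a rough sketch that only reuses the flags shown above (`--backend`, `-m`, `--tts`, `--tts-output`, `--seed`); the helper names and the default binary path are assumptions:

```python
import subprocess
from typing import List, Optional


def build_tts_command(
    model: str,
    text: str,
    output: str,
    seed: Optional[int] = None,
    binary: str = "./build/bin/crispasr",  # path produced by the build step
) -> List[str]:
    """Assemble the argument list for one synthesis run."""
    cmd = [
        binary,
        "--backend", "pocket-tts",
        "-m", model,
        "--tts", text,
        "--tts-output", output,
    ]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


def synthesize(model: str, text: str, output: str, seed: Optional[int] = None) -> None:
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tts_command(model, text, output, seed), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.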

Python binding

from crispasr import Session

sess = Session("pocket-tts-english-novc-f16.gguf")
sess.set_tts_seed(42)
pcm = sess.synthesize("Hello world.")
sess.write_wav("hello.wav", pcm)
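
To sanity-check the written file, its header can be inspected with Python's standard `wave` module. The sketch assumes the output is integer PCM; `wave` rejects other encodings with `wave.Error`. `describe_wav` is an illustrative helper, not part of the `crispasr` binding:

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic header fields from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

If the file is written at the decoder's native rate, `describe_wav("hello.wav")["sample_rate_hz"]` should report 24000.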

Conversion

Converted with models/convert-pocket-tts-to-gguf.py from the CrispASR repo. The Mimi decoder and SentencePiece tokenizer are embedded in the single GGUF.
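
A converted file can be checked without CrispASR by reading the fixed GGUF header, which reports how many tensors and metadata entries the file carries. This is a minimal sketch for little-endian GGUF v2/v3 files. The metadata keys written by the converter are not listed here, so the sketch stops at the header, and `read_gguf_header` is an illustrative name:

```python
import struct
from typing import BinaryIO, NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Parse the fixed-size header at the start of a GGUF v2/v3 file."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    raw = stream.read(20)
    if len(raw) != 20:
        raise ValueError("truncated GGUF header")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw)
    return GGUFHeader(version, tensor_count, kv_count)
```

Usage, assuming the file from the Quick start is in the current directory:

```python
with open("pocket-tts-english-novc-f16.gguf", "rb") as f:
    print(read_gguf_header(f))
```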

Paper

Continuous Audio Language Models, arXiv:2509.06926, published Sep 8, 2025.