Pocket TTS: GGUF (ggml-quantised)

GGUF / ggml conversion of kyutai/pocket-tts for use with CrispStrobe/CrispASR.

Pocket TTS is a lightweight (~100M param) continuous-latent autoregressive TTS model from Kyutai, based on the CALM paper (arXiv:2509.06926). Unlike codebook-based TTS models, Pocket TTS emits continuous float vectors, with no discrete tokens and no softmax sampling:

  • FlowLM backbone: causal transformer (1024D, 16 heads, 6 layers, RoPE, GELU) operating at 12.5 Hz
  • Consistency head: SimpleMLPAdaLN (512D, 6 ResBlocks) with timestep embedding → one-step LSD decode → 32-dim continuous latent vectors
  • Mimi VAE decoder: SEANet upsample convolutions + 2-layer transformer → 24 kHz PCM
  • Text tokenizer: SentencePiece BPE (4000 vocab, embedded in GGUF)
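
The rates in the list above imply a fixed ratio between latent frames and audio samples. The following is a minimal sketch of that arithmetic, assuming each 12.5 Hz latent frame maps to an equal share of the 24 kHz output. The constant and function names are illustrative and are not part of CrispASR.

```python
import math

# Values taken from the component list above.
FRAME_RATE_HZ = 12.5      # FlowLM backbone step rate
SAMPLE_RATE_HZ = 24_000   # Mimi decoder output rate
LATENT_DIM = 32           # floats per latent frame


def samples_per_frame() -> int:
    """PCM samples per latent frame (assumes an even split of the output)."""
    return int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def frames_for_duration(seconds: float) -> int:
    """Latent frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * FRAME_RATE_HZ)


def latent_floats_for_duration(seconds: float) -> int:
    """Total continuous values the model emits for `seconds` of audio."""
    return frames_for_duration(seconds) * LATENT_DIM


if __name__ == "__main__":
    print(samples_per_frame())              # 1920
    print(frames_for_duration(2.0))         # 25
    print(latent_floats_for_duration(2.0))  # 800
```

Under these assumptions, two seconds of speech correspond to 25 autoregressive steps, which is one reason a model of this size can run quickly on a CPU.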

Single GGUF file: no separate codec companion is needed, because the Mimi decoder weights are embedded.

Released under the CC-BY-4.0 license.

Files

| File | Quant | Size | Notes |
|---|---|---|---|
| pocket-tts-english-novc-f16.gguf | F16 | 381 MB | Full precision; reference quality |
| pocket-tts-english-novc-q4_k.gguf | Q4_K | 60 MB | Smallest; fast inference |

Quick start

# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli

# 2. Download model
huggingface-cli download cstr/pocket-tts-GGUF pocket-tts-english-novc-f16.gguf --local-dir .

# 3. Synthesize
./build/bin/crispasr --backend pocket-tts -m pocket-tts-english-novc-f16.gguf \
    --tts "Hello, how are you today?" \
    --tts-output hello.wav --seed 42

Or with auto-download:

./build/bin/crispasr -m pocket-tts --auto-download \
    --tts "The quick brown fox jumps over the lazy dog." \
    --tts-output fox.wav
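
To synthesize several sentences in one go, the CLI call can be wrapped from Python. This is a sketch only: it uses just the flags shown in the commands above, and the helper names, the default binary path, and the output naming scheme are assumptions for illustration.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_tts_command(binary: str, model: str, text: str, output: str,
                      seed: Optional[int] = None) -> List[str]:
    """Assemble the argument list using only flags from the quick start."""
    cmd = [binary, "--backend", "pocket-tts", "-m", model,
           "--tts", text, "--tts-output", output]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


def synthesize_all(lines: List[str], out_dir: str,
                   binary: str = "./build/bin/crispasr",
                   model: str = "pocket-tts-english-novc-f16.gguf",
                   seed: Optional[int] = 42) -> List[Path]:
    """Run one CLI call per line and return the WAV paths (hypothetical naming)."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    outputs = []
    for i, text in enumerate(lines):
        wav = target / f"line_{i:03d}.wav"
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_tts_command(binary, model, text, str(wav), seed),
                       check=True)
        outputs.append(wav)
    return outputs
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or punctuation. Each call reloads the model, so for large batches the Python binding may be the better fit.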

Python binding

from crispasr import Session

sess = Session("pocket-tts-english-novc-f16.gguf")
sess.set_tts_seed(42)
pcm = sess.synthesize("Hello world.")
sess.write_wav("hello.wav", pcm)
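
A quick sanity check on the result can be done with Python's standard `wave` module. The sketch below assumes the written file is an integer PCM WAV (the `wave` module does not read float WAVs) and that the sample rate matches the 24 kHz stated for the Mimi decoder. `wav_info` is a hypothetical helper, not part of the crispasr package.

```python
import wave


def wav_info(path: str) -> dict:
    """Return sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        frames = wf.getnframes()
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": frames / rate,
    }


if __name__ == "__main__":
    info = wav_info("hello.wav")
    # 24000 is the rate given in the architecture description above.
    if info["sample_rate"] != 24000:
        print("unexpected sample rate:", info["sample_rate"])
    print(f"{info['duration_s']:.2f} s, {info['channels']} channel(s)")
```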

Conversion

Converted with models/convert-pocket-tts-to-gguf.py from the CrispASR repo. The Mimi decoder and SentencePiece tokenizer are embedded in the single GGUF.
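
After downloading or converting, it can be useful to confirm that a file really is a GGUF container before loading it. The sketch below reads only the fixed-size header, assuming the little-endian GGUF layout used since version 2 (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count). `read_gguf_header` is an illustrative helper and is not taken from the CrispASR repo.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Parse the fixed GGUF header fields; raise ValueError if it is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 version, uint64 tensor count, uint64 KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    print(read_gguf_header("pocket-tts-english-novc-f16.gguf"))
```

The tensor and metadata counts cover everything stored in the file, so for this model they include the embedded Mimi decoder weights and tokenizer data alongside the FlowLM backbone.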
