Paper: Continuous Audio Language Models (arXiv:2509.06926)
How to use cstr/pocket-tts-GGUF with Pocket-TTS:
from pocket_tts import TTSModel
import scipy.io.wavfile
tts_model = TTSModel.load_model("cstr/pocket-tts-GGUF")
voice_state = tts_model.get_state_for_audio_prompt(
"hf://kyutai/tts-voices/alba-mackenna/casual.wav"
)
audio = tts_model.generate_audio(voice_state, "Hello world, this is a test.")
# Audio is a 1D torch tensor containing PCM data.
scipy.io.wavfile.write("output.wav", tts_model.sample_rate, audio.numpy())

GGUF / ggml conversion of kyutai/pocket-tts for use with CrispStrobe/CrispASR.
Pocket TTS is a lightweight (~100M parameter) continuous-latent autoregressive TTS model from Kyutai, based on the CALM paper (arXiv:2509.06926). Unlike codebook-based TTS models, Pocket TTS emits continuous float vectors, with no discrete tokens and no softmax sampling.
Single GGUF file: no separate codec companion is needed (Mimi decoder weights are embedded).
Released under the CC-BY-4.0 license.
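The contrast between codebook-based and continuous-latent generation can be illustrated with a toy sketch. This is not Pocket TTS code: the function names are hypothetical, and the Gaussian perturbation is only a placeholder for whatever sampling head the model actually uses. It shows just the difference in output type, a sampled integer token id versus a float vector.

```python
import math
import random

def sample_discrete_token(logits, rng):
    """Codebook-style step: softmax over logits, then sample one token id."""
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]  # stable softmax numerators
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def emit_continuous_latent(mean, noise_scale, rng):
    """Continuous-latent step: return a float vector, with no vocabulary lookup.

    The Gaussian noise is a stand-in (assumption), not the model's real head.
    """
    return [m + noise_scale * rng.gauss(0.0, 1.0) for m in mean]

rng = random.Random(42)
token_id = sample_discrete_token([0.1, 2.0, -1.0, 0.5], rng)
latent = emit_continuous_latent([0.0, 0.5, -0.5], noise_scale=0.1, rng=rng)

print(type(token_id).__name__, token_id)
print([round(v, 3) for v in latent])
```

In the discrete case the result must be mapped back to audio through a codebook; in the continuous case the vector can be passed to a decoder directly.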
| File | Quant | Size | Notes |
|---|---|---|---|
| pocket-tts-english-novc-f16.gguf | F16 | 381 MB | Full precision; reference quality |
| pocket-tts-english-novc-q4_k.gguf | Q4_K | 60 MB | Smallest; fast inference |
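As a rough sketch of how one might choose between the listed files programmatically, the snippet below picks the largest file that fits a size budget. `FILES`, `pick_file`, and `download` are hypothetical helpers, with names and sizes copied from the table. The `download` function assumes the `huggingface_hub` package is installed and is not called here.

```python
# File names and sizes (MB) copied from the table above.
FILES = {
    "f16": ("pocket-tts-english-novc-f16.gguf", 381),
    "q4_k": ("pocket-tts-english-novc-q4_k.gguf", 60),
}

def pick_file(max_size_mb):
    """Return the largest listed file that fits within max_size_mb, or None."""
    fitting = [(size, name) for name, size in FILES.values() if size <= max_size_mb]
    return max(fitting)[1] if fitting else None

def download(filename, local_dir="."):
    """Fetch one file from the repo (requires the huggingface_hub package)."""
    from huggingface_hub import hf_hub_download  # imported lazily on purpose
    return hf_hub_download(
        repo_id="cstr/pocket-tts-GGUF", filename=filename, local_dir=local_dir
    )

print(pick_file(100))
print(pick_file(500))
```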
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli
# 2. Download model
huggingface-cli download cstr/pocket-tts-GGUF pocket-tts-english-novc-f16.gguf --local-dir .
# 3. Synthesize
./build/bin/crispasr --backend pocket-tts -m pocket-tts-english-novc-f16.gguf \
--tts "Hello, how are you today?" \
--tts-output hello.wav --seed 42
Or with auto-download:
./build/bin/crispasr -m pocket-tts --auto-download \
--tts "The quick brown fox jumps over the lazy dog." \
--tts-output fox.wav
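For scripting, the same CLI call can be driven from Python. The following is a minimal sketch that uses only the flags shown in the commands above (`--backend`, `-m`, `--tts`, `--tts-output`, `--seed`). The helper names `build_tts_command` and `synthesize` are hypothetical, and the binary path assumes the build layout from step 1.

```python
import subprocess

def build_tts_command(binary, model, text, output, seed=None, backend=None):
    """Assemble the argument list for the CLI invocation shown above."""
    cmd = [binary]
    if backend is not None:
        cmd += ["--backend", backend]
    cmd += ["-m", model, "--tts", text, "--tts-output", output]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd

def synthesize(*args, **kwargs):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tts_command(*args, **kwargs), check=True)

cmd = build_tts_command(
    "./build/bin/crispasr",
    "pocket-tts-english-novc-f16.gguf",
    "Hello, how are you today?",
    "hello.wav",
    seed=42,
    backend="pocket-tts",
)
print(cmd)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the text contains quotes or other shell metacharacters.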
from crispasr import Session
sess = Session("pocket-tts-english-novc-f16.gguf")
sess.set_tts_seed(42)
pcm = sess.synthesize("Hello world.")
sess.write_wav("hello.wav", pcm)
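To sanity-check a generated file such as `hello.wav`, one option is to read its header with Python's standard `wave` module. The sketch below assumes the output is integer PCM (the `wave` module does not read float-encoded WAV files). `wav_info` is a hypothetical helper, and the demo writes its own silent file so it runs without the model. The 24000 Hz rate is used only for the demo and is not a claim about the model's output rate.

```python
import wave

def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate

# Self-contained demo: write one second of 16-bit mono silence, then inspect it.
with wave.open("demo_silence.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wav.setframerate(24000)   # demo value only (assumption)
    wav.writeframes(b"\x00\x00" * 24000)

print(wav_info("demo_silence.wav"))
```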
Converted with models/convert-pocket-tts-to-gguf.py from the CrispASR repo. The Mimi decoder and SentencePiece tokenizer are embedded in the single GGUF.
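A quick way to confirm that a downloaded file is a GGUF container is to read its fixed-size header. The sketch below assumes the little-endian GGUF layout used from version 2 onward: a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata key/value counts. `read_gguf_header` is a hypothetical helper, and the demo parses a synthetic header rather than the real model file.

```python
import struct

def read_gguf_header(path):
    """Read the GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 version, uint64 tensor count, uint64 KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

# Demo on a synthetic header so the snippet runs without the real model file.
with open("fake_header.gguf", "wb") as f:
    f.write(b"GGUF" + struct.pack("<IQQ", 3, 10, 5))

print(read_gguf_header("fake_header.gguf"))
```

Pointing the function at `pocket-tts-english-novc-f16.gguf` instead would report the actual tensor and metadata counts. The embedded tokenizer and decoder weights live in the sections that follow this header.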
Base model: kyutai/pocket-tts