Dia-1.6B — GGUF (ggml)

GGUF / ggml conversion of nari-labs/Dia-1.6B for use with CrispStrobe/CrispASR.

Dia is a dialogue text-to-speech model that generates expressive 44.1 kHz speech from text, with [S1] / [S2] speaker tags:

Text encoder (12-layer, 1024-d, byte-level vocab 256): encodes the prompt bytes.
Audio decoder (18-layer, 2048-d, GQA 16 query / 4 KV heads, classifier-free guidance): autoregressively emits 9 interleaved DAC codebooks under a delay pattern [0,8,9,10,11,12,13,14,15].
DAC codec (44.1 kHz): decodes the 9 codebooks to PCM. Shipped as a separate required companion file.

Released under Apache 2.0.

Files

File	Quant	Size	Notes
`dia-1.6b-f16.gguf`	F16	3.0 GB	Main model — reference quality
`dac-44khz.gguf`	—	104 MB	DAC codec — required companion (download both)

Lower-bit quants (Q8_0 / Q4_K) are not published yet: Dia uses scale=1.0 attention (no 1/√d), which is precision-sensitive, so quants need an ASR-roundtrip check before release.

Quick start

# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli

# 2. Download model + DAC codec
hf download cstr/dia-1.6b-GGUF dia-1.6b-f16.gguf dac-44khz.gguf --local-dir .

# 3. Synthesize (keep the codec beside the model, or pass --codec-model)
./build/bin/crispasr --backend dia -m dia-1.6b-f16.gguf \
    --codec-model dac-44khz.gguf \
    --tts "[S1] Hello there, how are you doing today? I really hope you are having a wonderful and pleasant time." \
    --tts-output hello.wav --seed 42

Or with auto-download (pulls the model + DAC companion):

./build/bin/crispasr -m dia --auto-download \
    --tts "[S1] The quick brown fox jumps over the lazy dog, and then it runs back again." \
    --tts-output fox.wav

Prompt length matters. Dia is inconsistent on very short inputs (it may emit non-speech) — use prompts of at least ~100 characters. Start the text with a [S1] (or [S2]) speaker tag.

Parameters

Parameter	Default	Description
`--seed N`	0	RNG seed (0 = non-deterministic; output varies per seed)
`-tp N`	1.2	Sampling temperature
`--codec-model PATH`	auto	DAC codec GGUF (auto-discovered beside the model)
`--tts-output PATH`	—	Output WAV path (44.1 kHz mono)

Architecture details

Text tokenizer: byte-level (vocab 256); [S1]/[S2] map to bytes 0x01/0x02.
Encoder: 12 layers, 1024-d, 16 heads, head_dim 128, RoPE (NeoX half-split), scale=1.0.
Decoder: 18 layers, 2048-d; self-attn GQA 16q/4kv; cross-attn MHA (16/16) over the encoder; scale=1.0; CFG cond + cfg_scale·(cond − uncond).
Codebooks: 9 DAC channels, delay pattern [0,8,9,…,15], audio vocab 1024.
Codec: Descript Audio Codec (DAC) at 44.1 kHz.

Conversion

python models/convert-dia-to-gguf.py \
    --input nari-labs/Dia-1.6B \
    --output dia-1.6b-f16.gguf

Acknowledgements

nari-labs/dia — original model and inference code
descript/descript-audio-codec — DAC codec

Downloads last month: -

GGUF

Model size

54.3M params

Architecture

dac-44khz

Hardware compatibility

8-bit

16-bit

View +1 variant

Model tree for cstr/dia-1.6b-GGUF

Base model

nari-labs/Dia-1.6B

Quantized

(6)

this model