Dia-1.6B - GGUF (ggml)
GGUF / ggml conversion of nari-labs/Dia-1.6B for use with CrispStrobe/CrispASR.
Dia is a dialogue text-to-speech model that generates expressive 44.1 kHz speech from text marked with [S1] / [S2] speaker tags. It has three components:
- Text encoder (12-layer, 1024-d, byte-level vocab 256): encodes the prompt bytes.
- Audio decoder (18-layer, 2048-d, GQA 16 query / 4 KV heads, classifier-free guidance): autoregressively emits 9 interleaved DAC codebooks under a delay pattern `[0,8,9,10,11,12,13,14,15]`.
- DAC codec (44.1 kHz): decodes the 9 codebooks to PCM. Shipped as a separate required companion file.
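The delay pattern can be pictured as shifting each codebook channel to the right by its own number of steps. The following is a minimal sketch of that idea in plain Python, not the CrispASR or nari-labs implementation; the function names and the `PAD` placeholder are hypothetical, and the real model uses its own special tokens for positions that have no code yet.

```python
# Per-channel delays as listed in this model card.
DELAY_PATTERN = [0, 8, 9, 10, 11, 12, 13, 14, 15]
PAD = -1  # hypothetical placeholder meaning "no code emitted here"


def apply_delay(frames, delays=DELAY_PATTERN, pad=PAD):
    """Shift channel c right by delays[c] steps.

    frames: list of T time-aligned frames, each a list of len(delays) codes.
    Returns T + max(delays) delayed steps, padded where no code exists.
    """
    total = len(frames) + max(delays)
    out = [[pad] * len(delays) for _ in range(total)]
    for t, frame in enumerate(frames):
        for c, d in enumerate(delays):
            out[t + d][c] = frame[c]
    return out


def revert_delay(delayed, delays=DELAY_PATTERN):
    """Undo apply_delay so that all channels are time-aligned again."""
    n_frames = len(delayed) - max(delays)
    return [[delayed[t + d][c] for c, d in enumerate(delays)]
            for t in range(n_frames)]


if __name__ == "__main__":
    # Four toy frames; code value = 100 * channel + time step.
    frames = [[100 * c + t for c in range(9)] for t in range(4)]
    delayed = apply_delay(frames)
    assert revert_delay(delayed) == frames
    for step in delayed[:3]:
        print(step)
```

Under this reading, the decoder emits one delayed step at a time, and the delay has to be reverted before the codes are handed to the DAC codec.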
Released under Apache 2.0.
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| `dia-1.6b-f16.gguf` | F16 | 3.0 GB | Main model, reference quality |
| `dac-44khz.gguf` | - | 104 MB | DAC codec, required companion (download both) |
Lower-bit quants (Q8_0 / Q4_K) are not published yet: Dia uses `scale=1.0` attention (no 1/√d factor), which is precision-sensitive, so quants need an ASR round-trip check before release.
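To make the `scale=1.0` remark concrete, here is a toy comparison of attention weights with and without the conventional 1/√d factor. It is an illustration only, assuming a head dimension of 128 as listed for the encoder; the vectors are made up and do not reproduce Dia's actual activations. The point it shows is that any error in the raw dot product q·k reaches the softmax unscaled when `scale=1.0`, whereas the conventional factor would shrink it by √d.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys, scale):
    """Softmax over scale * (q . k) for each key vector k."""
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    return softmax(logits)


if __name__ == "__main__":
    head_dim = 128
    q = [0.1] * head_dim                       # made-up query
    keys = [[0.1] * head_dim, [0.0] * head_dim]  # made-up keys

    unscaled = attention_weights(q, keys, scale=1.0)
    conventional = attention_weights(q, keys, scale=1.0 / math.sqrt(head_dim))
    print("scale=1.0    :", unscaled)
    print("scale=1/sqrt :", conventional)

    # A dot-product error of delta shifts the logit by scale * delta, so the
    # unscaled variant sees an error larger by this factor:
    print("error amplification:", math.sqrt(head_dim))
```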
Quick start
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr-cli
# 2. Download model + DAC codec
hf download cstr/dia-1.6b-GGUF dia-1.6b-f16.gguf dac-44khz.gguf --local-dir .
# 3. Synthesize (keep the codec beside the model, or pass --codec-model)
./build/bin/crispasr --backend dia -m dia-1.6b-f16.gguf \
--codec-model dac-44khz.gguf \
--tts "[S1] Hello there, how are you doing today? I really hope you are having a wonderful and pleasant time." \
--tts-output hello.wav --seed 42
Or with auto-download (pulls the model + DAC companion):
./build/bin/crispasr -m dia --auto-download \
--tts "[S1] The quick brown fox jumps over the lazy dog, and then it runs back again." \
--tts-output fox.wav
Prompt length matters. Dia is inconsistent on very short inputs (it may emit non-speech), so use prompts of at least ~100 characters. Start the text with a `[S1]` (or `[S2]`) speaker tag.
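The guidance above can be checked before calling the CLI. The helper below is a small sketch, not part of CrispASR; `check_prompt` and `MIN_CHARS` are hypothetical names, and counting the speaker tag toward the length is an assumption, since the ~100 character figure is only approximate.

```python
MIN_CHARS = 100  # soft threshold taken from the "~100 characters" guidance
SPEAKER_TAGS = ("[S1]", "[S2]")


def check_prompt(text, min_chars=MIN_CHARS):
    """Return a list of warnings; an empty list means the prompt looks fine."""
    warnings = []
    stripped = text.strip()
    if not stripped.startswith(SPEAKER_TAGS):
        warnings.append("prompt should start with [S1] or [S2]")
    if len(stripped) < min_chars:
        warnings.append(
            f"prompt has {len(stripped)} characters; "
            f"at least ~{min_chars} is recommended"
        )
    return warnings


if __name__ == "__main__":
    for prompt in ("[S1] Hi.", "Hello there."):
        print(repr(prompt), "->", check_prompt(prompt))
```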
Parameters
| Parameter | Default | Description |
|---|---|---|
| `--seed N` | 0 | RNG seed (0 = non-deterministic; output varies per seed) |
| `-tp N` | 1.2 | Sampling temperature |
| `--codec-model PATH` | auto | DAC codec GGUF (auto-discovered beside the model) |
| `--tts-output PATH` | - | Output WAV path (44.1 kHz mono) |
Architecture details
- Text tokenizer: byte-level (vocab 256); `[S1]` / `[S2]` map to bytes `0x01` / `0x02`.
- Encoder: 12 layers, 1024-d, 16 heads, head_dim 128, RoPE (NeoX half-split), `scale=1.0`.
- Decoder: 18 layers, 2048-d; self-attn GQA 16q/4kv; cross-attn MHA (16/16) over the encoder; `scale=1.0`; CFG `cond + cfg_scale·(cond − uncond)`.
- Codebooks: 9 DAC channels, delay pattern `[0,8,9,…,15]`, audio vocab 1024.
- Codec: Descript Audio Codec (DAC) at 44.1 kHz.
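Two of the items above are simple enough to sketch in a few lines of Python: the byte-level tokenizer with its speaker-tag mapping, and the classifier-free guidance combination. This is an illustrative sketch rather than the converter or runtime code. The function names are hypothetical, and applying the CFG formula element-wise to per-token logits is an assumption; the card only gives the formula itself.

```python
def encode_prompt(text):
    """Byte-level encoding: UTF-8 bytes, with [S1] -> 0x01 and [S2] -> 0x02."""
    data = text.encode("utf-8")
    data = data.replace(b"[S1]", b"\x01").replace(b"[S2]", b"\x02")
    return list(data)  # every id is in 0..255, matching a vocab of 256


def cfg_combine(cond, uncond, cfg_scale):
    """Classifier-free guidance: cond + cfg_scale * (cond - uncond).

    cond / uncond: equally long lists of floats (assumed here to be logits
    from the conditional and unconditional decoder passes).
    """
    return [c + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    print(encode_prompt("[S1] Hi"))
    print(cfg_combine([1.0, 2.0], [0.5, 2.0], cfg_scale=3.0))
```

With `cfg_scale = 0` the result equals the conditional values, and larger values push the output further away from the unconditional pass.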
Conversion
python models/convert-dia-to-gguf.py \
--input nari-labs/Dia-1.6B \
--output dia-1.6b-f16.gguf
Acknowledgements
- nari-labs/dia: original model and inference code
- descript/descript-audio-codec: DAC codec