MOSS-TTSD v0.7 GGUF
GGUF conversion of fnlp/MOSS-TTSD-v0.7, a two-speaker dialogue voice-clone TTS model. Qwen3-1.7B language backbone + XY-Tokenizer 8-codebook audio codec; uses the parallel_heads_delay codec_lm shape (cb-0 is a text+speech merged vocab, cb-1..7 are RVQ audio codes with a per-cb delay shift register).
The model splits into two GGUFs:
- Backbone (moss-ttsd-v0.7-<quant>.gguf): Qwen3-1.7B unwrapped from MossTTSDForCausalLM's wrapper. Stock qwen3 arch, runs in stock llama.cpp with embeddings=true.
- Codec + codec_lm (codec-<quant>.gguf): XY-Tokenizer (24 kHz, 8 RVQ codebooks of 1025 each) bundled with the parallel_heads_delay codec_lm adaptor (per-channel Linear(hidden, vocab_i) heads + audio embed tables + delay shift register).
The parallel_heads_delay kind emits all 8 codebooks per AR step in parallel (no depth decoder); the optional per-cb delay_pattern skews them in time so cb-0 is read first and the others trail by their delay offsets.
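The skew described above can be illustrated with a minimal pure-Python sketch. The delay offsets used here (`[0, 1, 2]`), the pad value, and the helper names `apply_delay` / `remove_delay` are hypothetical and chosen only for illustration; the real per-cb offsets are stored with the codec_lm adaptor and are not listed in this card.

```python
def apply_delay(codes, delays, pad):
    """Skew a (T x C) code matrix: codebook c is shifted later by delays[c] steps."""
    num_steps = len(codes)
    num_cb = len(delays)
    total = num_steps + max(delays)
    out = [[pad] * num_cb for _ in range(total)]
    for t in range(num_steps):
        for c in range(num_cb):
            out[t + delays[c]][c] = codes[t][c]
    return out


def remove_delay(skewed, delays):
    """Inverse of apply_delay: realign every codebook onto a common time axis."""
    num_cb = len(delays)
    num_steps = len(skewed) - max(delays)
    return [[skewed[t + delays[c]][c] for c in range(num_cb)]
            for t in range(num_steps)]


# Toy example with 3 codebooks and 2 time steps (assumed values, not model data).
delays = [0, 1, 2]
codes = [[10, 20, 30],
         [11, 21, 31]]
skewed = apply_delay(codes, delays, pad=-1)
restored = remove_delay(skewed, delays)
```

In the skewed matrix, column 0 starts immediately while columns 1 and 2 begin one and two rows later, which is the "cb-0 first, others trail" behaviour. Undoing the shift recovers the original time-aligned codes.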
Files
Backbone (moss-ttsd-v0.7-<quant>.gguf, Qwen3 arch, 28 layers, hidden 2048)
| File | Size |
|---|---|
| moss-ttsd-v0.7-f32.gguf | 4.5 GB |
| moss-ttsd-v0.7-f16.gguf | 3.3 GB |
| moss-ttsd-v0.7-bf16.gguf | 3.3 GB |
| moss-ttsd-v0.7-q8_0.gguf | 1.8 GB |
| moss-ttsd-v0.7-q6_k.gguf | 1.4 GB |
| moss-ttsd-v0.7-q5_1.gguf | 1.3 GB |
| moss-ttsd-v0.7-q5_k_m.gguf | 1.2 GB |
| moss-ttsd-v0.7-q5_k_s.gguf | 1.2 GB |
| moss-ttsd-v0.7-q5_0.gguf | 1.2 GB |
| moss-ttsd-v0.7-q4_1.gguf | 1.1 GB |
| moss-ttsd-v0.7-q4_k_m.gguf | 1.1 GB |
| moss-ttsd-v0.7-q4_k_s.gguf | 1010 MB |
| moss-ttsd-v0.7-q4_0.gguf | 1005 MB |
| moss-ttsd-v0.7-q3_k_l.gguf | 956 MB |
| moss-ttsd-v0.7-q3_k_m.gguf | 895 MB |
| moss-ttsd-v0.7-q3_k_s.gguf | 826 MB |
| moss-ttsd-v0.7-q2_k.gguf | 741 MB |
Codec + codec_lm (codec-<quant>.gguf, XY-Tokenizer + parallel_heads_delay adaptor)
| File | Size |
|---|---|
| codec-f32.gguf | 2.6 GB |
| codec-f16.gguf | 1.7 GB |
| codec-q8_0.gguf | 1.2 GB |
| codec-q5_k_m.gguf | 1.1 GB |
| codec-q4_k_m.gguf | 994 MB |
Inference shape (text-modality codec_lm parallel-heads)
backbone (Qwen3, embeddings=true) hidden h
→ caller samples text token via llama_get_logits_ith → text_tok
→ codec_lm_state_set_text_context(state, text_tok)
→ codec_lm_step_begin(state, h) // emits all 8 cb logits in parallel
→ for cb in 0..7: codec_lm_step_logits → sample (with allowed_token_range) → push
→ codec_lm_step_finish → codes[8]
→ codec_lm_compose_audio_embd(codes) → next-step embedding
→ feed via b.embd; loop until EOS
cb-0 is a merged text+speech vocabulary; the caller is expected to mask
its logits to [speech_start, speech_end) via the allowed_token_range
hook before sampling, otherwise free-running F16 backbones can flip into
text-space and hang.
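As a rough illustration of that masking step, the sketch below restricts a logit vector to a half-open range before greedy sampling. It is plain Python, not the runtime's allowed_token_range hook. The helper names, the toy 8-entry vocabulary, and the boundary values 4 and 8 are assumptions made for the example; the real speech_start and speech_end come from the model's merged vocabulary.

```python
import math


def mask_to_range(logits, start, end):
    """Copy logits, setting every id outside [start, end) to -inf."""
    return [x if start <= i < end else -math.inf
            for i, x in enumerate(logits)]


def greedy_sample(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy merged vocab (assumed layout): ids 0..3 are text, ids 4..7 are speech.
logits = [5.0, 1.0, 0.5, 0.2, 0.1, 2.0, 1.5, 0.3]
speech_start, speech_end = 4, 8

unmasked_pick = greedy_sample(logits)
masked_pick = greedy_sample(mask_to_range(logits, speech_start, speech_end))
```

Without the mask the highest logit sits in the text region, so the sampler would leave speech-space. With the mask applied, only speech ids remain eligible.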
Output: slice cb-0 out of the (T × 8) matrix and feed cb-1..7 into
codec_decode for 24 kHz PCM.
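The slicing step could look like the following sketch, which treats the generated codes as a list of T rows with 8 entries each. The helper name `split_codes` and the toy values are hypothetical; the call into codec_decode is left out because its signature is not shown in this card.

```python
def split_codes(codes):
    """Split a (T x 8) matrix into the cb-0 column and the cb-1..7 columns."""
    cb0 = [row[0] for row in codes]
    audio = [row[1:] for row in codes]
    return cb0, audio


# Toy (3 x 8) matrix with made-up code values.
codes = [[100 + t] + [t * 10 + c for c in range(1, 8)] for t in range(3)]
cb0, audio = split_codes(codes)
```

Here `audio` is the (T × 7) matrix that would be handed to the codec decoder, while `cb0` holds the merged text+speech column that is dropped from the audio path.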
Voice cloning
MOSS-TTSD is fundamentally a two-speaker dialogue voice-clone model.
Both speakers are conditioned via a [S1]…[S2]… prefix in the prompt
plus precomputed reference codes. The processor / speaker encoder
that produces these references is not bundled here; consume the
upstream processor or a precompiled speaker artefact.
Sources
- Upstream model: fnlp/MOSS-TTSD-v0.7
- Audio codec: fnlp/XY_Tokenizer_TTSD_V0_hf
- Conversion tooling: mybigday/codec.cpp (prep_moss_ttsd + lm_adaptor/moss_ttsd.py)
- Inference runtime: mybigday/llama.rn