MOSS-TTSD v0.7 GGUF

GGUF conversion of fnlp/MOSS-TTSD-v0.7, a two-speaker dialogue voice-clone TTS model. It pairs a Qwen3-1.7B language backbone with the XY-Tokenizer 8-codebook audio codec and uses the parallel_heads_delay codec_lm shape: cb-0 is a merged text+speech vocabulary, and cb-1..7 are RVQ audio codes with a per-cb delay shift register.

The model splits into two GGUFs:

Backbone (moss-ttsd-v0.7-<quant>.gguf): Qwen3-1.7B extracted from the MossTTSDForCausalLM wrapper. It uses the stock qwen3 architecture and runs in unmodified llama.cpp with embeddings=true.
Codec + codec_lm (codec-<quant>.gguf): XY-Tokenizer (24 kHz, 8 RVQ codebooks of 1025 entries each) bundled with the parallel_heads_delay codec_lm adaptor (per-channel Linear(hidden, vocab_i) heads, audio embedding tables, and the delay shift register).

The parallel_heads_delay kind emits all 8 codebooks per AR step in parallel (no depth decoder); the optional per-cb delay_pattern skews them in time so cb-0 is read first and the others trail by their delay offsets.
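
To make the time skew concrete, the sketch below applies and then undoes a per-codebook delay on a small (T × 8) code matrix. It is an illustration only: the offsets (cb-k delayed by k steps), the PAD filler value, and the function names are assumptions, not values or APIs taken from the GGUF metadata or codec.cpp.

```python
PAD = -1  # hypothetical filler for slots that hold no code yet


def apply_delay(codes, delays, pad=PAD):
    """Shift column cb of a (T x n_cb) code matrix down by delays[cb] rows."""
    n_cb = len(delays)
    out = [[pad] * n_cb for _ in range(len(codes) + max(delays))]
    for t, frame in enumerate(codes):
        for cb, code in enumerate(frame):
            out[t + delays[cb]][cb] = code
    return out


def undo_delay(delayed, delays):
    """Inverse of apply_delay: realign every column to its original frame."""
    n_frames = len(delayed) - max(delays)
    return [
        [delayed[t + delays[cb]][cb] for cb in range(len(delays))]
        for t in range(n_frames)
    ]


delays = list(range(8))  # assumed offsets: cb-0 undelayed, cb-k trails by k steps
codes = [[100 * t + cb for cb in range(8)] for t in range(3)]  # toy codes
delayed = apply_delay(codes, delays)
restored = undo_delay(delayed, delays)
```

With these assumed offsets, row 0 of `delayed` holds only the cb-0 code of frame 0, and the cb-7 code of that frame appears seven rows later.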

Files

Backbone (`moss-ttsd-v0.7-<quant>.gguf`, Qwen3 arch — 28 layers, hidden 2048)

File	Size
`moss-ttsd-v0.7-f32.gguf`	4.5 GB
`moss-ttsd-v0.7-f16.gguf`	3.3 GB
`moss-ttsd-v0.7-bf16.gguf`	3.3 GB
`moss-ttsd-v0.7-q8_0.gguf`	1.8 GB
`moss-ttsd-v0.7-q6_k.gguf`	1.4 GB
`moss-ttsd-v0.7-q5_1.gguf`	1.3 GB
`moss-ttsd-v0.7-q5_k_m.gguf`	1.2 GB
`moss-ttsd-v0.7-q5_k_s.gguf`	1.2 GB
`moss-ttsd-v0.7-q5_0.gguf`	1.2 GB
`moss-ttsd-v0.7-q4_1.gguf`	1.1 GB
`moss-ttsd-v0.7-q4_k_m.gguf`	1.1 GB
`moss-ttsd-v0.7-q4_k_s.gguf`	1010 MB
`moss-ttsd-v0.7-q4_0.gguf`	1005 MB
`moss-ttsd-v0.7-q3_k_l.gguf`	956 MB
`moss-ttsd-v0.7-q3_k_m.gguf`	895 MB
`moss-ttsd-v0.7-q3_k_s.gguf`	826 MB
`moss-ttsd-v0.7-q2_k.gguf`	741 MB

Codec + codec_lm (`codec-<quant>.gguf`, XY-Tokenizer + `parallel_heads_delay` adaptor)

File	Size
`codec-f32.gguf`	2.6 GB
`codec-f16.gguf`	1.7 GB
`codec-q8_0.gguf`	1.2 GB
`codec-q5_k_m.gguf`	1.1 GB
`codec-q4_k_m.gguf`	994 MB
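
Since the backbone and the codec ship as separate files with different quant lists, a small helper can catch a mistyped or unavailable combination before any download starts. The sketch below only builds file names from the two tables above; the helper name gguf_pair is hypothetical, and mixing quant levels between the two files is assumed to be allowed.

```python
# Quant suffixes copied from the two file tables above.
BACKBONE_QUANTS = {
    "f32", "f16", "bf16", "q8_0", "q6_k", "q5_1", "q5_k_m", "q5_k_s", "q5_0",
    "q4_1", "q4_k_m", "q4_k_s", "q4_0", "q3_k_l", "q3_k_m", "q3_k_s", "q2_k",
}
CODEC_QUANTS = {"f32", "f16", "q8_0", "q5_k_m", "q4_k_m"}


def gguf_pair(backbone_quant, codec_quant):
    """Return (backbone_file, codec_file) for a listed pair of quant levels."""
    if backbone_quant not in BACKBONE_QUANTS:
        raise ValueError(f"no backbone file for quant {backbone_quant!r}")
    if codec_quant not in CODEC_QUANTS:
        raise ValueError(f"no codec file for quant {codec_quant!r}")
    return (
        f"moss-ttsd-v0.7-{backbone_quant}.gguf",
        f"codec-{codec_quant}.gguf",
    )


backbone_file, codec_file = gguf_pair("q4_k_m", "q8_0")
```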

Inference shape (text-modality codec_lm parallel-heads)

backbone (Qwen3, embeddings=true) hidden h
    → caller samples text token via llama_get_logits_ith → text_tok
    → codec_lm_state_set_text_context(state, text_tok)
    → codec_lm_step_begin(state, h)          // emits all 8 cb logits in parallel
    → for cb in 0..7: codec_lm_step_logits → sample (with allowed_token_range) → push
    → codec_lm_step_finish → codes[8]
    → codec_lm_compose_audio_embd(codes) → next-step embedding
    → feed via b.embd; loop until EOS

cb-0 uses a merged text+speech vocabulary. The caller is expected to mask its logits to [speech_start, speech_end) via the allowed_token_range hook before sampling; otherwise a free-running F16 backbone can switch to sampling text tokens and hang.
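
The masking step can be sketched as follows. This is a stdlib-only Python illustration, not the codec.cpp API: mask_to_range, sample, the six-entry toy vocabulary, and the speech bounds are all hypothetical.

```python
import math
import random


def mask_to_range(logits, start, end):
    """Copy logits, setting every index outside [start, end) to -inf."""
    if not 0 <= start < end <= len(logits):
        raise ValueError("allowed range must be a non-empty slice of the vocab")
    return [x if start <= i < end else -math.inf for i, x in enumerate(logits)]


def sample(logits, temperature=1.0, rng=random):
    """Softmax sampling; entries at -inf get weight 0 and are never drawn."""
    peak = max(logits)
    weights = [math.exp((x - peak) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Toy cb-0 logits: indices 0..1 and 5 stand in for text tokens,
# indices 2..4 for speech tokens (assumed layout).
logits = [5.0, 4.0, 0.1, 0.2, 0.3, 3.0]
speech_start, speech_end = 2, 5
masked = mask_to_range(logits, speech_start, speech_end)
token = sample(masked, rng=random.Random(0))
```

Even though the text entries carry the largest raw logits here, the sampled token always falls inside the speech range once the mask is applied.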

Output: drop the cb-0 column from the (T × 8) code matrix and pass cb-1..7 to codec_decode to obtain 24 kHz PCM.
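
As a minimal sketch of that output step, assuming the generated codes are held as T rows of 8 integers, the cb-0 column can be separated from the audio columns like this. The helper name split_codes is hypothetical, and the call into codec_decode itself is not shown.

```python
def split_codes(codes, n_cb=8):
    """Split a (T x n_cb) code matrix into the cb-0 column and cb-1..n_cb-1."""
    for row in codes:
        if len(row) != n_cb:
            raise ValueError(f"expected {n_cb} codes per frame, got {len(row)}")
    cb0 = [row[0] for row in codes]
    audio = [row[1:] for row in codes]  # T rows of n_cb - 1 codes each
    return cb0, audio


# Toy matrix with T = 2 frames.
codes = [
    [900, 11, 12, 13, 14, 15, 16, 17],
    [901, 21, 22, 23, 24, 25, 26, 27],
]
cb0, audio = split_codes(codes)
```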

Voice cloning

MOSS-TTSD is fundamentally a two-speaker dialogue voice-clone model. Both speakers are conditioned via a [S1]…[S2]… prefix in the prompt plus precomputed reference codes. The processor / speaker encoder that produces these references is not bundled here; use the upstream processor or a precompiled speaker artefact.
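
For orientation, a speaker-tagged transcript could be assembled as in the sketch below. It covers only the text side of the prompt: the exact template the upstream processor expects is not specified here, so the tag-then-text concatenation and the helper name format_dialogue are assumptions, and the reference codes still have to come from the upstream processor.

```python
def format_dialogue(turns):
    """Join (speaker, text) turns into one [S1]/[S2]-tagged string."""
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError("MOSS-TTSD is a two-speaker model: use 1 or 2")
        parts.append(f"[S{speaker}]{text.strip()}")
    return "".join(parts)


prompt_text = format_dialogue([
    (1, "Did you try the new build?"),
    (2, "Not yet, I will test it tonight."),
])
```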

Sources

Upstream model: fnlp/MOSS-TTSD-v0.7
Audio codec: fnlp/XY_Tokenizer_TTSD_V0_hf
Conversion tooling: mybigday/codec.cpp (prep_moss_ttsd + lm_adaptor/moss_ttsd.py)
Inference runtime: mybigday/llama.rn
