MOSS-TTSD v0.7 GGUF

GGUF conversion of fnlp/MOSS-TTSD-v0.7, a two-speaker dialogue voice-clone TTS model. It pairs a Qwen3-1.7B language backbone with the XY-Tokenizer 8-codebook audio codec and uses the parallel_heads_delay codec_lm shape: cb-0 is a merged text+speech vocabulary, and cb-1..7 are RVQ audio codes with a per-codebook delay shift register.

The model splits into two GGUFs:

  • Backbone (moss-ttsd-v0.7-<quant>.gguf): Qwen3-1.7B extracted from the MossTTSDForCausalLM wrapper. It uses the stock qwen3 architecture and runs in unmodified llama.cpp with embeddings=true.
  • Codec + codec_lm (codec-<quant>.gguf): XY-Tokenizer (24 kHz, 8 RVQ codebooks of 1025 entries each) bundled with the parallel_heads_delay codec_lm adaptor (per-channel Linear(hidden, vocab_i) heads, audio embedding tables, and the delay shift register).
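
The per-channel Linear(hidden, vocab_i) heads mentioned in the list above can be pictured with a toy example. This is a minimal sketch in plain Python, not the adaptor's actual code: the names `linear` and `parallel_heads`, the tiny dimensions, and the weights are all made up for illustration. The only point it shows is that every head reads the same backbone hidden state and produces its own logit vector, whose length may differ per codebook.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]  # one row per vocabulary entry, one column per hidden unit


def linear(weight: Matrix, bias: Vector, h: Vector) -> Vector:
    """Compute weight @ h + bias for a single head."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]


def parallel_heads(heads: List[Tuple[Matrix, Vector]], h: Vector) -> List[Vector]:
    """Apply every per-codebook head to the same hidden state h."""
    return [linear(weight, bias, h) for weight, bias in heads]


# Toy setup (hypothetical numbers): hidden size 2, two heads with vocab sizes 3 and 2.
h = [1.0, 2.0]
heads = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.5]),
    ([[2.0, 0.0], [0.0, -1.0]], [0.0, 1.0]),
]
logits = parallel_heads(heads, h)
print(logits)
```

Because each head is independent, all codebooks can be scored in one step from a single hidden vector, with no second decoder stacked on top.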

The parallel_heads_delay kind emits all 8 codebooks per AR step in parallel (no depth decoder); the optional per-cb delay_pattern skews them in time so cb-0 is read first and the others trail by their delay offsets.
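
To make the time skew concrete, here is a small sketch of applying and undoing a per-codebook delay on a (T × C) code matrix. It assumes a simple "shift channel c later by delays[c] steps" layout; the function names, the `PAD` filler value, and the example delays are hypothetical and are not taken from the GGUF metadata or the codec_lm implementation.

```python
from typing import List, Sequence

PAD = -1  # hypothetical filler for slots that hold no code yet


def apply_delay(codes: Sequence[Sequence[int]], delays: Sequence[int],
                pad: int = PAD) -> List[List[int]]:
    """Shift channel c later in time by delays[c] steps, padding the gaps."""
    total_steps = len(codes) + max(delays)
    skewed = [[pad] * len(delays) for _ in range(total_steps)]
    for t, frame in enumerate(codes):
        for c, d in enumerate(delays):
            skewed[t + d][c] = frame[c]
    return skewed


def undo_delay(skewed: Sequence[Sequence[int]], delays: Sequence[int]) -> List[List[int]]:
    """Realign a skewed matrix so each row is one time-aligned frame again."""
    n_steps = len(skewed) - max(delays)
    return [[skewed[t + d][c] for c, d in enumerate(delays)] for t in range(n_steps)]


# Three channels with illustrative delays 0, 1, 2 and two frames of codes.
codes = [[10, 20, 30], [11, 21, 31]]
delays = [0, 1, 2]
skewed = apply_delay(codes, delays)
for row in skewed:
    print(row)
restored = undo_delay(skewed, delays)
```

In the skewed matrix, channel 0 is available at step 0 while the later channels only fill in after their offsets, which matches the "cb-0 first, others trail" description.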

Files

Backbone (moss-ttsd-v0.7-<quant>.gguf, Qwen3 arch: 28 layers, hidden size 2048)

File                        Size
moss-ttsd-v0.7-f32.gguf     4.5 GB
moss-ttsd-v0.7-f16.gguf     3.3 GB
moss-ttsd-v0.7-bf16.gguf    3.3 GB
moss-ttsd-v0.7-q8_0.gguf    1.8 GB
moss-ttsd-v0.7-q6_k.gguf    1.4 GB
moss-ttsd-v0.7-q5_1.gguf    1.3 GB
moss-ttsd-v0.7-q5_k_m.gguf  1.2 GB
moss-ttsd-v0.7-q5_k_s.gguf  1.2 GB
moss-ttsd-v0.7-q5_0.gguf    1.2 GB
moss-ttsd-v0.7-q4_1.gguf    1.1 GB
moss-ttsd-v0.7-q4_k_m.gguf  1.1 GB
moss-ttsd-v0.7-q4_k_s.gguf  1010 MB
moss-ttsd-v0.7-q4_0.gguf    1005 MB
moss-ttsd-v0.7-q3_k_l.gguf  956 MB
moss-ttsd-v0.7-q3_k_m.gguf  895 MB
moss-ttsd-v0.7-q3_k_s.gguf  826 MB
moss-ttsd-v0.7-q2_k.gguf    741 MB

Codec + codec_lm (codec-<quant>.gguf, XY-Tokenizer + parallel_heads_delay adaptor)

File               Size
codec-f32.gguf     2.6 GB
codec-f16.gguf     1.7 GB
codec-q8_0.gguf    1.2 GB
codec-q5_k_m.gguf  1.1 GB
codec-q4_k_m.gguf  994 MB

Inference shape (text-modality codec_lm parallel-heads)

backbone (Qwen3, embeddings=true) hidden h
    → caller samples text token via llama_get_logits_ith → text_tok
    → codec_lm_state_set_text_context(state, text_tok)
    → codec_lm_step_begin(state, h)          // emits all 8 cb logits in parallel
    → for cb in 0..7: codec_lm_step_logits → sample (with allowed_token_range) → push
    → codec_lm_step_finish → codes[8]
    → codec_lm_compose_audio_embd(codes) → next-step embedding
    → feed via b.embd; loop until EOS

cb-0 is a merged text+speech vocabulary, so the caller is expected to mask its logits to [speech_start, speech_end) via the allowed_token_range hook before sampling. Without the mask, a free-running F16 backbone can flip into text-space and hang.
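
The masking step described above amounts to setting every logit outside the half-open range to negative infinity before sampling. The sketch below shows that idea on a plain Python list. It is not the allowed_token_range hook itself; `mask_to_range`, `greedy`, and the toy logits and range bounds are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def mask_to_range(logits: Sequence[float], start: int, end: int) -> List[float]:
    """Return a copy of logits with everything outside [start, end) set to -inf."""
    if not 0 <= start < end <= len(logits):
        raise ValueError("range must satisfy 0 <= start < end <= len(logits)")
    return [x if start <= i < end else -math.inf for i, x in enumerate(logits)]


def greedy(logits: Sequence[float]) -> int:
    """Pick the index of the largest logit (stand-in for a real sampler)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy merged vocabulary: indices 1..3 play the role of the speech range,
# while indices 0 and 4 stand in for text tokens with large logits.
raw = [5.0, 0.1, 0.7, 0.3, 9.0]
masked = mask_to_range(raw, start=1, end=4)
choice = greedy(masked)
print(choice)
```

Any sampler (temperature, top-k, top-p) applied after the mask can only pick a token inside the speech range, because the masked entries carry zero probability after softmax.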

Output: slice cb-0 out of the (T × 8) matrix and feed cb-1..7 into codec_decode for 24 kHz PCM.
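
Read literally, the output step drops the first column of the code matrix and keeps the remaining seven. A minimal sketch of that slicing, assuming the codes arrive as a list of 8-element frames, is shown here; `drop_cb0` is a hypothetical helper and the call into codec_decode is not shown because its signature is not given in this card.

```python
from typing import List, Sequence

NUM_CODEBOOKS = 8  # cb-0 plus cb-1..7, as described above


def drop_cb0(codes: Sequence[Sequence[int]]) -> List[List[int]]:
    """Remove column 0 from a (T x 8) code matrix, leaving a (T x 7) matrix."""
    for t, frame in enumerate(codes):
        if len(frame) != NUM_CODEBOOKS:
            raise ValueError(f"frame {t} has {len(frame)} codes, expected {NUM_CODEBOOKS}")
    return [list(frame[1:]) for frame in codes]


# Two illustrative frames; the numbers are arbitrary.
frames = [
    [900, 1, 2, 3, 4, 5, 6, 7],
    [901, 11, 12, 13, 14, 15, 16, 17],
]
audio_codes = drop_cb0(frames)
print(audio_codes)
```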

Voice cloning

MOSS-TTSD is fundamentally a two-speaker dialogue voice-clone model. Both speakers are conditioned via a [S1]…[S2]… prefix in the prompt plus precomputed reference codes. The processor / speaker encoder that produces these references is not bundled here; consume the upstream processor or a precompiled speaker artefact.
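
As a rough illustration of the [S1]/[S2] tagging, the sketch below joins dialogue turns into a single tagged string. It covers only the text side: the reference codes and any extra tokens the upstream processor adds are out of scope, and the exact layout (no separator between turns, stripped whitespace) is an assumption rather than something this card specifies. `build_dialogue_prompt` is a hypothetical helper.

```python
from typing import List, Tuple


def build_dialogue_prompt(turns: List[Tuple[int, str]]) -> str:
    """Join (speaker, text) turns into one string tagged with [S1] / [S2]."""
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError("the model is two-speaker; speaker must be 1 or 2")
        parts.append(f"[S{speaker}]{text.strip()}")
    return "".join(parts)


prompt = build_dialogue_prompt([
    (1, "Did the conversion finish?"),
    (2, "Yes, both files are ready."),
])
print(prompt)
```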
