⚑ QwenPaw-Flash-9B-MTP

MTP Speculative Decoding Β· Agent-Optimized Β· 1.5–4.1Γ— Speedup

πŸ“– δΈ­ζ–‡ζ–‡ζ‘£

MTP Dense 9B Q6_K / Q8_0 / Q4_K_M 262K Context Vision Encoder

The original QwenPaw-Flash-9B (no abliteration) with MTP (Multi-Token Prediction) head weights injected from the official Qwen3.5-9B base model. By reconstructing the MTP speculative decoding head β€” stripped during QwenPaw fine-tuning β€” this model achieves up to 4.1Γ— inference speedup while maintaining accuracy. Heretic (uncensored) version: QwenPaw-Flash-9B-heretic-MTP-GGUF

⚑ MTP Speculative Decoding
What is MTP?
Multi-Token Prediction (MTP) is a speculative decoding technique where a small "draft head" predicts multiple future tokens in parallel. The main model then verifies these drafts in a single forward pass, accepting correct predictions for up to 2–4Γ— speedup in practice.
Injection Method
The official Qwen3.5-9B ships with a 4-layer MTP head (~243M params). During QwenPaw fine-tuning, the MTP head weights were stripped β€” only the config placeholder mtp_num_hidden_layers: 1 remained.
Recovery Process
1. Download official Qwen3.5-9B base model weights
2. Extract MTP layer weights (15 tensors starting with mtp.)
3. Merge into QwenPaw-Flash-9B safetensors
4. Convert to GGUF with convert_hf_to_gguf.py (with MTP support)
5. Quantize with llama-quantize
Why This Works
The MTP head is a lightweight 4-layer MLP decoder that maps the main model's last hidden state to future token logits. It sits entirely in speculative decoding space β€” the main model's weights are unchanged, so no fine-tuning is needed. The head simply needs to exist with compatible dimensions for llama.cpp's --spec-type draft-mtp to activate.
πŸ—οΈ Architecture
TypeQwen3_5ForConditionalGeneration (multimodal with vision encoder) + MTP spec head
Main Model~9B parameters
MTP Head~243M parameters (2.7% overhead)
Layers32 (hybrid: Gated DeltaNet + Gated Attention) + 4 MTP decoder layers
Context Length262,144 tokens
Speculative Decoding--spec-type draft-mtp with --spec-draft-n-max 2
MTP Acceptance Rate~50% (measured on heretic version)
πŸ“¦ GGUF Files
File Size Type Notes
QwenPaw-Flash-9B-MTP-BF16.gguf17.14 GBBF16Full precision, reference quality
QwenPaw-Flash-9B-MTP-Q8_0.gguf9.11 GBQ8_0~8.5 bpw, near-lossless
QwenPaw-Flash-9B-MTP-Q6_K.gguf7.04 GBQ6_Kβœ… Recommended, best value
QwenPaw-Flash-9B-MTP-Q4_K_M.gguf5.38 GBQ4_K_MCompact, best size/quality tradeoff
mmproj-QwenPaw-Flash-9B-heretic-BF16.gguf0.86 GBBF16Vision encoder (multimodal)
πŸš€ Usage
With MTP Speculative Decoding
llama-server -m QwenPaw-Flash-9B-MTP-Q6_K.gguf \
  -ngl 99 -fa on -c 8192 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  --host 0.0.0.0 --port 8088
Without MTP (fallback)
# Just omit spec args β€” works as a normal GGUF
llama-server -m QwenPaw-Flash-9B-MTP-Q6_K.gguf \
  -ngl 99 -fa on -c 8192 \
  --host 0.0.0.0 --port 8088

Compatible with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF runtimes. For MTP, use --spec-type draft-mtp. The MTP head is a lossless copy from Qwen3.5-9B β€” no training involved.

πŸ”— Credits

Base Model: agentscope-ai/QwenPaw-Flash-9B
MTP Head Source: Qwen/Qwen3.5-9B
Quantization Tool: llama.cpp Β· GitHub

Downloads last month
3,599
GGUF
Model size
0.5B params
Architecture
clip
Hardware compatibility
Log In to add your hardware

4-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for SC117/QwenPaw-Flash-9B-MTP-GGUF

Finetuned
Qwen/Qwen3.5-9B
Quantized
(338)
this model

Collection including SC117/QwenPaw-Flash-9B-MTP-GGUF