⚡ QwenPaw-Flash-9B-MTP

MTP Speculative Decoding · Agent-Optimized · 1.5–4.1× Speedup

MTP Dense 9B Q6_K / Q8_0 / Q4_K_M 262K Context Vision Encoder

The original QwenPaw-Flash-9B (no abliteration) with MTP (Multi-Token Prediction) head weights injected from the official Qwen3.5-9B base model. By reconstructing the MTP speculative decoding head — stripped during QwenPaw fine-tuning — this model achieves up to 4.1× inference speedup while maintaining accuracy. Heretic (uncensored) version: QwenPaw-Flash-9B-heretic-MTP-GGUF

⚡ MTP Speculative Decoding

What is MTP? Multi-Token Prediction (MTP) is a speculative decoding technique where a small "draft head" predicts multiple future tokens in parallel. The main model then verifies these drafts in a single forward pass, accepting correct predictions for up to 2–4× speedup in practice.
Injection Method The official Qwen3.5-9B ships with a 4-layer MTP head (~243M params). During QwenPaw fine-tuning, the MTP head weights were stripped — only the config placeholder `mtp_num_hidden_layers: 1` remained.
Recovery Process 1. Download official Qwen3.5-9B base model weights 2. Extract MTP layer weights (15 tensors starting with `mtp.`) 3. Merge into QwenPaw-Flash-9B safetensors 4. Convert to GGUF with `convert_hf_to_gguf.py` (with MTP support) 5. Quantize with `llama-quantize`
Why This Works The MTP head is a lightweight 4-layer MLP decoder that maps the main model's last hidden state to future token logits. It sits entirely in speculative decoding space — the main model's weights are unchanged, so no fine-tuning is needed. The head simply needs to exist with compatible dimensions for llama.cpp's `--spec-type draft-mtp` to activate.

🏗️ Architecture

Type	Qwen3_5ForConditionalGeneration (multimodal with vision encoder) + MTP spec head
Main Model	~9B parameters
MTP Head	~243M parameters (2.7% overhead)
Layers	32 (hybrid: Gated DeltaNet + Gated Attention) + 4 MTP decoder layers
Context Length	262,144 tokens
Speculative Decoding	`--spec-type draft-mtp` with `--spec-draft-n-max 2`
MTP Acceptance Rate	~50% (measured on heretic version)

📦 GGUF Files

File	Size	Type	Notes
`QwenPaw-Flash-9B-MTP-BF16.gguf`	17.14 GB	BF16	Full precision, reference quality
`QwenPaw-Flash-9B-MTP-Q8_0.gguf`	9.11 GB	Q8_0	~8.5 bpw, near-lossless
`QwenPaw-Flash-9B-MTP-Q6_K.gguf`	7.04 GB	Q6_K	✅ Recommended, best value
`QwenPaw-Flash-9B-MTP-Q4_K_M.gguf`	5.38 GB	Q4_K_M	Compact, best size/quality tradeoff
`mmproj-QwenPaw-Flash-9B-heretic-BF16.gguf`	0.86 GB	BF16	Vision encoder (multimodal)

🚀 Usage

With MTP Speculative Decoding
llama-server -m QwenPaw-Flash-9B-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --spec-type draft-mtp --spec-draft-n-max 2 \ --host 0.0.0.0 --port 8088
Without MTP (fallback)
# Just omit spec args — works as a normal GGUF llama-server -m QwenPaw-Flash-9B-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --host 0.0.0.0 --port 8088

Compatible with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF runtimes. For MTP, use --spec-type draft-mtp. The MTP head is a lossless copy from Qwen3.5-9B — no training involved.

🔗 Credits

Base Model: agentscope-ai/QwenPaw-Flash-9B
MTP Head Source: Qwen/Qwen3.5-9B
Quantization Tool: llama.cpp · GitHub

Downloads last month: 3,599

GGUF

Model size

0.5B params

Architecture

clip

Hardware compatibility

4-bit

6-bit

8-bit

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SC117/QwenPaw-Flash-9B-MTP-GGUF

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

Qwen/Qwen3.5-9B

Quantized

(338)

this model

Collection including SC117/QwenPaw-Flash-9B-MTP-GGUF

Qwenpaw-Flash

Collection

QwenPaw-Flash is a lightweight model deeply optimized for the QwenPaw autonomous agent scenario. Since its training phase, the model has been specific • 7 items • Updated 12 days ago • 1