Qwen3.6-27B-DSV4Pro-Thinking-Distill

🇬🇧 English · 🇨🇳 中文 ⬇️

🇬🇧 English

On Qwen3.6-27B (Dense, 64 layers, Gated DeltaNet linear/full-attention hybrid), we use LoRA to distill the way DeepSeek-V4-Pro reasons (with thinking-on) plus its agentic behavior.

This is the Dense counterpart of the 35B-A3B (MoE) sister model: same R6000 GPU, same teacher, same recipe, swapped onto a Dense architecture — proving the gains come from the distilled thinking style, not an MoE architectural bonus. A native MTP head is welded on for single-stream acceleration.

⚠️ Distilling a thinking style ≠ distilling knowledge/capability: the goal is "learn how to reason and how to converge", not to inject knowledge or raise the capability ceiling.

Training details

Base: Qwen3.6-27B (Dense, BF16 base)
Method: LoRA, r = 64, α = 128, dropout = 0.05, targets = all attention + MLP projections
Optim: paged_adamw_8bit, cosine LR, warmup 0.03, ~1 epoch
Teacher: DeepSeek-V4-Pro (thinking-on + agentic)
Data: ~1842 distillation samples (lynn_prod spec). Trajectories = DS-V4-Pro multi-step reasoning under thinking-on (<think>) + ReAct-style tool calls (think one step → call one tool → observe → loop).
- The tool "execution results" are SIMULATED, not actually run: in the multi-turn tool calls, each "execution result" line is improvised by a small, fast model (DeepSeek-V4-Flash) role-playing the "runtime" — not obtained by actually running code in a sandbox. So it differs from real execution.
- Training masks those fabricated results — the model learns only "how to think / how to call tools", not the made-up outputs: because the results are fake, training on them would teach the model the bad habit of fabricating tool return values; so we optimize only the model's own "reasoning + tool-call" tokens.
Artifacts: merged → BF16 safetensors → gguf/ Q4_K_M-imatrix (with native MTP)

Attribution (the method is not original — it is a combination of published techniques)

ReAct (interleaved reasoning + acting): Yao et al., 2022, arXiv:2210.03629 (ICLR 2023)
STaR (bootstrapping reasoning traces): Zelikman et al., 2022, arXiv:2203.14465
Self-Instruct / Baize self-chat: Wang et al., 2022; Xu et al., 2023, arXiv:2304.01196
AgentTuning: Zeng et al., 2023, arXiv:2310.12823
ToolBench / ToolLLM (tool use): Qin et al., 2023, arXiv:2307.16789
DeepSeek-R1 reasoning distillation: DeepSeek-AI, 2025, arXiv:2501.12948

Evaluation (Q4_K_M, archived harness): same harness, thinking-on, vs. the original Qwen3.6-27B

Quantization parity (important — prevents misreading): this model and the base are both Q4_K_M (imatrix-corrected) GGUF + native MTP, fully same-spec — same Q4_K_M, same imatrix, same MTP. The only variable is "distilled or not". This is not distilled-Q4_K_M vs base-BF16 (which would be unfair); the Δ below is cleanly attributable to distillation itself, with no quantization difference mixed in.

Dimension	This model (distill)	Original base	Δ
GPQA-Diamond-198	80.81% (160/198, 32K)	73.7% (146/198)	+7.1
MMLU-500 (5-shot)	91.8% (459/500)	91.6%	+0.2
GPQA unconverged empty answers (parse_fail)	0	14	−14
coding-100 (10 langs × 10)	86/100	83/100	+3
Agentic SOLO (20 complex tasks)	16/20	13/20	+3

Reading: hard reasoning improves markedly (GPQA +7.1pp, 160/198 = 80.81%, 0 error / 0 parse_fail), and knowledge does not drop — it even nudges up (MMLU +0.2pp), while "finish thinking, then converge" holds — GPQA unconverged empty answers fall from 14 to 0 (the base's 14 were all cases that thought to the 32K limit without ever giving an answer; after distillation, zero). Median generation length is compressed to ~3006 tokens. Note: the 35B-A3B distill lost 1.6pp MMLU, whereas the 27B Dense has more capacity — it fits the distillation without crowding out knowledge: GPQA up, MMLU not down, a cleaner "pure gain".

coding-100: same harness, a real sandbox runs the code and checks whether the tests actually pass (objective). distill 86 ≥ base 83 — coding ability did not drop, and is slightly higher.

Agentic SOLO: the model orchestrates + executes 20 complex tasks by itself; judge = the task/harness author (who knows best whether it was "actually done"). distill 16 > base 13. ⚠️ This metric is judge-subjective (a stricter judge ties the two), so treat it as a trend — the hard numbers are GPQA / coding.

Q5_K_M evaluation — distill vs base, streaming harness (re-run)

Protocol (annotated — DIFFERENT from the Q4_K_M section above; do not cross-compare tiers): both this distill and the base are Q5_K_M-imatrix GGUF + native MTP, same-spec, served base-mode (MTP off) for a concurrency eval. Distill harness = SSE streaming (stream=True), timeout 1800s · concurrency 4 · max_tokens 32000, thinking-on, temp 0.6 / top_p 0.95, finish_reason logged per question. Base GPQA result is from the original conc=4 run (non-streaming), but finish_reason data confirms 0 errors (zero false timeouts) — the 12 length hits each show completion_tokens=32768, i.e. genuinely unconverged, not harness artifacts. Base MMLU re-run uses the same SSE streaming harness as distill.

Dimension	Distill Q5_K_M	Base Q5_K_M	Δ
GPQA-Diamond-198	81.82% (162/198)	68.69% (136/198)	+13.13pp
MMLU-500 (5-shot)	90.0% (450/500)	89.6% (448/500)	+0.4pp
GPQA `finish=stop` (converged)	198 / 198	186 / 198
GPQA `finish=length` (hit 32K wall, never answered)	0	12
GPQA errors (timeout/etc.)	0	0

Reading: under the streaming harness (zero false-timeouts, confirmed by errors=0), the distill converges on every single question (198/198 stop, 0 length), while the base runs into the 32K wall on 12 questions (length, never produces an answer). This is the hardest, cleanest evidence of the distillation's "learn to converge / 收口" effect — now quantified by finish_reason, not just accuracy.

⚠️ Do NOT cross-compare quant tiers: the Q4_K_M table uses an older (non-streaming) harness; this Q5_K_M table uses the streaming harness. Comparing e.g. base-Q4 vs base-Q5 across tiers is meaningless (harness differs). Only the within-tier distill-vs-base Δ is valid.

MTP (multi-token prediction) single-stream acceleration — measured best config + lossless

两条 MTP 通路:GGUF 走 llama.cpp(--spec-type draft-mtp,见下);BF16 / FP8 safetensors 走 vLLM / SGLang(--speculative-config '{"method":"mtp","num_speculative_tokens":3}')—— BF16 与 FP8 仓现均已焊原生 nextn 头(SGLang 实测 accept 0.76–0.88;BF16 在 Ampere 等无原生 FP8 的卡上更合适)。 Two MTP paths: GGUF via llama.cpp (--spec-type draft-mtp, below); BF16 / FP8 safetensors via vLLM / SGLang (--speculative-config '{"method":"mtp","num_speculative_tokens":3}') — both repos now bundle the native nextn head (SGLang-measured accept 0.76–0.88).

This model's gguf contains a native MTP head (mainline llama.cpp --spec-type draft-mtp; no -md / external draft model needed).

Best config measured (single-stream, Q4_K_M-imatrix; tested on DGX Spark GB10, unified-memory bandwidth-bound — Mac / Blackwell RTX-50 (FP4) can be faster):

`--spec-draft-n-max` (p-min=0)	single-stream TPS	draft accept rate	mean accept len
bare no-MTP	10.4	—	—
n-max=2	24.1	0.82	2.64
n-max=3 ⭐ (recommended)	26.8	0.72	3.16
n-max=4	27.4	0.65	3.62

2.3–2.6× single-stream speedup (vs bare 10.4 TPS); n-max=3 is the throughput/accept-rate balance point.
Greedy speculative decoding is lossless by construction: it only accepts the target-argmax token. Batched-verify GEMM rounding produces character-level differences on near-tie tokens — this is FP non-determinism, not quality loss (any two independent runs show it, even with MTP off).
Speculation is a single-stream latency tool; concurrency degrades it (spec tokens take up KV/batch capacity) — for throughput scenarios use bare multi-concurrency mode.
Recommended launch: llama-server -m *-MTP-Q4_K_M-imatrix.gguf --spec-type draft-mtp --spec-draft-n-max 3 --jinja

Note: single-stream TPS varies by content — coding prompts accept ~0.72 → 26.8 t/s, reasoning prompts ~0.58 → 24.6 t/s (n-max=3, all measured on DGX Spark GB10). The current MTP is the base's native nextn head grafted on (lossless); the base head predicts a bit weakly on the post-distillation reasoning distribution, so the reasoning accept rate is lower. A distill-specific retrained MTP head (to pull accept back to ~0.8) is on the roadmap.

Eval protocol

thinking-on; temp 0.6 / top_p 0.95 (required for thinking models — greedy loops to death); max_tokens 32768; read-timeout ≥ 2400s. The same spec is applied to every compared model.

Limitations

Distills thinking style, not capability: black-box SFT cannot raise the knowledge ceiling.
Tool execution results are "simulated", not actually run:
- This version (compromise): each "execution result" line in the multi-turn tool calls is improvised by a small model (DeepSeek-V4-Flash) role-playing the "runtime", not obtained by actually running code in a sandbox. Chosen purely for cost and speed — real execution needs a full "generate → run in a real sandbox → feed results back to the teacher → continue" agentic harness, which is slow and heavy; one simulated pass is enough. This is an engineering trade-off, not because it is better.
- Cost (sim-to-real gap): a simulated result can be wrong (Flash may optimistically fabricate "tests passed" when that code would actually crash) → the model can learn from "fake-success" trajectories, and may even acquire the tendency to fabricate tool return values itself.
- Optimal approach (coming in the next version) = real-sandbox execution + rejection sampling: every tool call runs in a real environment to get a real result, then a judge keeps only the trajectories that genuinely solved the task and discards the failed ones — eliminating "fake success" at the root. We have already implemented this pipeline (real sandbox + DS judge), but this version's data did not use it; the next distillation will be redone with it.
- Note: simulation ≠ rejection sampling — simulation is about "how the observation is obtained" (fabricate vs. really run); rejection sampling is about "filtering out the wrong ones by real outcome". Because simulation never really runs, it leaves no ground on which rejection sampling could even operate.

Files

*.safetensors — BF16 merged weights (SGLang / vLLM / transformers)
gguf/Qwen3.6-27B-DSV4Pro-Distill-MTP-Q4_K_M-imatrix.gguf — the only GGUF, native MTP version (Q4_K_M-imatrix). Add --spec-type draft-mtp for the fastest single-stream; without that flag it is just a normal Q4_K_M model (MTP head inactive) — so no separate "non-MTP plain version" is provided, to keep anyone from downloading the wrong file and thinking it lacks MTP.
NVFP4 (W4A16-style) — quality-first ModelOpt NVFP4. Language MLP gate/up/down_proj compressed to FP4; attention, Mamba, vision, embeddings, lm_head, norms kept high-precision. vLLM/SGLang high-concurrency. GPQA 82.83% / MMLU 87.80%. Single-stream MTP → use GGUF.

Inference

thinking-on, always use temp=0.6, top_p=0.95 (never greedy). llama.cpp: gguf + --jinja (MTP version add --spec-type draft-mtp --spec-draft-n-max 3); SGLang / vLLM: safetensors.

🇨🇳 中文版

在 Qwen3.6-27B（Dense，64 层，Gated DeltaNet 线性/全注意力混合）上，用 LoRA 蒸馏 DeepSeek-V4-Pro 在「思考开启(thinking-on)」时的思维方式 + agentic 行为。

这是 35B-A3B(MoE)姊妹版的 Dense 复现：同一台 R6000、同一 teacher、同一套配方，换到 Dense 架构——证明提升来自蒸进去的思维方式，不是 MoE 架构红利。并焊了原生 MTP 做单流加速。

⚠️ 蒸思维方式 ≠ 蒸知识/能力：目标是「学会怎么想、怎么收口」，不是蒸知识或扩能力上限。

训练配置(如实披露)

基座 Base：Qwen3.6-27B(Dense，BF16 基座)
方法 Method：LoRA，r = 64，α = 128，dropout = 0.05，target = 全部注意力 + MLP 投影
优化：paged_adamw_8bit，cosine LR，warmup 0.03，约 1 epoch
Teacher：DeepSeek-V4-Pro（thinking-on + agentic）
数据 Data：~1842 条蒸馏样本（lynn_prod 口径）。轨迹 = DS-V4-Pro 在 thinking-on 下的多步推理（<think>）+ ReAct 式工具调用（想一步 → 调一次工具 → 看结果，循环）。
- 工具的「执行结果」是模拟的,不是真跑的：多轮工具调用里那一行行「执行结果」，是用一个又小又快的模型（DeepSeek-V4-Flash）扮演"运行环境"现编出来的，并不是真的在沙箱里跑代码得到的——所以和真实运行有差距。
- **训练时只学"怎么想、怎么调工具",不学那些编出来的"执行结果"**：因为执行结果是假的，如果让模型去学它，模型就会养成"自己瞎编工具返回值"的坏习惯；所以我们只优化模型自己产出的「思考 + 工具调用」部分。
产物：合并 → BF16 safetensors → gguf/ Q4_K_M-imatrix（含原生 MTP 版）

方法非自创，是公开技术的组合(如实归因)

ReAct（推理+行动交替）：Yao et al., 2022, arXiv:2210.03629(ICLR 2023)
**STaR(reasoning trace 自举)**：Zelikman et al., 2022, arXiv:2203.14465
Self-Instruct / Baize 自对话：Wang et al., 2022;Xu et al., 2023, arXiv:2304.01196
AgentTuning：Zeng et al., 2023, arXiv:2310.12823
**ToolBench / ToolLLM(工具调用)**：Qin et al., 2023, arXiv:2307.16789
DeepSeek-R1 推理蒸馏：DeepSeek-AI, 2025, arXiv:2501.12948

评测(Q4_K_M,旧 harness):同一 harness,thinking-on,vs 原版 Qwen3.6-27B

量化口径(重要,防误读):本模型与原版 base 都是 Q4_K_M(imatrix 校正)GGUF + 原生 MTP,完全同口径 —— 同 Q4_K_M、同 imatrix、同 MTP,唯一变量是"是否蒸馏"。不是拿蒸馏-Q4_K_M 去比 base-BF16(那样不公平);下面的 Δ 干净地归因于蒸馏本身,不掺量化差异。

维度	本模型(蒸馏)	原版 base	Δ
GPQA-Diamond-198	80.81%(160/198,32K)	73.7%(146/198)	+7.1
MMLU-500 (5-shot)	91.8%(459/500)	91.6%	+0.2
GPQA 未收口空答 (parse_fail)	0	14	-14
coding-100 (10 语言×10)	86/100	83/100	+3
Agentic SOLO (20 复杂任务)	16/20	13/20	+3

解读：硬推理显著提升(GPQA +7.1pp,160/198=80.81%,0 error/0 parse_fail)、知识不降反微涨(MMLU +0.2pp),且「想完就收口」—— GPQA 未收口空答从 14 降到 0（base 那 14 个全是思考到 32K 上限还没给答案;蒸馏后彻底归零)。中位生成长度压到 ~3006 token。注:35B-A3B 蒸馏 MMLU 掉 1.6pp,27B Dense 容量更大、装得下蒸馏而不挤占知识 —— **GPQA 涨、MMLU 不降,更干净的"纯赚"**。

coding-100:同一 harness、真沙箱跑代码看测试是否真过(客观)。distill 86 ≥ base 83,coding 能力没掉、还略高。

Agentic SOLO:模型自己编排+自己执行 20 道复杂任务,判官 = 出题/harness 作者(对"做没做到"最清楚)。distill 16 > base 13。⚠️ 此项判官主观性强(换更严判官两者打平),作趋势参考,硬指标看 GPQA/coding。

Q5_K_M 评测 —— 蒸馏 vs 原版,流式 harness(重测)

口径(已标注 —— 与上方 Q4_K_M 段口径不同,禁止跨档比):本蒸馏与原版**均为 Q5_K_M-imatrix GGUF + 原生 MTP、同规格、base 模式(MTP 关)**跑并发评测。Harness = SSE 流式(stream=True —— 根除非流式客户端"干等满整段"造成的假超时,否则长思考会被误判超时),timeout 1800s · 并发 4 · max_tokens 32000,thinking-on,temp 0.6 / top_p 0.95。逐题记 finish_reason → stop(收口)/ length(撞 32K 墙、没答案)/ error。

维度	蒸馏 Q5_K_M	原版 Q5_K_M	Δ
GPQA-Diamond-198	81.82%(162/198)	68.69%(136/198)	+13.13pp
MMLU-500(5-shot)	90.0%(450/500)	89.6%(448/500)	+0.4pp
GPQA `finish=stop`(收口)	198 / 198	186 / 198
GPQA `finish=length`(撞 32K 墙、始终没答)	0	12
GPQA error(超时等)	0	0

解读:流式 harness 下(errors=0 证明零假超时),蒸馏每题都收口(198/198 stop、0 length),而原版有 12 题撞 32K 墙(length、始终给不出答案)。这是蒸馏「学会收口」最硬、最干净的证据 —— 由 finish_reason 量化,不只看准确率。

⚠️ 禁止跨量化档比:Q4_K_M 表用旧(非流式)harness,本 Q5_K_M 表用流式 harness。跨档比(如 base-Q4 vs base-Q5)无意义(harness 不同)。只有同档内蒸馏-vs-原版的 Δ 有效。

MTP(多 token 预测)单流加速 — 实测最佳配置 + 无损

本模型的 gguf 含原生 MTP 头（mainline llama.cpp --spec-type draft-mtp，无需 -md / 外挂 draft 模型）。

**最佳配置实测(单流,Q4_K_M-imatrix；测于 DGX Spark GB10,统一内存带宽受限 —— Mac / Blackwell RTX-50(FP4) 可更快)**：

`--spec-draft-n-max`(p-min=0)	单流 TPS	draft 接受率	平均接受长度
裸版 no-MTP	10.4	—	—
n-max=2	24.1	0.82	2.64
n-max=3 ⭐(推荐)	26.8	0.72	3.16
n-max=4	27.4	0.65	3.62

2.3–2.6× 单流加速（vs 裸版 10.4 TPS）；n-max=3 是吞吐/接受率平衡点。
贪心投机解码构造上无损：only accepts target-argmax token。批量 verify 的 GEMM 舍入会在 near-tie token 上产生字符级差异,这是 FP 非确定性、非质量损失（任意两次独立进程都会,哪怕都不开 MTP）。
投机=单流延迟工具,并发会退化（spec token 占 KV/batch 容量）—— 吞吐场景请用多并发裸版。
推荐启动:llama-server -m *-MTP-Q4_K_M-imatrix.gguf --spec-type draft-mtp --spec-draft-n-max 3 --jinja

注:单流 TPS 因内容而异——编码类 prompt 接受率 ~0.72 → 26.8 t/s,推理类 ~0.58 → 24.6 t/s(n-max=3,均测于 DGX Spark GB10)。当前 MTP 为 base 原生 nextn 头嫁接(无损);base 头在蒸馏后偏移的推理分布上预测偏弱,故推理类接受率偏低。蒸馏专属重训 MTP 头(把接受率拉回 ~0.8)在路线图上。

各量化档 MTP 速度对比 / Per-quant MTP speed

所有 gguf 均焊原生 MTP。实测 DGX Spark GB10,单流,coding prompt,thinking-on,--spec-draft-n-max 3:

量化档 / Quant	体积 / Size	裸版 base TPS	MTP TPS	加速 / Speedup	接受率 / Accept
`Q4_K_M-imatrix`	~16 GB	10.4	26.8	2.6×	0.72
`Q5_K_M-imatrix`	18.2 GB	10.37	24.65	2.38×	0.72
`Q6_K-imatrix`	20.9 GB	9.18	22.07	2.40×	0.73
`Q8_0`	29 GB	7.82	17.12	2.19×	0.67

越小越快(内存带宽受限);MTP 全档 2.2–2.6× 加速,各档输出实测均正确(回文 / fibonacci 等编码题)。**Q8_0 ≈ BF16 质量**(8-bit 近无损;不带 imatrix——均匀 8-bit,重要性加权对它无意义)。 Smaller = faster (memory-bandwidth-bound); MTP gives 2.2–2.6× across all tiers; Q8_0 ≈ BF16 quality (near-lossless 8-bit).

评测口径 / Eval protocol

thinking-on;temp 0.6 / top_p 0.95(thinking 模型必需,greedy 会重复死循环);max_tokens 32768;read-timeout ≥2400s。同口径作用于所有对比模型。

局限 / Limitations

蒸思维方式,非蒸能力:黑盒 SFT 抬不高知识天花板。
工具执行结果是"模拟"的,不是真跑出来的:
- 本版(迁就方案):多轮工具调用里那一行行「执行结果」,是用一个小模型(DeepSeek-V4-Flash)扮演"运行环境"现编的,不是真在沙箱里跑代码得到的。选它纯粹是为了省成本、快——真实执行需要一整套"边生成边在真沙箱里跑、再把结果喂回 teacher 继续"的 agentic harness,慢且重;模拟一遍过就行。这是工程上的取舍,不是因为它更好。
- 代价(sim-to-real gap):模拟结果可能是错的(flash 会乐观地编一句"测试通过",但那段代码真跑其实会挂)→ 模型可能从"假成功"的轨迹里学到东西,甚至养成"自己瞎编工具返回值"的倾向。
- 最优方案(下一版补)= 真沙箱执行 + 拒绝采样:每一步工具调用都在真实环境里跑出真结果,再用判官只保留"真正把任务做对"的轨迹、扔掉失败的,从根上消除"假成功"。这条管线我们已经实现(真沙箱 + DS 判官),但本版数据未纳入,下一版蒸馏会用它重做。
- 注:模拟 ≠ 拒绝采样——模拟是"怎么拿到 observation"(编 vs 真跑),拒绝采样是"按真实结果筛掉做错的";模拟因为没真跑,反而让拒绝采样无从谈起。

文件 / Files

*.safetensors — BF16 合并权重(SGLang / vLLM / transformers)
gguf/ — 4 档原生 MTP GGUF(都焊 MTP;加 --spec-type draft-mtp 单流最快,不加即当普通 gguf 用,MTP 头不激活):
- …-MTP-Q4_K_M-imatrix.gguf(~16 GB,最快)
- …-MTP-Q5_K_M-imatrix.gguf(18.2 GB)
- …-MTP-Q6_K-imatrix.gguf(20.9 GB)
- …-MTP-Q8_0.gguf(29 GB,≈ BF16 质量,无 imatrix)
FP8(block-128 e4m3 + 原生 MTP,SGLang serving)在独立仓 Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8
**NVFP4 W4A16 风格**（质量优先，vLLM/SGLang 高并发）—— language MLP gate/up/down_proj NVFP4，其余模块保高精度。GPQA 82.83% / MMLU 87.80%。单流加速请用 GGUF + MTP。

推理 / Inference

thinking-on,务必 temp=0.6, top_p=0.95(切勿 greedy)。llama.cpp 用 gguf + --jinja(MTP 版加 --spec-type draft-mtp --spec-draft-n-max 3);SGLang/vLLM 用 safetensors。