Qwen3.6-27B-DSV4Pro-Thinking-Distill

🇬🇧 English · 🇨🇳 中文 ⬇️


🇬🇧 English

On Qwen3.6-27B (Dense, 64 layers, Gated DeltaNet linear/full-attention hybrid), we use LoRA to distill the way DeepSeek-V4-Pro reasons (with thinking-on) plus its agentic behavior.

This is the Dense counterpart of the 35B-A3B (MoE) sister model: same R6000 GPU, same teacher, same recipe, swapped onto a Dense architecture — proving the gains come from the distilled thinking style, not an MoE architectural bonus. A native MTP head is welded on for single-stream acceleration.

⚠️ Distilling a thinking style ≠ distilling knowledge/capability: the goal is "learn how to reason and how to converge", not to inject knowledge or raise the capability ceiling.

Training details

  • Base: Qwen3.6-27B (Dense, BF16 base)
  • Method: LoRA, r = 64, α = 128, dropout = 0.05, targets = all attention + MLP projections
  • Optim: paged_adamw_8bit, cosine LR, warmup 0.03, ~1 epoch
  • Teacher: DeepSeek-V4-Pro (thinking-on + agentic)
  • Data: ~1842 distillation samples (lynn_prod spec). Trajectories = DS-V4-Pro multi-step reasoning under thinking-on (<think>) + ReAct-style tool calls (think one step → call one tool → observe → loop).
    • The tool "execution results" are SIMULATED, not actually run: in the multi-turn tool calls, each "execution result" line is improvised by a small, fast model (DeepSeek-V4-Flash) role-playing the "runtime" — not obtained by actually running code in a sandbox. So it differs from real execution.
    • Training masks those fabricated results — the model learns only "how to think / how to call tools", not the made-up outputs: because the results are fake, training on them would teach the model the bad habit of fabricating tool return values; so we optimize only the model's own "reasoning + tool-call" tokens.
  • Artifacts: merged → BF16 safetensors → gguf/ Q4_K_M-imatrix (with native MTP)

Attribution (the method is not original — it is a combination of published techniques)

  • ReAct (interleaved reasoning + acting): Yao et al., 2022, arXiv:2210.03629 (ICLR 2023)
  • STaR (bootstrapping reasoning traces): Zelikman et al., 2022, arXiv:2203.14465
  • Self-Instruct / Baize self-chat: Wang et al., 2022; Xu et al., 2023, arXiv:2304.01196
  • AgentTuning: Zeng et al., 2023, arXiv:2310.12823
  • ToolBench / ToolLLM (tool use): Qin et al., 2023, arXiv:2307.16789
  • DeepSeek-R1 reasoning distillation: DeepSeek-AI, 2025, arXiv:2501.12948

Evaluation (Q4_K_M, archived harness): same harness, thinking-on, vs. the original Qwen3.6-27B

Quantization parity (important — prevents misreading): this model and the base are both Q4_K_M (imatrix-corrected) GGUF + native MTP, fully same-spec — same Q4_K_M, same imatrix, same MTP. The only variable is "distilled or not". This is not distilled-Q4_K_M vs base-BF16 (which would be unfair); the Δ below is cleanly attributable to distillation itself, with no quantization difference mixed in.

Dimension This model (distill) Original base Δ
GPQA-Diamond-198 80.81% (160/198, 32K) 73.7% (146/198) +7.1
MMLU-500 (5-shot) 91.8% (459/500) 91.6% +0.2
GPQA unconverged empty answers (parse_fail) 0 14 −14
coding-100 (10 langs × 10) 86/100 83/100 +3
Agentic SOLO (20 complex tasks) 16/20 13/20 +3

Reading: hard reasoning improves markedly (GPQA +7.1pp, 160/198 = 80.81%, 0 error / 0 parse_fail), and knowledge does not drop — it even nudges up (MMLU +0.2pp), while "finish thinking, then converge" holds — GPQA unconverged empty answers fall from 14 to 0 (the base's 14 were all cases that thought to the 32K limit without ever giving an answer; after distillation, zero). Median generation length is compressed to ~3006 tokens. Note: the 35B-A3B distill lost 1.6pp MMLU, whereas the 27B Dense has more capacity — it fits the distillation without crowding out knowledge: GPQA up, MMLU not down, a cleaner "pure gain".

coding-100: same harness, a real sandbox runs the code and checks whether the tests actually pass (objective). distill 86 ≥ base 83 — coding ability did not drop, and is slightly higher.

Agentic SOLO: the model orchestrates + executes 20 complex tasks by itself; judge = the task/harness author (who knows best whether it was "actually done"). distill 16 > base 13. ⚠️ This metric is judge-subjective (a stricter judge ties the two), so treat it as a trend — the hard numbers are GPQA / coding.

Q5_K_M evaluation — distill vs base, streaming harness (re-run)

Protocol (annotated — DIFFERENT from the Q4_K_M section above; do not cross-compare tiers): both this distill and the base are Q5_K_M-imatrix GGUF + native MTP, same-spec, served base-mode (MTP off) for a concurrency eval. Distill harness = SSE streaming (stream=True), timeout 1800s · concurrency 4 · max_tokens 32000, thinking-on, temp 0.6 / top_p 0.95, finish_reason logged per question. Base GPQA result is from the original conc=4 run (non-streaming), but finish_reason data confirms 0 errors (zero false timeouts) — the 12 length hits each show completion_tokens=32768, i.e. genuinely unconverged, not harness artifacts. Base MMLU re-run uses the same SSE streaming harness as distill.

Dimension Distill Q5_K_M Base Q5_K_M Δ
GPQA-Diamond-198 81.82% (162/198) 68.69% (136/198) +13.13pp
MMLU-500 (5-shot) 90.0% (450/500) 89.6% (448/500) +0.4pp
GPQA finish=stop (converged) 198 / 198 186 / 198
GPQA finish=length (hit 32K wall, never answered) 0 12
GPQA errors (timeout/etc.) 0 0

Reading: under the streaming harness (zero false-timeouts, confirmed by errors=0), the distill converges on every single question (198/198 stop, 0 length), while the base runs into the 32K wall on 12 questions (length, never produces an answer). This is the hardest, cleanest evidence of the distillation's "learn to converge / 收口" effect — now quantified by finish_reason, not just accuracy.

⚠️ Do NOT cross-compare quant tiers: the Q4_K_M table uses an older (non-streaming) harness; this Q5_K_M table uses the streaming harness. Comparing e.g. base-Q4 vs base-Q5 across tiers is meaningless (harness differs). Only the within-tier distill-vs-base Δ is valid.

MTP (multi-token prediction) single-stream acceleration — measured best config + lossless

两条 MTP 通路:GGUF 走 llama.cpp(--spec-type draft-mtp,见下);BF16 / FP8 safetensors 走 vLLM / SGLang(--speculative-config '{"method":"mtp","num_speculative_tokens":3}')—— BF16 与 FP8 仓现均已焊原生 nextn 头(SGLang 实测 accept 0.76–0.88;BF16 在 Ampere 等无原生 FP8 的卡上更合适)。 Two MTP paths: GGUF via llama.cpp (--spec-type draft-mtp, below); BF16 / FP8 safetensors via vLLM / SGLang (--speculative-config '{"method":"mtp","num_speculative_tokens":3}') — both repos now bundle the native nextn head (SGLang-measured accept 0.76–0.88).

This model's gguf contains a native MTP head (mainline llama.cpp --spec-type draft-mtp; no -md / external draft model needed).

Best config measured (single-stream, Q4_K_M-imatrix; tested on DGX Spark GB10, unified-memory bandwidth-bound — Mac / Blackwell RTX-50 (FP4) can be faster):

--spec-draft-n-max (p-min=0) single-stream TPS draft accept rate mean accept len
bare no-MTP 10.4
n-max=2 24.1 0.82 2.64
n-max=3 ⭐ (recommended) 26.8 0.72 3.16
n-max=4 27.4 0.65 3.62
  • 2.3–2.6× single-stream speedup (vs bare 10.4 TPS); n-max=3 is the throughput/accept-rate balance point.
  • Greedy speculative decoding is lossless by construction: it only accepts the target-argmax token. Batched-verify GEMM rounding produces character-level differences on near-tie tokens — this is FP non-determinism, not quality loss (any two independent runs show it, even with MTP off).
  • Speculation is a single-stream latency tool; concurrency degrades it (spec tokens take up KV/batch capacity) — for throughput scenarios use bare multi-concurrency mode.
  • Recommended launch: llama-server -m *-MTP-Q4_K_M-imatrix.gguf --spec-type draft-mtp --spec-draft-n-max 3 --jinja

Note: single-stream TPS varies by content — coding prompts accept ~0.72 → 26.8 t/s, reasoning prompts ~0.58 → 24.6 t/s (n-max=3, all measured on DGX Spark GB10). The current MTP is the base's native nextn head grafted on (lossless); the base head predicts a bit weakly on the post-distillation reasoning distribution, so the reasoning accept rate is lower. A distill-specific retrained MTP head (to pull accept back to ~0.8) is on the roadmap.

Eval protocol

thinking-on; temp 0.6 / top_p 0.95 (required for thinking models — greedy loops to death); max_tokens 32768; read-timeout ≥ 2400s. The same spec is applied to every compared model.

Limitations

  • Distills thinking style, not capability: black-box SFT cannot raise the knowledge ceiling.
  • Tool execution results are "simulated", not actually run:
    • This version (compromise): each "execution result" line in the multi-turn tool calls is improvised by a small model (DeepSeek-V4-Flash) role-playing the "runtime", not obtained by actually running code in a sandbox. Chosen purely for cost and speed — real execution needs a full "generate → run in a real sandbox → feed results back to the teacher → continue" agentic harness, which is slow and heavy; one simulated pass is enough. This is an engineering trade-off, not because it is better.
    • Cost (sim-to-real gap): a simulated result can be wrong (Flash may optimistically fabricate "tests passed" when that code would actually crash) → the model can learn from "fake-success" trajectories, and may even acquire the tendency to fabricate tool return values itself.
    • Optimal approach (coming in the next version) = real-sandbox execution + rejection sampling: every tool call runs in a real environment to get a real result, then a judge keeps only the trajectories that genuinely solved the task and discards the failed ones — eliminating "fake success" at the root. We have already implemented this pipeline (real sandbox + DS judge), but this version's data did not use it; the next distillation will be redone with it.
    • Note: simulation ≠ rejection sampling — simulation is about "how the observation is obtained" (fabricate vs. really run); rejection sampling is about "filtering out the wrong ones by real outcome". Because simulation never really runs, it leaves no ground on which rejection sampling could even operate.

Files

  • *.safetensors — BF16 merged weights (SGLang / vLLM / transformers)
  • gguf/Qwen3.6-27B-DSV4Pro-Distill-MTP-Q4_K_M-imatrix.ggufthe only GGUF, native MTP version (Q4_K_M-imatrix). Add --spec-type draft-mtp for the fastest single-stream; without that flag it is just a normal Q4_K_M model (MTP head inactive) — so no separate "non-MTP plain version" is provided, to keep anyone from downloading the wrong file and thinking it lacks MTP.
  • NVFP4 (W4A16-style) — quality-first ModelOpt NVFP4. Language MLP gate/up/down_proj compressed to FP4; attention, Mamba, vision, embeddings, lm_head, norms kept high-precision. vLLM/SGLang high-concurrency. GPQA 82.83% / MMLU 87.80%. Single-stream MTP → use GGUF.

Inference

thinking-on, always use temp=0.6, top_p=0.95 (never greedy). llama.cpp: gguf + --jinja (MTP version add --spec-type draft-mtp --spec-draft-n-max 3); SGLang / vLLM: safetensors.



🇨🇳 中文版

Qwen3.6-27BDense,64 层,Gated DeltaNet 线性/全注意力混合)上,用 LoRA 蒸馏 DeepSeek-V4-Pro 在「思考开启(thinking-on)」时的思维方式 + agentic 行为

这是 35B-A3B(MoE)姊妹版的 Dense 复现:同一台 R6000、同一 teacher、同一套配方,换到 Dense 架构——证明提升来自蒸进去的思维方式,不是 MoE 架构红利。并焊了原生 MTP 做单流加速。

⚠️ 蒸思维方式 ≠ 蒸知识/能力:目标是「学会怎么想、怎么收口」,不是蒸知识或扩能力上限。

训练配置(如实披露)

  • 基座 Base:Qwen3.6-27B(Dense,BF16 基座)
  • 方法 MethodLoRAr = 64,α = 128,dropout = 0.05,target = 全部注意力 + MLP 投影
  • 优化:paged_adamw_8bit,cosine LR,warmup 0.03,约 1 epoch
  • Teacher:DeepSeek-V4-Pro(thinking-on + agentic)
  • 数据 Data:~1842 条蒸馏样本(lynn_prod 口径)。轨迹 = DS-V4-Pro 在 thinking-on 下的多步推理(<think>)+ ReAct 式工具调用(想一步 → 调一次工具 → 看结果,循环)。
    • 工具的「执行结果」是模拟的,不是真跑的:多轮工具调用里那一行行「执行结果」,是用一个又小又快的模型(DeepSeek-V4-Flash)扮演"运行环境"现编出来的,并不是真的在沙箱里跑代码得到的——所以和真实运行有差距。
    • **训练时只学"怎么想、怎么调工具",不学那些编出来的"执行结果"**:因为执行结果是假的,如果让模型去学它,模型就会养成"自己瞎编工具返回值"的坏习惯;所以我们只优化模型自己产出的「思考 + 工具调用」部分。
  • 产物:合并 → BF16 safetensors → gguf/ Q4_K_M-imatrix(含原生 MTP 版)

方法非自创,是公开技术的组合(如实归因)

  • ReAct(推理+行动交替):Yao et al., 2022, arXiv:2210.03629(ICLR 2023)
  • **STaR(reasoning trace 自举)**:Zelikman et al., 2022, arXiv:2203.14465
  • Self-Instruct / Baize 自对话:Wang et al., 2022;Xu et al., 2023, arXiv:2304.01196
  • AgentTuning:Zeng et al., 2023, arXiv:2310.12823
  • **ToolBench / ToolLLM(工具调用)**:Qin et al., 2023, arXiv:2307.16789
  • DeepSeek-R1 推理蒸馏:DeepSeek-AI, 2025, arXiv:2501.12948

评测(Q4_K_M,旧 harness):同一 harness,thinking-on,vs 原版 Qwen3.6-27B

量化口径(重要,防误读):本模型与原版 base 都是 Q4_K_M(imatrix 校正)GGUF + 原生 MTP,完全同口径 —— 同 Q4_K_M、同 imatrix、同 MTP,唯一变量是"是否蒸馏"不是拿蒸馏-Q4_K_M 去比 base-BF16(那样不公平);下面的 Δ 干净地归因于蒸馏本身,不掺量化差异。

维度 本模型(蒸馏) 原版 base Δ
GPQA-Diamond-198 80.81%(160/198,32K) 73.7%(146/198) +7.1
MMLU-500 (5-shot) 91.8%(459/500) 91.6% +0.2
GPQA 未收口空答 (parse_fail) 0 14 -14
coding-100 (10 语言×10) 86/100 83/100 +3
Agentic SOLO (20 复杂任务) 16/20 13/20 +3

解读硬推理显著提升(GPQA +7.1pp,160/198=80.81%,0 error/0 parse_fail)、知识不降反微涨(MMLU +0.2pp),且「想完就收口」—— GPQA 未收口空答从 14 降到 0(base 那 14 个全是思考到 32K 上限还没给答案;蒸馏后彻底归零)。中位生成长度压到 ~3006 token。注:35B-A3B 蒸馏 MMLU 掉 1.6pp,27B Dense 容量更大、装得下蒸馏而不挤占知识 —— **GPQA 涨、MMLU 不降,更干净的"纯赚"**。

coding-100:同一 harness、真沙箱跑代码看测试是否真过(客观)。distill 86 ≥ base 83,coding 能力没掉、还略高。

Agentic SOLO:模型自己编排+自己执行 20 道复杂任务,判官 = 出题/harness 作者(对"做没做到"最清楚)。distill 16 > base 13。⚠️ 此项判官主观性强(换更严判官两者打平),作趋势参考,硬指标看 GPQA/coding。

Q5_K_M 评测 —— 蒸馏 vs 原版,流式 harness(重测)

口径(已标注 —— 与上方 Q4_K_M 段口径不同,禁止跨档比):本蒸馏与原版**均为 Q5_K_M-imatrix GGUF + 原生 MTP、同规格、base 模式(MTP 关)**跑并发评测。Harness = SSE 流式(stream=True —— 根除非流式客户端"干等满整段"造成的假超时,否则长思考会被误判超时),timeout 1800s · 并发 4 · max_tokens 32000,thinking-on,temp 0.6 / top_p 0.95。逐题记 finish_reasonstop(收口)/ length(撞 32K 墙、没答案)/ error

维度 蒸馏 Q5_K_M 原版 Q5_K_M Δ
GPQA-Diamond-198 81.82%(162/198) 68.69%(136/198) +13.13pp
MMLU-500(5-shot) 90.0%(450/500) 89.6%(448/500) +0.4pp
GPQA finish=stop(收口) 198 / 198 186 / 198
GPQA finish=length(撞 32K 墙、始终没答) 0 12
GPQA error(超时等) 0 0

解读:流式 harness 下(errors=0 证明零假超时),蒸馏每题都收口(198/198 stop、0 length),而原版有 12 题撞 32K 墙(length、始终给不出答案)。这是蒸馏「学会收口」最硬、最干净的证据 —— 由 finish_reason 量化,不只看准确率。

⚠️ 禁止跨量化档比:Q4_K_M 表用旧(非流式)harness,本 Q5_K_M 表用流式 harness。跨档比(如 base-Q4 vs base-Q5)无意义(harness 不同)。只有同档内 蒸馏-vs-原版 的 Δ 有效。

MTP(多 token 预测)单流加速 — 实测最佳配置 + 无损

本模型的 gguf 含原生 MTP 头(mainline llama.cpp --spec-type draft-mtp,无需 -md / 外挂 draft 模型)。

**最佳配置实测(单流,Q4_K_M-imatrix;测于 DGX Spark GB10,统一内存带宽受限 —— Mac / Blackwell RTX-50(FP4) 可更快)**:

--spec-draft-n-max(p-min=0) 单流 TPS draft 接受率 平均接受长度
裸版 no-MTP 10.4
n-max=2 24.1 0.82 2.64
n-max=3 ⭐(推荐) 26.8 0.72 3.16
n-max=4 27.4 0.65 3.62
  • 2.3–2.6× 单流加速(vs 裸版 10.4 TPS);n-max=3 是吞吐/接受率平衡点。
  • 贪心投机解码构造上无损:only accepts target-argmax token。批量 verify 的 GEMM 舍入会在 near-tie token 上产生字符级差异,这是 FP 非确定性、非质量损失(任意两次独立进程都会,哪怕都不开 MTP)。
  • 投机=单流延迟工具,并发会退化(spec token 占 KV/batch 容量)—— 吞吐场景请用多并发裸版。
  • 推荐启动:llama-server -m *-MTP-Q4_K_M-imatrix.gguf --spec-type draft-mtp --spec-draft-n-max 3 --jinja

注:单流 TPS 因内容而异——编码类 prompt 接受率 ~0.72 → 26.8 t/s,推理类 ~0.58 → 24.6 t/s(n-max=3,均测于 DGX Spark GB10)。当前 MTP 为 base 原生 nextn 头嫁接(无损);base 头在蒸馏后偏移的推理分布上预测偏弱,故推理类接受率偏低。蒸馏专属重训 MTP 头(把接受率拉回 ~0.8)在路线图上。

各量化档 MTP 速度对比 / Per-quant MTP speed

所有 gguf 均焊原生 MTP。实测 DGX Spark GB10,单流,coding prompt,thinking-on,--spec-draft-n-max 3:

量化档 / Quant 体积 / Size 裸版 base TPS MTP TPS 加速 / Speedup 接受率 / Accept
Q4_K_M-imatrix ~16 GB 10.4 26.8 2.6× 0.72
Q5_K_M-imatrix 18.2 GB 10.37 24.65 2.38× 0.72
Q6_K-imatrix 20.9 GB 9.18 22.07 2.40× 0.73
Q8_0 29 GB 7.82 17.12 2.19× 0.67

越小越快(内存带宽受限);MTP 全档 2.2–2.6× 加速,各档输出实测均正确(回文 / fibonacci 等编码题)。**Q8_0 ≈ BF16 质量**(8-bit 近无损;不带 imatrix——均匀 8-bit,重要性加权对它无意义)。 Smaller = faster (memory-bandwidth-bound); MTP gives 2.2–2.6× across all tiers; Q8_0 ≈ BF16 quality (near-lossless 8-bit).

评测口径 / Eval protocol

thinking-on;temp 0.6 / top_p 0.95(thinking 模型必需,greedy 会重复死循环);max_tokens 32768;read-timeout ≥2400s。同口径作用于所有对比模型。

局限 / Limitations

  • 蒸思维方式,非蒸能力:黑盒 SFT 抬不高知识天花板。
  • 工具执行结果是"模拟"的,不是真跑出来的:
    • 本版(迁就方案):多轮工具调用里那一行行「执行结果」,是用一个小模型(DeepSeek-V4-Flash)扮演"运行环境"现编的,不是真在沙箱里跑代码得到的。选它纯粹是为了省成本、快——真实执行需要一整套"边生成边在真沙箱里跑、再把结果喂回 teacher 继续"的 agentic harness,慢且重;模拟一遍过就行。这是工程上的取舍,不是因为它更好。
    • 代价(sim-to-real gap):模拟结果可能是错的(flash 会乐观地编一句"测试通过",但那段代码真跑其实会挂)→ 模型可能从"假成功"的轨迹里学到东西,甚至养成"自己瞎编工具返回值"的倾向。
    • 最优方案(下一版补)= 真沙箱执行 + 拒绝采样:每一步工具调用都在真实环境里跑出真结果,再用判官只保留"真正把任务做对"的轨迹、扔掉失败的,从根上消除"假成功"。这条管线我们已经实现(真沙箱 + DS 判官),但本版数据未纳入,下一版蒸馏会用它重做。
    • 注:模拟 ≠ 拒绝采样——模拟是"怎么拿到 observation"(编 vs 真跑),拒绝采样是"按真实结果筛掉做错的";模拟因为没真跑,反而让拒绝采样无从谈起。

文件 / Files

  • *.safetensors — BF16 合并权重(SGLang / vLLM / transformers)
  • gguf/ — 4 档原生 MTP GGUF(都焊 MTP;加 --spec-type draft-mtp 单流最快,不加即当普通 gguf 用,MTP 头不激活):
    • …-MTP-Q4_K_M-imatrix.gguf(~16 GB,最快)
    • …-MTP-Q5_K_M-imatrix.gguf(18.2 GB)
    • …-MTP-Q6_K-imatrix.gguf(20.9 GB)
    • …-MTP-Q8_0.gguf(29 GB,≈ BF16 质量,无 imatrix)
  • FP8(block-128 e4m3 + 原生 MTP,SGLang serving)在独立仓 Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8
  • **NVFP4 W4A16 风格**(质量优先,vLLM/SGLang 高并发)—— language MLP gate/up/down_proj NVFP4,其余模块保高精度。GPQA 82.83% / MMLU 87.80%。单流加速请用 GGUF + MTP。

推理 / Inference

thinking-on,务必 temp=0.6, top_p=0.95(切勿 greedy)。llama.cpp 用 gguf + --jinja(MTP 版加 --spec-type draft-mtp --spec-draft-n-max 3);SGLang/vLLM 用 safetensors。

Downloads last month
5,911
Safetensors
Model size
27B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for nerkyor/Qwen3.6-27B-DSV4Pro-Thinking-Distill

Base model

Qwen/Qwen3.6-27B
Quantized
(521)
this model
Quantizations
2 models

Papers for nerkyor/Qwen3.6-27B-DSV4Pro-Thinking-Distill