Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8

🇬🇧 English · 🇨🇳 中文 ⬇️


🇬🇧 English

FP8 (block-128 e4m3) + native MTP build of Qwen3.6-27B-DSV4Pro-Thinking-Distill — for SGLang / vLLM serving (especially concurrent serving), where FP8 tensor cores + batching shine.

For the distillation method, attribution, eval protocol and limitations, see the parent BF16 model card. This card only covers the FP8 build.

What this is

  • FP8 block-128, e4m3, dynamic activation quantization of the distilled trunk — same scheme as the official Qwen/Qwen3.6-27B-FP8 (weight_scale_inv per 128×128 block). Norms, embeddings, lm_head, the small Gated-DeltaNet projections, and the vision tower are kept BF16 (matching the official layout).
  • Native MTP (nextn) head bundled (mtp.safetensors) for EAGLE/NEXTN speculative decoding.
  • Near-lossless by construction — block-128 FP8 is the same precision the Qwen team ships as "performance nearly identical to the original".

Speed — distill FP8 measured on DGX Spark (GB10, Blackwell sm_121), SGLang dev-cu13 + flashinfer + EAGLE/NEXTN

Mode Throughput MTP accept rate accept length
single-stream, no MTP (base) ~7.7 tok/s
single-stream, MTP on ~15 tok/s (→ ~19 on code) 0.65–0.89 2.6–3.55
concurrent ×16 — official-FP8 ref ≈136–143 tok/s aggregate

MTP gives ~2× single-stream (base 7.7 → 15 tok/s, up to ~19 / accept 0.89 on predictable code). All measured on this distilled FP8 (SGLang dev-cu13, EAGLE topk=1 / 3 draft / mamba extra_buffer); generation correct, no FP8 corruption. The concurrent figure is from the structurally-identical official Qwen/Qwen3.6-27B-FP8 (distill concurrent not separately benchmarked).

Where FP8+MTP wins: single-stream on a bandwidth-bound box is not its arena (a Q4_K_M-imatrix GGUF is faster there ~25 tok/s) — FP8+MTP's strength is concurrent serving: FP8 tensor cores + batching + near-lossless quality. On Blackwell RTX-50 (faster FP8) it's faster still.

Quality — formal same-harness FP8 eval

Both this distilled FP8 and the official base FP8 (Qwen/Qwen3.6-27B-FP8) were run through the identical streaming harness on DGX Spark (SGLang dev-cu13, thinking-on, temperature 0.6 / top_p 0.95, context 36864): GPQA-Diamond full-198 and MMLU-500 5-shot.

Dimension Distill FP8 Base FP8 Δ
GPQA-Diamond-198 82.32% (163/198) 66.67% (132/198) +15.65
MMLU-500 (5-shot) 90.00% (450/500) 88.60% (443/500) +1.40
GPQA finish=length (runaway) 0 12

The distilled FP8 is +15.65 pts on GPQA over the un-distilled base — and the headline is convergence: 0 length-truncations vs the base's 12. On hard GPQA items the un-distilled base lets its chain-of-thought run away until it slams into the token ceiling (those 12 unfinished answers are scored wrong); the distill was trained to rein exactly that in, while staying ahead on MMLU too. FP8 itself is near-lossless — numbers track the parent BF16 distill (canonical GPQA 80.81 / MMLU 91.8). Prior canonical reference (not re-run under FP8): coding-100 86 vs 83, Agentic SOLO-20 16 vs 13.

Serving (SGLang, validated launch)

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 \
  --speculative-algorithm EAGLE --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --mamba-scheduler-strategy extra_buffer \
  --attention-backend flashinfer --trust-remote-code
  • Requires SGLang ≥ 0.5.10 (Qwen3.6). The extra_buffer mamba scheduler + SPEC_V2 are required for speculative decoding on the Gated-DeltaNet (mamba-style) layers. On Blackwell use --attention-backend flashinfer.
  • thinking-on, temperature=0.6, top_p=0.95 (never greedy).

Files

  • model-*-of-trunk.safetensors — FP8 (e4m3, block-128) distilled trunk + BF16 norms/embed/lm_head
  • mtp.safetensors — native nextn MTP head (FP8)
  • model-vision.safetensors — vision tower (BF16; unused for text)
  • config.json carries the quantization_config (fp8 e4m3)


🇨🇳 中文版

Qwen3.6-27B-DSV4Pro-Thinking-DistillFP8(block-128 e4m3)+ 原生 MTP 版 —— 面向 SGLang / vLLM 服务化(尤其并发 serving),FP8 张量核 + 批量在这里发挥。

蒸馏方法、出处归因、评测口径、局限父模型(BF16)卡片。本卡只讲 FP8 版。

这是什么

  • FP8 block-128、e4m3、动态激活量化蒸馏主干 —— 与官方 Qwen/Qwen3.6-27B-FP8 同方案(每 128×128 块一个 weight_scale_inv)。norm、embedding、lm_head、Gated-DeltaNet 小投影、vision 塔保持 BF16(对齐官方布局)。
  • 打包原生 MTP(nextn)头(mtp.safetensors),供 EAGLE/NEXTN 投机解码。
  • 构造上近无损 —— block-128 FP8 正是 Qwen 官方"性能与原模型几乎一致"的同精度。

速度 —— 蒸馏 FP8 实测 DGX Spark(GB10,Blackwell sm_121),SGLang dev-cu13 + flashinfer + EAGLE/NEXTN

模式 吞吐 MTP 接受率 接受长度
单流,关 MTP(base) ~7.7 tok/s
单流,开 MTP ~15 tok/s(代码段 → ~19) 0.65–0.89 2.6–3.55
并发 ×16 —— 官方 FP8 参考 ≈136–143 tok/s 聚合

MTP 单流约 2× 加速(base 7.7 → 15 tok/s,代码这类可预测内容升到 ~19 / 接受率 0.89)。均为本蒸馏 FP8 实测(SGLang dev-cu13,EAGLE topk=1 / 3 draft / mamba extra_buffer):生成正确、无 FP8 损坏。并发数来自结构一致的官方 Qwen/Qwen3.6-27B-FP8(蒸馏版并发未单独实测)。

FP8+MTP 的主场:在带宽受限的机器上单流不是它的强项(那里 Q4_K_M-imatrix GGUF 更快 ~25 tok/s)—— 它强在并发 serving:FP8 核 + 批量 + 近无损质量。Blackwell RTX-50(FP8 更快)上更快。

质量 —— 同 harness 正式 FP8 评测

蒸馏 FP8官方 base FP8(Qwen/Qwen3.6-27B-FP8)跑完全相同的流式 harness(DGX Spark,SGLang dev-cu13,thinking-on,temperature 0.6 / top_p 0.95,context 36864):GPQA-Diamond 全 198 + MMLU-500 5-shot。

维度 蒸馏 FP8 Base FP8 Δ
GPQA-Diamond-198 82.32% (163/198) 66.67% (132/198) +15.65
MMLU-500 (5-shot) 90.00% (450/500) 88.60% (443/500) +1.40
GPQA finish=length(跑飞) 0 12

蒸馏 FP8 在 GPQA 上 +15.65 分,而最亮的是收口:0 次长度截断 vs base 的 12 次。难题上未蒸馏的 base 思维链收不住、一路撞到 token 上限(那 12 个没答完的判错);蒸馏正是为修这个而训,同时 MMLU 也保持领先。FP8 本身近无损——数据贴合父 BF16 蒸馏版(canonical GPQA 80.81 / MMLU 91.8)。早前 canonical 参考(未在 FP8 下重跑):coding-100 86 vs 83,Agentic SOLO-20 16 vs 13。

服务化(SGLang,已验证启动)

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 \
  --speculative-algorithm EAGLE --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --mamba-scheduler-strategy extra_buffer \
  --attention-backend flashinfer --trust-remote-code
  • SGLang ≥ 0.5.10(Qwen3.6)。extra_buffer mamba 调度 + SPEC_V2 是 Gated-DeltaNet(mamba 式)层做投机解码的必需项;Blackwell 用 --attention-backend flashinfer
  • thinking-on,temperature=0.6, top_p=0.95(切勿 greedy)。

文件

  • model-*-of-trunk.safetensors —— FP8(e4m3, block-128)蒸馏主干 + BF16 norm/embed/lm_head
  • mtp.safetensors —— 原生 nextn MTP 头(FP8)
  • model-vision.safetensors —— vision 塔(BF16,文本不用)
  • config.jsonquantization_config(fp8 e4m3)

Claude Code(实验性 / experimental)

Claude Code 不是本地 GGUF/HF 加载器——它对接兼容 Anthropic /v1/messages 的后端、且要求可靠的工具调用,不能直接喂仓名/路径。本 FP8 仓是 vLLM 路线的直接载体

vllm serve nerkyor/Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 \
  --served-model-name qwen36-27b-fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml

Claude Code 里模型名填 qwen36-27b-fp8(即 --served-model-name不要填带 / 的仓名),ANTHROPIC_BASE_URL 指向你的 vLLM 端点。要 GGUF + LM Studio 路线见 **GGUF 仓**。参考 vLLM Claude Code


Claude Code (experimental)

Claude Code is not a local GGUF/HF loader — it talks to an Anthropic-compatible /v1/messages backend and needs reliable tool calling, so you cannot point it at a repo id directly. This FP8 repo is the direct vLLM path — serve it with --enable-auto-tool-choice --tool-call-parser qwen3_xml --reasoning-parser qwen3 (command above), then set Claude Code's model name to the --served-model-name (no slashes) and ANTHROPIC_BASE_URL to your vLLM endpoint. For the GGUF + LM Studio route, see the GGUF repo.

Downloads last month
288
Safetensors
Model size
28B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for nerkyor/Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8

Base model

Qwen/Qwen3.6-27B
Quantized
(3)
this model