Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8
🇬🇧 English · 🇨🇳 中文 ⬇️
🇬🇧 English
FP8 (block-128 e4m3) + native MTP build of Qwen3.6-27B-DSV4Pro-Thinking-Distill — for SGLang / vLLM serving (especially concurrent serving), where FP8 tensor cores + batching shine.
For the distillation method, attribution, eval protocol and limitations, see the parent BF16 model card. This card only covers the FP8 build.
What this is
- FP8 block-128, e4m3, dynamic activation quantization of the distilled trunk — same scheme as the official
Qwen/Qwen3.6-27B-FP8(weight_scale_invper 128×128 block). Norms, embeddings,lm_head, the small Gated-DeltaNet projections, and the vision tower are kept BF16 (matching the official layout). - Native MTP (nextn) head bundled (
mtp.safetensors) for EAGLE/NEXTN speculative decoding. - Near-lossless by construction — block-128 FP8 is the same precision the Qwen team ships as "performance nearly identical to the original".
Speed — distill FP8 measured on DGX Spark (GB10, Blackwell sm_121), SGLang dev-cu13 + flashinfer + EAGLE/NEXTN
| Mode | Throughput | MTP accept rate | accept length |
|---|---|---|---|
| single-stream, no MTP (base) | ~7.7 tok/s | — | — |
| single-stream, MTP on | ~15 tok/s (→ ~19 on code) | 0.65–0.89 | 2.6–3.55 |
| concurrent ×16 — official-FP8 ref | ≈136–143 tok/s aggregate | — | — |
MTP gives ~2× single-stream (base 7.7 → 15 tok/s, up to ~19 / accept 0.89 on predictable code). All measured on this distilled FP8 (SGLang
dev-cu13, EAGLE topk=1 / 3 draft / mambaextra_buffer); generation correct, no FP8 corruption. The concurrent figure is from the structurally-identical officialQwen/Qwen3.6-27B-FP8(distill concurrent not separately benchmarked).
Where FP8+MTP wins: single-stream on a bandwidth-bound box is not its arena (a Q4_K_M-imatrix GGUF is faster there ~25 tok/s) — FP8+MTP's strength is concurrent serving: FP8 tensor cores + batching + near-lossless quality. On Blackwell RTX-50 (faster FP8) it's faster still.
Quality — formal same-harness FP8 eval
Both this distilled FP8 and the official base FP8 (Qwen/Qwen3.6-27B-FP8) were run through the identical streaming harness on DGX Spark (SGLang dev-cu13, thinking-on, temperature 0.6 / top_p 0.95, context 36864): GPQA-Diamond full-198 and MMLU-500 5-shot.
| Dimension | Distill FP8 | Base FP8 | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 82.32% (163/198) | 66.67% (132/198) | +15.65 |
| MMLU-500 (5-shot) | 90.00% (450/500) | 88.60% (443/500) | +1.40 |
GPQA finish=length (runaway) |
0 | 12 | — |
The distilled FP8 is +15.65 pts on GPQA over the un-distilled base — and the headline is convergence: 0 length-truncations vs the base's 12. On hard GPQA items the un-distilled base lets its chain-of-thought run away until it slams into the token ceiling (those 12 unfinished answers are scored wrong); the distill was trained to rein exactly that in, while staying ahead on MMLU too. FP8 itself is near-lossless — numbers track the parent BF16 distill (canonical GPQA 80.81 / MMLU 91.8). Prior canonical reference (not re-run under FP8): coding-100 86 vs 83, Agentic SOLO-20 16 vs 13.
Serving (SGLang, validated launch)
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
--model-path Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 \
--speculative-algorithm EAGLE --speculative-num-steps 3 \
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--mamba-scheduler-strategy extra_buffer \
--attention-backend flashinfer --trust-remote-code
- Requires SGLang ≥ 0.5.10 (Qwen3.6). The
extra_buffermamba scheduler +SPEC_V2are required for speculative decoding on the Gated-DeltaNet (mamba-style) layers. On Blackwell use--attention-backend flashinfer. - thinking-on,
temperature=0.6, top_p=0.95(never greedy).
Files
model-*-of-trunk.safetensors— FP8 (e4m3, block-128) distilled trunk + BF16 norms/embed/lm_headmtp.safetensors— native nextn MTP head (FP8)model-vision.safetensors— vision tower (BF16; unused for text)config.jsoncarries thequantization_config(fp8 e4m3)
🇨🇳 中文版
Qwen3.6-27B-DSV4Pro-Thinking-Distill 的 FP8(block-128 e4m3)+ 原生 MTP 版 —— 面向 SGLang / vLLM 服务化(尤其并发 serving),FP8 张量核 + 批量在这里发挥。
蒸馏方法、出处归因、评测口径、局限见父模型(BF16)卡片。本卡只讲 FP8 版。
这是什么
- FP8 block-128、e4m3、动态激活量化蒸馏主干 —— 与官方
Qwen/Qwen3.6-27B-FP8同方案(每 128×128 块一个weight_scale_inv)。norm、embedding、lm_head、Gated-DeltaNet 小投影、vision 塔保持 BF16(对齐官方布局)。 - 打包原生 MTP(nextn)头(
mtp.safetensors),供 EAGLE/NEXTN 投机解码。 - 构造上近无损 —— block-128 FP8 正是 Qwen 官方"性能与原模型几乎一致"的同精度。
速度 —— 蒸馏 FP8 实测 DGX Spark(GB10,Blackwell sm_121),SGLang dev-cu13 + flashinfer + EAGLE/NEXTN
| 模式 | 吞吐 | MTP 接受率 | 接受长度 |
|---|---|---|---|
| 单流,关 MTP(base) | ~7.7 tok/s | — | — |
| 单流,开 MTP | ~15 tok/s(代码段 → ~19) | 0.65–0.89 | 2.6–3.55 |
| 并发 ×16 —— 官方 FP8 参考 | ≈136–143 tok/s 聚合 | — | — |
MTP 单流约 2× 加速(base 7.7 → 15 tok/s,代码这类可预测内容升到 ~19 / 接受率 0.89)。均为本蒸馏 FP8 实测(SGLang
dev-cu13,EAGLE topk=1 / 3 draft / mambaextra_buffer):生成正确、无 FP8 损坏。并发数来自结构一致的官方Qwen/Qwen3.6-27B-FP8(蒸馏版并发未单独实测)。
FP8+MTP 的主场:在带宽受限的机器上单流不是它的强项(那里 Q4_K_M-imatrix GGUF 更快 ~25 tok/s)—— 它强在并发 serving:FP8 核 + 批量 + 近无损质量。Blackwell RTX-50(FP8 更快)上更快。
质量 —— 同 harness 正式 FP8 评测
本蒸馏 FP8 与官方 base FP8(Qwen/Qwen3.6-27B-FP8)跑完全相同的流式 harness(DGX Spark,SGLang dev-cu13,thinking-on,temperature 0.6 / top_p 0.95,context 36864):GPQA-Diamond 全 198 + MMLU-500 5-shot。
| 维度 | 蒸馏 FP8 | Base FP8 | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 82.32% (163/198) | 66.67% (132/198) | +15.65 |
| MMLU-500 (5-shot) | 90.00% (450/500) | 88.60% (443/500) | +1.40 |
GPQA finish=length(跑飞) |
0 | 12 | — |
蒸馏 FP8 在 GPQA 上 +15.65 分,而最亮的是收口:0 次长度截断 vs base 的 12 次。难题上未蒸馏的 base 思维链收不住、一路撞到 token 上限(那 12 个没答完的判错);蒸馏正是为修这个而训,同时 MMLU 也保持领先。FP8 本身近无损——数据贴合父 BF16 蒸馏版(canonical GPQA 80.81 / MMLU 91.8)。早前 canonical 参考(未在 FP8 下重跑):coding-100 86 vs 83,Agentic SOLO-20 16 vs 13。
服务化(SGLang,已验证启动)
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
--model-path Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 \
--speculative-algorithm EAGLE --speculative-num-steps 3 \
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--mamba-scheduler-strategy extra_buffer \
--attention-backend flashinfer --trust-remote-code
- 需 SGLang ≥ 0.5.10(Qwen3.6)。
extra_buffermamba 调度 +SPEC_V2是 Gated-DeltaNet(mamba 式)层做投机解码的必需项;Blackwell 用--attention-backend flashinfer。 - thinking-on,
temperature=0.6, top_p=0.95(切勿 greedy)。
文件
model-*-of-trunk.safetensors—— FP8(e4m3, block-128)蒸馏主干 + BF16 norm/embed/lm_headmtp.safetensors—— 原生 nextn MTP 头(FP8)model-vision.safetensors—— vision 塔(BF16,文本不用)config.json带quantization_config(fp8 e4m3)
Claude Code(实验性 / experimental)
Claude Code 不是本地 GGUF/HF 加载器——它对接兼容 Anthropic /v1/messages 的后端、且要求可靠的工具调用,不能直接喂仓名/路径。本 FP8 仓是 vLLM 路线的直接载体:
vllm serve nerkyor/Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 \
--served-model-name qwen36-27b-fp8 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml
Claude Code 里模型名填 qwen36-27b-fp8(即 --served-model-name,不要填带 / 的仓名),ANTHROPIC_BASE_URL 指向你的 vLLM 端点。要 GGUF + LM Studio 路线见 **GGUF 仓**。参考 vLLM Claude Code。
Claude Code (experimental)
Claude Code is not a local GGUF/HF loader — it talks to an Anthropic-compatible /v1/messages backend and needs reliable tool calling, so you cannot point it at a repo id directly. This FP8 repo is the direct vLLM path — serve it with --enable-auto-tool-choice --tool-call-parser qwen3_xml --reasoning-parser qwen3 (command above), then set Claude Code's model name to the --served-model-name (no slashes) and ANTHROPIC_BASE_URL to your vLLM endpoint. For the GGUF + LM Studio route, see the GGUF repo.
- Downloads last month
- 288
Model tree for nerkyor/Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8
Base model
Qwen/Qwen3.6-27B