Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8

🇬🇧 English · 🇨🇳 中文 ⬇️

🇬🇧 English

FP8 (block-128 e4m3) + native MTP build of Qwen3.6-27B-DSV4Pro-Thinking-Distill — for SGLang / vLLM serving (especially concurrent serving), where FP8 tensor cores + batching shine.

For the distillation method, attribution, eval protocol and limitations, see the parent BF16 model card. This card only covers the FP8 build.

What this is

FP8 block-128, e4m3, dynamic activation quantization of the distilled trunk — same scheme as the official Qwen/Qwen3.6-27B-FP8 (weight_scale_inv per 128×128 block). Norms, embeddings, lm_head, the small Gated-DeltaNet projections, and the vision tower are kept BF16 (matching the official layout).
Native MTP (nextn) head bundled (mtp.safetensors) for EAGLE/NEXTN speculative decoding.
Near-lossless by construction — block-128 FP8 is the same precision the Qwen team ships as "performance nearly identical to the original".

Speed — distill FP8 measured on DGX Spark (GB10, Blackwell sm_121), SGLang `dev-cu13` + flashinfer + EAGLE/NEXTN

Mode	Throughput	MTP accept rate	accept length
single-stream, no MTP (base)	~7.7 tok/s	—	—
single-stream, MTP on	~15 tok/s (→ ~19 on code)	0.65–0.89	2.6–3.55
concurrent ×16 — official-FP8 ref	≈136–143 tok/s aggregate	—	—

MTP gives ~2× single-stream (base 7.7 → 15 tok/s, up to ~19 / accept 0.89 on predictable code). All measured on this distilled FP8 (SGLang dev-cu13, EAGLE topk=1 / 3 draft / mamba extra_buffer); generation correct, no FP8 corruption. The concurrent figure is from the structurally-identical official Qwen/Qwen3.6-27B-FP8 (distill concurrent not separately benchmarked).

Where FP8+MTP wins: single-stream on a bandwidth-bound box is not its arena (a Q4_K_M-imatrix GGUF is faster there ~25 tok/s) — FP8+MTP's strength is concurrent serving: FP8 tensor cores + batching + near-lossless quality. On Blackwell RTX-50 (faster FP8) it's faster still.

Quality — formal same-harness FP8 eval

Both this distilled FP8 and the official base FP8 (Qwen/Qwen3.6-27B-FP8) were run through the identical streaming harness on DGX Spark (SGLang dev-cu13, thinking-on, temperature 0.6 / top_p 0.95, context 36864): GPQA-Diamond full-198 and MMLU-500 5-shot.

Dimension	Distill FP8	Base FP8	Δ
GPQA-Diamond-198	82.32% (163/198)	66.67% (132/198)	+15.65
MMLU-500 (5-shot)	90.00% (450/500)	88.60% (443/500)	+1.40
GPQA `finish=length` (runaway)	0	12	—

The distilled FP8 is +15.65 pts on GPQA over the un-distilled base — and the headline is convergence: 0 length-truncations vs the base's 12. On hard GPQA items the un-distilled base lets its chain-of-thought run away until it slams into the token ceiling (those 12 unfinished answers are scored wrong); the distill was trained to rein exactly that in, while staying ahead on MMLU too. FP8 itself is near-lossless — numbers track the parent BF16 distill (canonical GPQA 80.81 / MMLU 91.8). Prior canonical reference (not re-run under FP8): coding-100 86 vs 83, Agentic SOLO-20 16 vs 13.

Serving (SGLang, validated launch)

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 \
  --speculative-algorithm EAGLE --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --mamba-scheduler-strategy extra_buffer \
  --attention-backend flashinfer --trust-remote-code

Requires SGLang ≥ 0.5.10 (Qwen3.6). The extra_buffer mamba scheduler + SPEC_V2 are required for speculative decoding on the Gated-DeltaNet (mamba-style) layers. On Blackwell use --attention-backend flashinfer.
thinking-on, temperature=0.6, top_p=0.95 (never greedy).

Files

model-*-of-trunk.safetensors — FP8 (e4m3, block-128) distilled trunk + BF16 norms/embed/lm_head
mtp.safetensors — native nextn MTP head (FP8)
model-vision.safetensors — vision tower (BF16; unused for text)
config.json carries the quantization_config (fp8 e4m3)

🇨🇳 中文版

Qwen3.6-27B-DSV4Pro-Thinking-Distill 的 FP8(block-128 e4m3)+ 原生 MTP 版 —— 面向 SGLang / vLLM 服务化(尤其并发 serving),FP8 张量核 + 批量在这里发挥。

蒸馏方法、出处归因、评测口径、局限见父模型(BF16)卡片。本卡只讲 FP8 版。

这是什么

FP8 block-128、e4m3、动态激活量化蒸馏主干 —— 与官方 Qwen/Qwen3.6-27B-FP8 同方案(每 128×128 块一个 weight_scale_inv)。norm、embedding、lm_head、Gated-DeltaNet 小投影、vision 塔保持 BF16(对齐官方布局)。
打包原生 MTP(nextn)头(mtp.safetensors),供 EAGLE/NEXTN 投机解码。
构造上近无损 —— block-128 FP8 正是 Qwen 官方"性能与原模型几乎一致"的同精度。

速度 —— 蒸馏 FP8 实测 DGX Spark(GB10,Blackwell sm_121),SGLang `dev-cu13` + flashinfer + EAGLE/NEXTN

模式	吞吐	MTP 接受率	接受长度
单流,关 MTP(base)	~7.7 tok/s	—	—
单流,开 MTP	~15 tok/s(代码段 → ~19)	0.65–0.89	2.6–3.55
并发 ×16 —— 官方 FP8 参考	≈136–143 tok/s 聚合	—	—

MTP 单流约 2× 加速(base 7.7 → 15 tok/s,代码这类可预测内容升到 ~19 / 接受率 0.89)。均为本蒸馏 FP8 实测(SGLang dev-cu13,EAGLE topk=1 / 3 draft / mamba extra_buffer):生成正确、无 FP8 损坏。并发数来自结构一致的官方 Qwen/Qwen3.6-27B-FP8(蒸馏版并发未单独实测)。

FP8+MTP 的主场:在带宽受限的机器上单流不是它的强项(那里 Q4_K_M-imatrix GGUF 更快 ~25 tok/s)—— 它强在并发 serving:FP8 核 + 批量 + 近无损质量。Blackwell RTX-50(FP8 更快)上更快。

质量 —— 同 harness 正式 FP8 评测

本蒸馏 FP8 与官方 base FP8(Qwen/Qwen3.6-27B-FP8)跑完全相同的流式 harness(DGX Spark,SGLang dev-cu13,thinking-on,temperature 0.6 / top_p 0.95,context 36864):GPQA-Diamond 全 198 + MMLU-500 5-shot。

维度	蒸馏 FP8	Base FP8	Δ
GPQA-Diamond-198	82.32% (163/198)	66.67% (132/198)	+15.65
MMLU-500 (5-shot)	90.00% (450/500)	88.60% (443/500)	+1.40
GPQA `finish=length`(跑飞)	0	12	—

蒸馏 FP8 在 GPQA 上 +15.65 分,而最亮的是收口:0 次长度截断 vs base 的 12 次。难题上未蒸馏的 base 思维链收不住、一路撞到 token 上限(那 12 个没答完的判错);蒸馏正是为修这个而训,同时 MMLU 也保持领先。FP8 本身近无损——数据贴合父 BF16 蒸馏版(canonical GPQA 80.81 / MMLU 91.8)。早前 canonical 参考(未在 FP8 下重跑):coding-100 86 vs 83,Agentic SOLO-20 16 vs 13。

服务化(SGLang,已验证启动)

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 \
  --speculative-algorithm EAGLE --speculative-num-steps 3 \
  --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
  --mamba-scheduler-strategy extra_buffer \
  --attention-backend flashinfer --trust-remote-code

需 SGLang ≥ 0.5.10(Qwen3.6)。extra_buffer mamba 调度 + SPEC_V2 是 Gated-DeltaNet(mamba 式)层做投机解码的必需项;Blackwell 用 --attention-backend flashinfer。
thinking-on,temperature=0.6, top_p=0.95(切勿 greedy)。

文件

model-*-of-trunk.safetensors —— FP8(e4m3, block-128)蒸馏主干 + BF16 norm/embed/lm_head
mtp.safetensors —— 原生 nextn MTP 头(FP8)
model-vision.safetensors —— vision 塔(BF16,文本不用)
config.json 带 quantization_config(fp8 e4m3)

Claude Code（实验性 / experimental）

Claude Code 不是本地 GGUF/HF 加载器——它对接兼容 Anthropic /v1/messages 的后端、且要求可靠的工具调用，不能直接喂仓名/路径。本 FP8 仓是 vLLM 路线的直接载体：

vllm serve nerkyor/Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 \
  --served-model-name qwen36-27b-fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml

Claude Code 里模型名填 qwen36-27b-fp8（即 --served-model-name，不要填带 / 的仓名），ANTHROPIC_BASE_URL 指向你的 vLLM 端点。要 GGUF + LM Studio 路线见 **GGUF 仓**。参考 vLLM Claude Code。

Claude Code (experimental)

Claude Code is not a local GGUF/HF loader — it talks to an Anthropic-compatible /v1/messages backend and needs reliable tool calling, so you cannot point it at a repo id directly. This FP8 repo is the direct vLLM path — serve it with --enable-auto-tool-choice --tool-call-parser qwen3_xml --reasoning-parser qwen3 (command above), then set Claude Code's model name to the --served-model-name (no slashes) and ANTHROPIC_BASE_URL to your vLLM endpoint. For the GGUF + LM Studio route, see the GGUF repo.

Downloads last month: 288

Safetensors

Model size

28B params

Tensor type

BF16

F8_E4M3

Model tree for nerkyor/Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8

Base model

Qwen/Qwen3.6-27B

Quantized

nerkyor/Qwen3.6-27B-DSV4Pro-Thinking-Distill