GLM-5.2 NVFP4 + MTP (NEXTN): layer-78 graft diff
This is a diff/overlay on top of Mapika/GLM-5.2-NVFP4 that enables MTP / NEXTN speculative decoding for GLM-5.2 in NVFP4 precision. It is not a standalone full checkpoint.
What this is
Mapika/GLM-5.2-NVFP4 ships layers 0–77 in NVFP4 but omits layer 78 (the MTP/NEXTN draft layer). Its config.json
already declares num_nextn_predict_layers=1, yet SGLang silently skips the missing draft weights, so MTP is
effectively inert (garbage drafts, no error). This repo adds exactly that missing layer.
- model-00046-of-00046.safetensors (~19.9 GB, BF16): layer 78, the MTP draft decoder. Grafted from the original BF16 checkpoint zai-org/GLM-5.2 (shards model-00270–00274-of-00282). All 791 tensors (790 BF16 + 1 fp32 e_score_correction_bias) are BF16.
- model.safetensors.index.json: rebuilt so the 791 layer-78 weights point at the new shard-46 (total_size updated 439842893136 → 459748734800).
- config.json, generation_config.json, chat_template.jinja, hf_quant_config.json, tokenizer files: copies from Mapika for standalone usability.
- benchmark/: GSM8K accuracy + concurrency sweep results (see below).
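The index rebuild described above can be sanity-checked with a short script. This is a minimal sketch, assuming the usual Hugging Face sharded-index layout (a `metadata.total_size` field plus a `weight_map` from tensor name to shard file); the helper names `tensors_in_shard` and `load_index` are hypothetical and not part of this repo.

```python
import json
from pathlib import Path

SHARD_46 = "model-00046-of-00046.safetensors"
OLD_TOTAL_SIZE = 439842893136  # Mapika index, value quoted in this README
NEW_TOTAL_SIZE = 459748734800  # rebuilt index, value quoted in this README


def tensors_in_shard(index: dict, shard_name: str) -> list[str]:
    """Return the tensor names whose weight_map entry points at shard_name."""
    return sorted(
        name for name, shard in index["weight_map"].items() if shard == shard_name
    )


def load_index(model_dir: str) -> dict:
    """Read model.safetensors.index.json from a checkpoint directory."""
    return json.loads((Path(model_dir) / "model.safetensors.index.json").read_text())


# The total_size delta should roughly match the size of the added shard.
added_bytes = NEW_TOTAL_SIZE - OLD_TOTAL_SIZE
print(f"added: {added_bytes} bytes = {added_bytes / 1e9:.1f} GB")
# prints: added: 19905841664 bytes = 19.9 GB
```

On an assembled directory, `len(tensors_in_shard(load_index("./glm52-nvfp4"), SHARD_46))` should come out to 791 if the rebuilt index is in place.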
Why layer 78 must be BF16
SGLang (glm4_moe_nextn.py, deepseek_nextn.py) forces quant_config=None on the draft decoder when the quantization is modelopt_fp4,
so the MTP MoE experts must be BF16: an NVFP4/FP8 layer 78 would crash on the BF16 FusedMoE slot (uint8/fp8 mismatch).
This is why Mapika's repo loads fine but MTP is inert, and why this graft uses the BF16 original rather than a
quantized layer 78.
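To confirm that a layer-78 shard really is unquantized before serving, the dtypes can be read straight from the safetensors header without loading any weights. The sketch below relies only on the documented safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the function names are hypothetical, and the set of dtypes treated as acceptable (`BF16`, `F32`) is an assumption based on the tensor breakdown given earlier.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path: str) -> dict:
    """Parse only the JSON header of a .safetensors file (no tensor data is read)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        return json.loads(f.read(header_len))


def dtype_counts(header: dict) -> Counter:
    """Count tensors per dtype string, skipping the optional __metadata__ entry."""
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


def quantized_tensors(header: dict) -> list[str]:
    """List tensors stored in a dtype other than BF16/F32 (assumed allow-list)."""
    allowed = {"BF16", "F32"}
    return sorted(
        name
        for name, entry in header.items()
        if name != "__metadata__" and entry["dtype"] not in allowed
    )
```

Running `quantized_tensors(read_safetensors_header("./glm52-nvfp4/model-00046-of-00046.safetensors"))` should return an empty list for this graft; any `U8` or `F8_E4M3` entry would point to a quantized layer 78 of the kind described above.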
How to use: assemble with the Mapika repo
The full checkpoint = Mapika's 45 shards (layers 0–77, NVFP4) + this shard-46 (layer 78, BF16) + the rebuilt index.
# 1. Pull the full Mapika repo (provides model-00001..00045 NVFP4 shards)
huggingface-cli download Mapika/GLM-5.2-NVFP4 \
  --local-dir ./glm52-nvfp4 --local-dir-use-symlinks False

# 2. Overlay this diff: copy shard-46 + the rebuilt index over it
huggingface-cli download sant1an/GLM-5.2-NVFP4-MTP \
  --local-dir ./glm52-nvfp4-diff --local-dir-use-symlinks False
cp ./glm52-nvfp4-diff/model-00046-of-00046.safetensors ./glm52-nvfp4/
cp ./glm52-nvfp4-diff/model.safetensors.index.json ./glm52-nvfp4/  # overwrites Mapika index
# config/tokenizer already match; copy if you started from a bare Mapika dir.

# 3. Serve with MTP enabled (self-spec: draft path = model path)
python -m sglang.launch_server \
  --model-path ./glm52-nvfp4 \
  --quantization modelopt_fp4 \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm NEXTN \
  --context-length 32768  # use a large context for production; 32768 was the bench setting
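Before launching the server in step 3, it can help to confirm that the overlay in step 2 left no gaps, since a missing shard is exactly the failure mode this repo exists to fix. A minimal sketch, assuming the standard `weight_map` layout of `model.safetensors.index.json`; the helper name `missing_shards` is hypothetical.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """Return shard files named in the index but absent from model_dir."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    referenced = set(index["weight_map"].values())
    return sorted(name for name in referenced if not (root / name).is_file())
```

For a correctly assembled checkpoint, `missing_shards("./glm52-nvfp4")` returns an empty list; if `model-00046-of-00046.safetensors` shows up in the result, the index was overwritten but the shard was not copied.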
NEXTN resolves to an EAGLE worker internally (server_info.speculative_algorithm == "EAGLE",
speculative_draft_model_path auto-points at the model path). Do not set --speculative-draft-model-path.
--moe-runner-backend flashinfer_cutlass propagates to the draft MoE; if the BF16 draft MoE misbehaves, try
--speculative-moe-runner-backend triton.
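The server-info fields mentioned above can be checked programmatically once the server is up. The following is a rough sketch only: the `/get_server_info` path, the default port 30000, and the assumption that the two fields appear at the top level of the returned JSON may differ between SGLang versions, and both function names are hypothetical.

```python
import json
import urllib.request


def fetch_server_info(base_url: str = "http://localhost:30000") -> dict:
    """GET the server-info JSON (endpoint path and port are assumptions)."""
    with urllib.request.urlopen(f"{base_url}/get_server_info", timeout=10) as resp:
        return json.load(resp)


def mtp_looks_enabled(info: dict, model_path: str) -> bool:
    """True if the server reports an EAGLE worker drafting from the model itself."""
    return (
        info.get("speculative_algorithm") == "EAGLE"
        and info.get("speculative_draft_model_path") == model_path
    )
```

With the launch command from step 3, `mtp_looks_enabled(fetch_server_info(), "./glm52-nvfp4")` is expected to be true; a false result suggests the NEXTN flag did not take effect or the draft path was overridden.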
Verified results (2026-06-25, 8×B200)
- MTP active: spec_accept_length ≈ 3.0–3.5, spec_accept_rate ≈ 0.65, num_draft_tokens = 4 across real traffic.
- GSM8K: 95.5% (5-shot, 200q, 0 invalid). MTP is accuracy-neutral (verified-token threshold = 1.0).
- Throughput: decode-bound ceiling ≈ 2900 tok/s / 11 req/s; accept_length ≈ 3.5 stable across concurrency.
Full table + caveats in benchmark/summary.md.
Credits
- Base NVFP4 weights: Mapika/GLM-5.2-NVFP4 (NVFP4 quantization of GLM-5.2).
- Layer-78 source (BF16): zai-org/GLM-5.2.
- Serving: SGLang.
This diff is released under the same MIT license as the underlying GLM-5.2 weights. The added layer-78 weights are unmodified tensors sourced from the official BF16 GLM-5.2 release.
✨ Presented to you with Mind Lab - A Lab for Experiential Intelligence.