GLM-5.2 NVFP4 + MTP (NEXTN): layer-78 graft diff

This is a diff/overlay on top of Mapika/GLM-5.2-NVFP4 that enables MTP / NEXTN speculative decoding for GLM-5.2 in NVFP4 precision. It is not a standalone full checkpoint.

What this is

Mapika/GLM-5.2-NVFP4 ships layers 0–77 in NVFP4 but omits layer 78 (the MTP/NEXTN draft layer). Its config.json already declares num_nextn_predict_layers=1, yet SGLang silently skips the missing draft weights, so MTP never works as intended: the drafts are garbage and no error is raised. This repo adds exactly that missing layer.

  • model-00046-of-00046.safetensors (~19.9 GB, BF16): layer 78, the MTP draft decoder. Grafted from the original BF16 checkpoint zai-org/GLM-5.2 (shards model-00270–00274-of-00282). It holds 791 tensors: 790 in BF16 plus one fp32 e_score_correction_bias.
  • model.safetensors.index.json: rebuilt so the 791 layer-78 weights point at the new shard-46 (total_size updated 439842893136 → 459748734800).
  • config.json, generation_config.json, chat_template.jinja, hf_quant_config.json, tokenizer files: copied from Mapika for convenience, so a bare directory can be completed from this repo.
  • benchmark/: GSM8K accuracy + concurrency sweep results (see below).
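
The index change described above can be checked with a few lines of Python. This is a minimal sketch, not part of the repo: it assumes the standard Hugging Face sharded-index layout (a top-level "weight_map" dict from tensor name to shard file), and the helper names are made up for illustration.

```python
from collections import Counter

NEW_SHARD = "model-00046-of-00046.safetensors"
OLD_TOTAL_SIZE = 439842893136  # total_size in the Mapika index, as quoted above
NEW_TOTAL_SIZE = 459748734800  # total_size in the rebuilt index, as quoted above


def tensors_per_shard(index: dict) -> Counter:
    """Count how many tensor names the index maps to each shard file."""
    return Counter(index["weight_map"].values())


def added_bytes(old_total: int, new_total: int) -> int:
    """Bytes added by the graft, taken from the two total_size values."""
    return new_total - old_total


# 19905841664 bytes, i.e. about 19.9 GB, which matches the size of shard-46.
print(added_bytes(OLD_TOTAL_SIZE, NEW_TOTAL_SIZE))
```

To use it on the real file, load model.safetensors.index.json with json.load and look up tensors_per_shard(index)[NEW_SHARD]; per the list above, that count should be 791.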

Why layer 78 must be BF16

SGLang (glm4_moe_nextn.py, deepseek_nextn.py) hard-codes quant_config=None for the draft decoder when running under modelopt_fp4, so the MTP MoE experts must be BF16. An NVFP4 or FP8 layer 78 would crash in the BF16 FusedMoE slot (uint8/fp8 dtype mismatch). This is why Mapika's repo loads fine but MTP is inert, and why this graft uses the BF16 original rather than a quantized layer 78.
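
Before serving, it can be worth confirming that a shard really is BF16 without loading 20 GB of weights. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of tensors) using the standard library. The function name is hypothetical. The dtype labels ("BF16", "F32", "U8", "F8_E4M3") are the ones the safetensors format uses.

```python
import json
import struct
from collections import Counter


def dtype_histogram(path: str) -> Counter:
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        # The file starts with the header size as an unsigned 64-bit little-endian int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return Counter(entry["dtype"] for entry in header.values())
```

For model-00046-of-00046.safetensors the expectation, going by the tensor counts stated in this card, is 790 entries under "BF16" and one under "F32". Any "U8" or "F8_E4M3" entries would point to a quantized layer 78, which is the case described above as crashing.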

How to use β€” assemble with the Mapika repo

The full checkpoint = Mapika's 45 shards (layers 0–77, NVFP4) + this shard-46 (layer 78, BF16) + the rebuilt index.
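
Once the directory has been assembled with the steps that follow, this composition can be verified mechanically. A minimal sketch, assuming the standard "weight_map" layout of model.safetensors.index.json; the function name is made up for illustration.

```python
import json
import os


def missing_shards(model_dir: str) -> list[str]:
    """Return shard files named in the index that are absent from model_dir."""
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    shards = sorted(set(weight_map.values()))
    return [s for s in shards if not os.path.isfile(os.path.join(model_dir, s))]
```

An empty list from missing_shards("./glm52-nvfp4") means every shard the rebuilt index refers to is on disk. If model-00046-of-00046.safetensors shows up in the result, the overlay copy was skipped.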

# 1. Pull the full Mapika repo (provides model-00001..00045 NVFP4 shards)
huggingface-cli download Mapika/GLM-5.2-NVFP4 \
  --local-dir ./glm52-nvfp4 --local-dir-use-symlinks False

# 2. Overlay this diff: copy shard-46 + the rebuilt index over it
huggingface-cli download sant1an/GLM-5.2-NVFP4-MTP \
  --local-dir ./glm52-nvfp4-diff --local-dir-use-symlinks False
cp ./glm52-nvfp4-diff/model-00046-of-00046.safetensors  ./glm52-nvfp4/
cp ./glm52-nvfp4-diff/model.safetensors.index.json       ./glm52-nvfp4/   # overwrites Mapika index
# config/tokenizer already match; copy if you started from a bare Mapika dir.

# 3. Serve with MTP enabled (self-spec: draft path = model path)
python -m sglang.launch_server \
  --model-path ./glm52-nvfp4 \
  --quantization modelopt_fp4 \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm NEXTN \
  --context-length 32768   # use a larger context for production; 32768 was the bench setting

NEXTN resolves to an EAGLE worker internally (server_info.speculative_algorithm == "EAGLE", speculative_draft_model_path auto-points at the model path). Do not set --speculative-draft-model-path. --moe-runner-backend flashinfer_cutlass propagates to the draft MoE; if the BF16 draft MoE misbehaves, try --speculative-moe-runner-backend triton.
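
The two server_info fields mentioned above give a quick way to confirm that the server actually came up with MTP enabled. The sketch below is a rough check under stated assumptions: that the server exposes its info as JSON at /get_server_info on port 30000 (this may differ between SGLang versions), and that the reported draft path is the same string passed to --model-path. Both function names are hypothetical.

```python
import json
import urllib.request


def mtp_is_configured(server_info: dict, model_path: str) -> bool:
    """True if the reported settings match the NEXTN self-speculation setup."""
    return (
        server_info.get("speculative_algorithm") == "EAGLE"
        and server_info.get("speculative_draft_model_path") == model_path
    )


def fetch_server_info(base_url: str = "http://localhost:30000") -> dict:
    """Fetch server info; the endpoint path and default port are assumptions."""
    with urllib.request.urlopen(f"{base_url}/get_server_info") as resp:
        return json.load(resp)
```

With a running server, mtp_is_configured(fetch_server_info(), "./glm52-nvfp4") should return True. Note that this only confirms configuration. Whether the drafts are any good shows up in spec_accept_length, not here.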

Verified results (2026-06-25, 8×B200)

  • MTP active: spec_accept_length ≈ 3.0–3.5, spec_accept_rate ≈ 0.65, num_draft_tokens = 4 across real traffic.
  • GSM8K: 95.5% (5-shot, 200q, 0 invalid). MTP is accuracy-neutral (verified-token threshold = 1.0).
  • Throughput: decode-bound ceiling ≈ 2900 tok/s / 11 req/s; accept_length ≈ 3.5 stable across concurrency. Full table + caveats in benchmark/summary.md.

Credits

This diff is released under the same MIT license as the underlying GLM-5.2 weights. The added layer-78 weights are unmodified tensors sourced from the official BF16 GLM-5.2 release.


✨ Presented to you with Mind Lab: A Lab for Experiential Intelligence.
