GLM-5.2 NVFP4 + MTP (NEXTN) — layer-78 graft diff

This is a diff/overlay on top of Mapika/GLM-5.2-NVFP4 that enables MTP / NEXTN speculative decoding for GLM-5.2 in NVFP4 precision. It is not a standalone full checkpoint.

What this is

Mapika/GLM-5.2-NVFP4 ships layers 0–77 in NVFP4 but omits layer 78 (the MTP/NEXTN draft layer). Its config.json already declares num_nextn_predict_layers=1, yet SGLang silently skips the missing draft weights, so MTP is effectively inert: the drafts are garbage and no error is raised. This repo adds exactly that missing layer.

model-00046-of-00046.safetensors (~19.9 GB, BF16): layer 78, the MTP draft decoder, grafted from the original BF16 checkpoint zai-org/GLM-5.2 (shards model-00270–00274-of-00282). It holds 791 tensors: 790 in BF16 plus one fp32 tensor, e_score_correction_bias.
model.safetensors.index.json: rebuilt so the 791 layer-78 weights point at the new shard 46 (total_size updated 439842893136 → 459748734800).
config.json, generation_config.json, chat_template.jinja, hf_quant_config.json, tokenizer files: copies from Mapika for standalone usability.
benchmark/: GSM8K accuracy + concurrency sweep results (see below).
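
The index rewrite described above can be sanity-checked with a short script. This is a minimal sketch, not the tooling used to build this repo: it relies on the standard Hugging Face index layout (metadata.total_size plus weight_map), and the tensor-name prefix model.layers.78. is an assumption that the source does not confirm.

```python
import json

NEW_SHARD = "model-00046-of-00046.safetensors"
LAYER_PREFIX = "model.layers.78."  # assumed tensor-name prefix, adjust if names differ


def layer78_summary(index: dict) -> dict:
    """Summarize how an index maps tensors onto the grafted shard."""
    weight_map = index["weight_map"]
    in_new_shard = [name for name, shard in weight_map.items() if shard == NEW_SHARD]
    # Layer-78 tensors that still point at some other shard indicate a stale index.
    stray = [
        name
        for name, shard in weight_map.items()
        if name.startswith(LAYER_PREFIX) and shard != NEW_SHARD
    ]
    return {
        "tensors_in_new_shard": len(in_new_shard),
        "layer78_outside_new_shard": stray,
        "total_size": index["metadata"]["total_size"],
    }


# Bytes added by the graft, from the two total_size values quoted above.
added = 459748734800 - 439842893136
print(added, round(added / 1e9, 1))  # 19905841664 19.9
```

To run it against a real directory, load the file with json.load(open("./glm52-nvfp4/model.safetensors.index.json")) and pass the result to layer78_summary. The size delta of about 19.9 GB matches the stated size of the new shard.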

Why layer 78 must be BF16

SGLang (glm4_moe_nextn.py, deepseek_nextn.py) forces quant_config=None for the draft decoder when the main model uses modelopt_fp4, so the MTP MoE experts must be BF16. An NVFP4 or FP8 layer 78 would crash when loaded into the BF16 FusedMoE slot (uint8/fp8 dtype mismatch). This is why Mapika's repo loads fine but MTP is inert, and why this graft uses the BF16 original rather than a quantized layer 78.
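
Since the draft layer has to stay unquantized, it can be worth confirming the dtypes in a shard before serving. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table) using the standard library, so it does not load any weights. The function name dtype_counts is illustrative.

```python
import json
import struct
from collections import Counter


def dtype_counts(path: str) -> Counter:
    """Count tensors per dtype by parsing a safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return Counter(entry["dtype"] for entry in header.values())
```

Calling dtype_counts("./glm52-nvfp4/model-00046-of-00046.safetensors") should report dtype tags such as BF16 and F32. Tags like U8 or F8_E4M3 in that shard would suggest a quantized layer 78, which is the case the paragraph above warns about.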

How to use — assemble with the Mapika repo

The full checkpoint = Mapika's 45 shards (layers 0–77, NVFP4) + this shard-46 (layer 78, BF16) + the rebuilt index.
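
Once the directory is assembled, one way to confirm that the composition above is complete is to compare the index against the files on disk. This is a minimal sketch: it takes shard names from the index's weight_map rather than assuming a naming pattern, and missing_shards is a hypothetical helper, not part of any library.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """Return shard files that the index references but the directory lacks."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    referenced = sorted(set(index["weight_map"].values()))
    return [name for name in referenced if not (root / name).is_file()]
```

An empty list from missing_shards("./glm52-nvfp4") means every shard named in the rebuilt index is present. A result that lists model-00046-of-00046.safetensors means the overlay copy was skipped.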

# 1. Pull the full Mapika repo (provides model-00001..00045 NVFP4 shards)
huggingface-cli download Mapika/GLM-5.2-NVFP4 \
  --local-dir ./glm52-nvfp4 --local-dir-use-symlinks False

# 2. Overlay this diff: copy shard-46 + the rebuilt index over it
huggingface-cli download sant1an/GLM-5.2-NVFP4-MTP \
  --local-dir ./glm52-nvfp4-diff --local-dir-use-symlinks False
cp ./glm52-nvfp4-diff/model-00046-of-00046.safetensors  ./glm52-nvfp4/
cp ./glm52-nvfp4-diff/model.safetensors.index.json       ./glm52-nvfp4/   # overwrites Mapika index
# config/tokenizer already match; copy if you started from a bare Mapika dir.

# 3. Serve with MTP enabled (self-spec: draft path = model path)
python -m sglang.launch_server \
  --model-path ./glm52-nvfp4 \
  --quantization modelopt_fp4 \
  --moe-runner-backend flashinfer_cutlass \
  --speculative-algorithm NEXTN \
  --context-length 32768   # 32768 was the bench setting; use a larger context for production

NEXTN resolves to an EAGLE worker internally (server_info.speculative_algorithm == "EAGLE", speculative_draft_model_path auto-points at the model path). Do not set --speculative-draft-model-path. --moe-runner-backend flashinfer_cutlass propagates to the draft MoE; if the BF16 draft MoE misbehaves, try --speculative-moe-runner-backend triton.
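
The server_info check mentioned above can be scripted. This is a sketch under assumptions: it assumes the server listens on SGLang's default port 30000, that server info is served at /get_server_info, and that speculative_algorithm is a top-level key in the response. Verify all three against the SGLang version in use.

```python
import json
from urllib.request import urlopen


def speculative_algorithm_resolved(server_info: dict) -> bool:
    """True if the server reports the EAGLE worker that NEXTN resolves to."""
    return server_info.get("speculative_algorithm") == "EAGLE"


def fetch_server_info(base_url: str = "http://127.0.0.1:30000") -> dict:
    """Fetch server info; the endpoint path and port are assumptions."""
    with urlopen(f"{base_url}/get_server_info", timeout=10) as resp:
        return json.load(resp)
```

With a server running, speculative_algorithm_resolved(fetch_server_info()) confirms only that speculation was configured. It does not prove the draft weights loaded, so the accept-length metrics remain the real signal that MTP is working.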

Verified results (2026-06-25, 8×B200)

MTP active: spec_accept_length ≈ 3.0–3.5, spec_accept_rate ≈ 0.65, num_draft_tokens = 4 across real traffic.
GSM8K: 95.5% (5-shot, 200q, 0 invalid). MTP is accuracy-neutral (verified-token threshold = 1.0).
Throughput: decode-bound ceiling ≈ 2900 tok/s / 11 req/s; accept_length ≈ 3.5 stable across concurrency. Full table + caveats in benchmark/summary.md.

Credits

Base NVFP4 weights: Mapika/GLM-5.2-NVFP4 (NVFP4 quantization of GLM-5.2).
Layer-78 source (BF16): zai-org/GLM-5.2.
Serving: SGLang.

This diff is released under the same MIT license as the underlying GLM-5.2 weights. The added layer-78 weights are unmodified tensors sourced from the official BF16 GLM-5.2 release.

✨ Presented to you with Mind Lab: A Lab for Experiential Intelligence.
