GLM-5.2-MTP-INT4

Standalone MTP (multi-token-prediction) draft module for speculative decoding of GLM-5.2, quantized to compressed-tensors INT4 (4-bit, group-size 32, asymmetric, pack-quantized) to match an AWQ-INT4 serving target.

This is the unpruned (256-expert) variant, intended to pair with the full cyankiwi/GLM-5.2-AWQ-INT4 target. For the 15%-pruned target, use the sibling CosmicRaisins/GLM-5.2-MTP-INT4-aligned (218 experts).

Why this exists

vLLM loads a separate MTP draft through the target model's quantization machinery, so the draft's quant scheme must match the target's exactly. A raw NVFP4 MTP extract will not load against an INT4 target (KeyError on weight-scale tensors). This artifact is the layer-78 MTP module re-expressed in the target's INT4 scheme and key-aligned to vLLM's DeepSeekMTP loader.
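A quick way to check the "schemes must match" requirement before serving is to compare the quantization_config blocks of the two config.json files. The sketch below is illustrative only: the field names (config_groups, weights, num_bits, type, symmetric, strategy, group_size, format) are assumed from the compressed-tensors config layout, and the helper names are hypothetical, not part of vLLM.

```python
import json
from pathlib import Path

# Assumed compressed-tensors weight fields; adjust if your config.json differs.
FIELDS = ("num_bits", "type", "symmetric", "strategy", "group_size")


def load_quant_cfg(model_dir):
    """Read the quantization_config block from a local model directory."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    return cfg.get("quantization_config", {})


def weight_schemes(quant_cfg):
    """Collect the weight-quantization settings of every config group."""
    schemes = set()
    for group in quant_cfg.get("config_groups", {}).values():
        weights = group.get("weights") or {}
        schemes.add(tuple(weights.get(name) for name in FIELDS))
    return schemes


def schemes_match(draft_cfg, target_cfg):
    """True only if both configs declare the same non-empty weight schemes and format."""
    draft = weight_schemes(draft_cfg)
    return (
        bool(draft)
        and draft == weight_schemes(target_cfg)
        and draft_cfg.get("format") == target_cfg.get("format")
    )
```

For local snapshots of both repositories, `schemes_match(load_quant_cfg(draft_dir), load_quant_cfg(target_dir))` should be true before the pair is served. A mismatch here is a hint, not proof, that the loader will fail.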

Lineage

  1. Extract the layer-78 MTP module from 0xSero/GLM-5.2-NVFP4-REAP-469B (REAP leaves layer 78 unpruned → 256 experts ≈ base GLM-5.2 MTP).
  2. Dequantize NVFP4 → BF16 (verified elementwise-identical to vLLM's reference dequant).
  3. Re-quantize to compressed-tensors INT4 (group-32, asymmetric) matching the AWQ-INT4 target scheme.
  4. Key/shape-align attention + shared-expert Linears to vLLM DeepSeekMTP; routed experts pass through. No expert pruning.
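Step 3 can be pictured with a minimal pure-Python sketch of asymmetric, group-wise 4-bit quantization. This is plain min/max round-to-nearest written for illustration, not the pipeline actually used to build this artifact, and the nibble order in pack_int4 is an assumed layout rather than the documented pack-quantized format.

```python
def quantize_group(values, num_bits=4):
    """Asymmetric quantization of one group: q = round(v / scale) + zero_point."""
    qmax = (1 << num_bits) - 1                      # 15 for 4-bit codes
    lo = min(min(values), 0.0)                      # keep 0.0 inside the range
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0                 # all-zero group falls back to 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Inverse mapping: v is approximately (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


def quantize_row(row, group_size=32):
    """Quantize one weight row in independent groups of group_size values."""
    if len(row) % group_size:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


def pack_int4(codes):
    """Pack eight 4-bit codes per 32-bit word, lowest nibble first (assumed order)."""
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[i:i + 8]):
            word |= code << (4 * j)
        words.append(word)
    return words
```

Each group of 32 weights carries its own scale and zero point, so the per-weight reconstruction error stays within about half of that group's scale.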

Usage (vLLM)

Serve the AWQ-INT4 target with this draft:

--speculative-config '{"model": "CosmicRaisins/GLM-5.2-MTP-INT4-unpruned", "method": "mtp", "num_speculative_tokens": 3}'

Set num_speculative_tokens to 3–5: k=3 is a good default on mixed corpora, while higher k helps on predictable spans (e.g. code).
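When the launch command is assembled by a script, generating the JSON instead of hand-writing it avoids shell-quoting mistakes. The sketch below assumes the server is started with `vllm serve`; the helper name serve_argv is hypothetical, and the model names are the ones given in this card.

```python
import json
import shlex


def serve_argv(target_model, draft_model, num_speculative_tokens=3):
    """Build an argv list for `vllm serve` with an MTP speculative config."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    spec = {
        "model": draft_model,
        "method": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return ["vllm", "serve", target_model,
            "--speculative-config", json.dumps(spec)]


if __name__ == "__main__":
    argv = serve_argv("cyankiwi/GLM-5.2-AWQ-INT4",
                      "CosmicRaisins/GLM-5.2-MTP-INT4-unpruned", 3)
    # shlex.join quotes the JSON so the printed line can be pasted into a shell.
    print(shlex.join(argv))
```

Passing the list straight to subprocess.run sidesteps quoting entirely; shlex.join is only needed for printing a copy-pasteable command.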

Notes

  • 1 next-token-prediction layer (layer 78), 256 routed experts, ~5.9 GB. Architecture GlmMoeDsaForCausalLM / glm_moe_dsa.
  • Built and structurally verified on DGX Spark (GB10, sm_121); key layout is identical to the served 218-expert sibling. The pruned sibling benchmarked at ~1.4–2.1× decode speedup (mean acceptance length 3.4–5.0) against its 15%-REAP target; this variant targets the full unpruned model.
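To relate mean acceptance length to the choice of k, a common back-of-the-envelope model treats each draft token as accepted independently with probability alpha. This is an assumption made for illustration, not something measured for this model. Under it, a step with k draft tokens emits 1 + alpha + ... + alpha^k tokens on average, counting the token the target always contributes.

```python
def expected_tokens_per_step(alpha, k):
    """Mean tokens emitted per verification step under i.i.d. acceptance.

    alpha: assumed per-token acceptance probability of the draft (0..1).
    k:     num_speculative_tokens.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Geometric sum 1 + alpha + ... + alpha**k; the upper bound is k + 1.
    return sum(alpha ** i for i in range(k + 1))


if __name__ == "__main__":
    for k in (3, 4, 5):
        print(k, round(expected_tokens_per_step(0.8, k), 3))
```

The sum saturates quickly when alpha is low, which fits the advice that larger k pays off mainly on predictable text where acceptance is high.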