GLM-5.2-MTP-INT4
Standalone MTP (multi-token-prediction) draft module for speculative decoding of GLM-5.2, quantized to compressed-tensors INT4 (4-bit, group-size 32, asymmetric, pack-quantized) to match an AWQ-INT4 serving target.
This is the unpruned (256-expert) variant, intended to pair with the full cyankiwi/GLM-5.2-AWQ-INT4 target. For the 15%-pruned target, use the sibling CosmicRaisins/GLM-5.2-MTP-INT4-aligned (218 experts).
Why this exists
vLLM loads a separate MTP draft through the target model's quantization machinery, so the draft's quant scheme must match the target's exactly. A raw NVFP4 MTP extract will not load against an INT4 target (KeyError on weight-scale tensors). This artifact is the layer-78 MTP module re-expressed in the target's INT4 scheme and key-aligned to vLLM's DeepSeekMTP loader.
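The matching requirement can be checked before launching a server with a rough pre-flight comparison of the two checkpoints' quantization settings. The sketch below is not what vLLM does internally. It assumes each checkpoint's `config.json` carries a `quantization_config` block, and the field names and values in the example dictionaries are illustrative placeholders rather than the contents of any real checkpoint. The helper names `flatten` and `quant_mismatches` are hypothetical.

```python
def flatten(d, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} for easy comparison."""
    out = {}
    for key, value in d.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out


def quant_mismatches(target_cfg, draft_cfg):
    """Return {field: (target_value, draft_value)} for every differing field."""
    t = flatten(target_cfg.get("quantization_config", {}))
    d = flatten(draft_cfg.get("quantization_config", {}))
    return {
        key: (t.get(key), d.get(key))
        for key in sorted(set(t) | set(d))
        if t.get(key) != d.get(key)
    }


# Illustrative (made-up) configs: the draft differs in group size and symmetry.
target = {
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "format": "pack-quantized",
        "config_groups": {
            "group_0": {"weights": {"num_bits": 4, "group_size": 32, "symmetric": False}}
        },
    }
}
draft = {
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "format": "pack-quantized",
        "config_groups": {
            "group_0": {"weights": {"num_bits": 4, "group_size": 128, "symmetric": True}}
        },
    }
}

for field, (t_val, d_val) in quant_mismatches(target, draft).items():
    print(f"{field}: target={t_val!r} draft={d_val!r}")
```

In practice the two dictionaries would come from `json.load` on each checkpoint's `config.json`. An empty result only says the declared schemes agree, not that every tensor key lines up with the loader.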
Lineage
- Extract the layer-78 MTP module from 0xSero/GLM-5.2-NVFP4-REAP-469B (REAP leaves layer 78 unpruned → 256 experts ≈ base GLM-5.2 MTP).
- Dequantize NVFP4 → BF16 (verified elementwise-identical to vLLM's reference dequant).
- Re-quantize to compressed-tensors INT4 (group-32, asymmetric) matching the AWQ-INT4 target scheme.
- Key/shape-align attention + shared-expert Linears to vLLM DeepSeekMTP; routed experts pass through. No expert pruning.
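The re-quantization step can be illustrated with a minimal NumPy sketch of asymmetric 4-bit quantization with a group size of 32. It is not the script used to build this artifact and it does not reproduce the compressed-tensors pack-quantized layout: codes are left unpacked, one per byte, and the function names are hypothetical. It also assumes each group's range is widened to include zero, a common convention that this card does not confirm.

```python
import numpy as np

GROUP_SIZE = 32  # weights along the input dimension share one scale/zero-point
QMAX = 15        # largest unsigned 4-bit code


def quantize_int4_asym(w, group_size=GROUP_SIZE):
    """Quantize a 2-D weight matrix to unsigned 4-bit codes, group-wise."""
    rows, cols = w.shape
    if cols % group_size != 0:
        raise ValueError("input dimension must be a multiple of group_size")
    g = w.astype(np.float32).reshape(rows, cols // group_size, group_size)

    # Assumption: widen each group's range to include 0 so the zero-point
    # always falls inside [0, QMAX].
    lo = np.minimum(g.min(axis=-1, keepdims=True), 0.0)
    hi = np.maximum(g.max(axis=-1, keepdims=True), 0.0)

    scale = (hi - lo) / QMAX
    scale = np.where(scale == 0.0, 1.0, scale).astype(np.float32)  # all-zero group
    zero_point = np.clip(np.round(-lo / scale), 0, QMAX)

    codes = np.clip(np.round(g / scale) + zero_point, 0, QMAX).astype(np.uint8)
    return codes, scale, zero_point


def dequantize_int4_asym(codes, scale, zero_point):
    """Invert quantize_int4_asym, returning a (rows, cols) float32 matrix."""
    g = (codes.astype(np.float32) - zero_point) * scale
    return g.reshape(g.shape[0], -1)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
codes, scale, zero_point = quantize_int4_asym(w)
w_hat = dequantize_int4_asym(codes, scale, zero_point)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The per-group scale and zero-point are the extra tensors a loader expects alongside the packed weights, which is why a checkpoint in a different scheme fails on missing keys rather than loading with degraded quality.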
Usage (vLLM)
Serve the AWQ-INT4 target with this draft:
--speculative-config '{"model": "CosmicRaisins/GLM-5.2-MTP-INT4-unpruned", "method": "mtp", "num_speculative_tokens": 3}'
Set num_speculative_tokens (k) to 3–5. k=3 is a good default on mixed corpora; higher k helps on predictable spans (e.g. code).
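When k is swept across runs, it can be convenient to generate the flag rather than hand-edit the JSON. The sketch below builds a `vllm serve` command line with the speculative config from this card. The helper name `serve_command` is hypothetical, the model IDs are the ones quoted above, and a real deployment would likely need further flags (parallelism, memory limits) that this card does not specify.

```python
import json
import shlex

TARGET = "cyankiwi/GLM-5.2-AWQ-INT4"
DRAFT = "CosmicRaisins/GLM-5.2-MTP-INT4-unpruned"


def serve_command(k: int = 3) -> str:
    """Return a shell-quoted `vllm serve` command using the MTP draft."""
    if k < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    spec = {"model": DRAFT, "method": "mtp", "num_speculative_tokens": k}
    argv = ["vllm", "serve", TARGET, "--speculative-config", json.dumps(spec)]
    # shlex.join quotes the JSON so it survives as a single shell argument.
    return shlex.join(argv)


for k in (3, 4, 5):
    print(serve_command(k))
```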
Notes
- 1 next-token-prediction layer (layer 78), 256 routed experts, ~5.9 GB. Architecture GlmMoeDsaForCausalLM / glm_moe_dsa.
- Built and structurally verified on DGX Spark (GB10, sm_121); key layout is identical to the served 218-expert sibling. The pruned sibling benched ~1.4–2.1× decode speedup (mean acceptance length 3.4–5.0) against its 15%-REAP target; this variant targets the full unpruned model.