GLM-5.2-MTP-INT4-aligned

Separate-draft MTP (multi-token prediction) head for speculative decoding with CosmicRaisins/GLM-5.2-AWQ-INT4-15pct. The cyankiwi/GLM-5.2-AWQ-INT4 quantization drops GLM-5.2's native MTP layer, so this repo reconstructs that layer as a standalone INT4 draft model that vLLM loads via --speculative-config (which also sidesteps vLLM #35041 / #38494). Format: INT4 compressed-tensors, a single MTP layer, 218 experts to match the pruned target. Size: ~4.8 GB.

Use

--speculative-config '{"model":"CosmicRaisins/GLM-5.2-MTP-INT4-aligned","method":"mtp","num_speculative_tokens":3,"attention_backend":"FLASHMLA_SPARSE"}'
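The flag above needs to be passed as one valid JSON string, which is easy to break with shell quoting. The following is a minimal sketch, not part of the serving stack, that builds the JSON programmatically and prints a shell-safe command line. It assumes the standard `vllm serve <model>` entry point and reuses the target and draft names from this card; the helper name `build_serve_argv` is hypothetical.

```python
import json
import shlex
from typing import List

# Model names taken from this card.
TARGET = "CosmicRaisins/GLM-5.2-AWQ-INT4-15pct"
DRAFT = "CosmicRaisins/GLM-5.2-MTP-INT4-aligned"


def build_serve_argv(num_speculative_tokens: int = 3) -> List[str]:
    """Return an argv list for `vllm serve` with the MTP draft attached.

    Hypothetical helper: it only assembles arguments, it does not start vLLM.
    """
    spec = {
        "model": DRAFT,
        "method": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
        "attention_backend": "FLASHMLA_SPARSE",
    }
    # Compact separators keep the JSON identical in shape to the flag above.
    spec_json = json.dumps(spec, separators=(",", ":"))
    return ["vllm", "serve", TARGET, "--speculative-config", spec_json]


if __name__ == "__main__":
    # shlex.join quotes the JSON so it survives copy-paste into a shell.
    print(shlex.join(build_serve_argv()))
```

Any other serving flags (tensor parallelism, context length, and so on) would be appended to the returned list; they are left out here because this card does not specify them.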

k=3 (num_speculative_tokens) benchmarked best for me on a synthetic corpus. Z.ai recommends k=5; I haven't compared the two in real-world use yet. Serving stack: github.com/CosmicRaisins/glm-5.2-gb10.
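For intuition on why the best k depends on the workload, here is a rough sketch using the standard speculative-decoding estimate: if each drafted token is accepted independently with probability alpha, one verification step yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The independence assumption is a simplification, the alpha values below are hypothetical placeholders rather than measurements of this model, and the estimate ignores the cost of running the draft, which grows with k.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha (a simplification; real acceptance is correlated).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the bonus token from the target.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance rates, for illustration only.
    for alpha in (0.5, 0.7, 0.9):
        for k in (3, 5):
            tokens = expected_tokens_per_step(alpha, k)
            print(f"alpha={alpha:.1f} k={k}: {tokens:.2f} tokens/step")
```

At low acceptance rates the gain from going from k=3 to k=5 is small, so the extra draft work can outweigh it; at high acceptance rates the larger k pays off more. Measuring the acceptance rate on real traffic is the way to settle the choice.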

Lineage & license

The MTP weights trace to GLM-5.2's native MTP head (Z.ai, MIT), sourced via 0xSero's NVFP4 layer-78 (0xSero/GLM-5.2-NVFP4-REAP-469B), then dequantized, re-quantized to INT4, expert-pruned to 218, and aligned to the DeepSeekMTP loader layout. The reconstruction and alignment are mine.

License: MIT, inherited from the GLM-5.2 base. The MTP weights carry GLM-5.2's MIT grant through 0xSero's quantization (0xSero's repo declares no explicit license). Attributed to Z.ai and 0xSero; not affiliated with either.
