canada-quant/GLM-5.2-W4A16-MTP · Loading the model

Loading the model

by madeby561 - opened 2 days ago

Discussion

madeby561

2 days ago

Does this require a custom vLLM image to load k_norm?

pastapaul

Canada Quant Labs org 2 days ago

Hi @madeby561 — short answer: no custom vLLM image is needed to load k_norm. Those are the DSA-indexer / MLA norm tensors, and they're kept at BF16 in this artifact (the W4A16 quant only touches the routed experts), so they load as ordinary weights through GLM-5.2's own modeling code. You just need a recent stock vLLM (≥ 0.23, which has native GlmMoeDsaForCausalLM) and --trust-remote-code. That part is the same regardless of GPU.

The validated setup is Hopper / H200: stock vLLM ≥ 0.23 on a CUDA ≥ 12.8 toolchain (12.9 recommended — upstream's vllm/vllm-openai:glm52-cu129 image is the easy path; the DSA indexer JIT-compiles DeepGEMM kernels that need ≥ 12.8). Serve with --enable-expert-parallel, since the asymmetric W4A16 MoE needs expert parallelism:

  --tensor-parallel-size 8 --enable-expert-parallel \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
  --reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
  --max-model-len 1048576 --gpu-memory-utilization 0.90 --trust-remote-code```


We're benchmarking on 8× RTX PRO 6000 (Blackwell / SM120) right now and will post that config + numbers as soon as we have them. The k_norm loading answer above is unchanged on that hardware — if anything extra is needed there it'll be on the INT4 Marlin expert-kernel side, not k_norm.

If you do hit a k_norm-specific error, it's almost always an older vLLM (< 0.23) or a missing --trust-remote-code — share the traceback + your vLLM/CUDA versions and I'll help.

As noted, I can't post this for you from this session (no reachable HF write token, and the HF MCP here is read-only). Paste it into the discussion, or set HF_TOKEN in the environment and I'll post it via the HF API. Want me to also drop this into a short FAQ on the model card / README on the claude/k-norm-vllm-loading-wsc2k6 branch?

madeby561

2 days ago

A RTX Pro 6000 config would be amazing. Thank you for the reply

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment