Loading the model

#1
by madeby561 - opened

Does this require a custom vLLM image to load k_norm?

Canada Quant Labs org

Hi @madeby561, short answer: no custom vLLM image is needed to load k_norm. Those are the DSA-indexer / MLA norm tensors, and they're kept at BF16 in this artifact (the W4A16 quant only touches the routed experts), so they load as ordinary weights through GLM-5.2's own modeling code. You just need a recent stock vLLM (≥ 0.23, which has native GlmMoeDsaForCausalLM) and --trust-remote-code. That part is the same regardless of GPU.
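To confirm the version requirement before serving, something like the following should work. It is a minimal sketch: it assumes vLLM lives in the environment that `python3` resolves to and that `sort -V` (GNU coreutils version sort) is available. `version_ge` is a hypothetical helper name, not part of vLLM.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on `sort -V` (version sort); pre-release suffixes may not order as expected.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Ask the installed package for its version; this fails cleanly if vLLM is absent.
if installed="$(python3 -c 'import vllm; print(vllm.__version__)' 2>/dev/null)"; then
  if version_ge "$installed" "0.23"; then
    echo "vLLM $installed: new enough"
  else
    echo "vLLM $installed: older than 0.23, upgrade first"
  fi
else
  echo "vLLM not importable in this environment"
fi
```

Run it inside the same container or virtualenv you serve from, since a different environment can report a different version.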

The validated setup is Hopper / H200: stock vLLM ≥ 0.23 on a CUDA ≥ 12.8 toolchain (12.9 recommended; upstream's vllm/vllm-openai:glm52-cu129 image is the easy path, and the DSA indexer JIT-compiles DeepGEMM kernels that need ≥ 12.8). Serve with --enable-expert-parallel, since the asymmetric W4A16 MoE needs expert parallelism:

```bash
# <model-repo-id> is a placeholder for this repository's id.
vllm serve <model-repo-id> \
  --tensor-parallel-size 8 --enable-expert-parallel \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":5}' \
  --reasoning-parser glm45 --tool-call-parser glm47 --enable-auto-tool-choice \
  --max-model-len 1048576 --gpu-memory-utilization 0.90 --trust-remote-code
```
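For the CUDA ≥ 12.8 requirement, a rough check could look like this. It is a sketch under one assumption: that the toolkit the DeepGEMM JIT picks up is the one whose `nvcc` is on PATH, which may not hold in every setup. `cuda_release` and `cuda_ok` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `nvcc --version` output on stdin, print "MAJOR.MINOR".
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeeds when release $1 (MAJOR.MINOR) is at least 12.8.
cuda_ok() {
  local major="${1%%.*}" minor="${1#*.}"
  [ "$major" -gt 12 ] || { [ "$major" -eq 12 ] && [ "$minor" -ge 8 ]; }
}

if command -v nvcc >/dev/null 2>&1; then
  rel="$(nvcc --version | cuda_release)"
  if [ -z "$rel" ]; then
    echo "could not parse a release number from nvcc output"
  elif cuda_ok "$rel"; then
    echo "CUDA $rel: meets the 12.8 minimum"
  else
    echo "CUDA $rel: below 12.8"
  fi
else
  echo "nvcc not found on PATH"
fi
```

If you use the upstream image mentioned above, run the check inside the container rather than on the host.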


We're benchmarking on 8× RTX PRO 6000 (Blackwell / SM120) right now and will post that config + numbers as soon as we have them. The k_norm loading answer above is unchanged on that hardware; if anything extra is needed there, it'll be on the INT4 Marlin expert-kernel side, not k_norm.

If you do hit a k_norm-specific error, it's almost always an older vLLM (< 0.23) or a missing --trust-remote-code. Share the traceback + your vLLM/CUDA versions and I'll help.
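To gather the version details for a report in one go, a small script along these lines may help. It is only a sketch: `env_report` is a hypothetical helper name, and it assumes `python3` and `nvidia-smi` are reachable from the environment you serve from. Anything it cannot find is printed as "unavailable".

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the versions worth pasting next to a traceback.
env_report() {
  local vllm_ver torch_cuda gpu
  # Each probe tolerates a missing tool or package and falls back to "unavailable".
  vllm_ver="$(python3 -c 'import vllm; print(vllm.__version__)' 2>/dev/null || true)"
  torch_cuda="$(python3 -c 'import torch; print(torch.version.cuda)' 2>/dev/null || true)"
  gpu="$(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null | head -n 1 || true)"
  echo "vLLM: ${vllm_ver:-unavailable}"
  echo "CUDA (PyTorch build): ${torch_cuda:-unavailable}"
  echo "GPU / driver: ${gpu:-unavailable}"
}

env_report
```

Paste the three output lines together with the full traceback.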


An RTX Pro 6000 config would be amazing. Thank you for the reply.
