GLM-5.2 NVFP4 Int8Mix

This is an experimental hybrid GLM-5.2 checkpoint for vLLM/B12X serving.

It combines:

  • the dense, attention, shared-expert, special-head, and MTP tensors from QuantTrio/GLM-5.2-Int4-Int8Mix;
  • the non-shared routed MoE expert MLP projections from lukealonso/GLM-5.2-NVFP4;
  • a vLLM-compatible compressed-tensors config update for the fused GLM-5.2 runtime module names used by current vLLM.

The repository name currently contains MTPFix because the first upload used that internal working name. The actual checkpoint identity is better described as GLM-5.2 NVFP4 Int8Mix.

Provenance

Base model:

Quantized sources:

This is not a full re-quantization from BF16. It is a merged checkpoint:

  • QuantTrio supplies the W8A16 dense/attention/shared/MTP parts and BF16 unquantized tensors.
  • Luke NVFP4 supplies the routed expert MLP projections for layers 3-77.
  • The config was adjusted so vLLM can load the fused MTP names (mtp_block, fused_qkv_a_proj) used at runtime.

Quantization layout

The effective quantization_config uses compressed-tensors with format: nvfp4-pack-quantized.

Scope Format
model.layers.0 and ignored special paths BF16
Dense attention and ordinary linear weights in layers 1-77 W8A16 INT8, symmetric group quantization, group size 128
Shared experts in layers 1-77 W8A16 INT8, symmetric group quantization, group size 128
Non-shared routed MoE experts in layers 3-77 NVFP4-style float 4-bit weights, tensor-group strategy, group size 16
Layer 78 MTP block W8A16 INT8, channel-wise
mlp.gate, attention indexer, norms, embeddings, and special heads BF16 / ignored

Compared with the original QuantTrio checkpoint, the routed expert tensors are not INT4 group-size-128 weights anymore. They are replaced by Luke's NVFP4 expert tensors.

Compared with Luke's NVFP4 checkpoint, this checkpoint does not keep the dense and attention parts in the same BF16/NVFP4 ModelOpt layout. Those parts come from QuantTrio's compact W8A16 export.

Notes on NVFP4 expert quality

Luke's NVFP4 checkpoint quantizes directly from the BF16 GLM-5.2 checkpoint using NVIDIA Model Optimizer. In that source checkpoint, only the non-shared MoE expert MLP projections are quantized to NVFP4; attention weights, early dense MLP layers, and shared experts are left unquantized. The calibration uses natural top-k routing rather than forcing all experts active, with broad sample coverage to better match the distributions experts see during inference.

That matters for this hybrid checkpoint because the routed MoE experts are the largest parameter component and the most routing-sensitive part of GLM-5.2. NVFP4 uses small 16-value floating-point blocks with FP8 scale metadata, while the original QuantTrio expert path uses integer 4-bit group quantization with group size 128. The finer scaling granularity is one reason the NVFP4 expert path can preserve the BF16 distribution better in local KLD tests.

Measured local distribution quality

KLD/JS is a local next-token distribution proxy, not a full model-quality benchmark. It is useful for detecting numerical regressions, but deployment quality should also be checked with long-context tasks, coding prompts, tool calling, repetition/CJK watchdogs, MTP acceptance, throughput, and VRAM.

Repeated local KLD measurements from the vLLM/B12X test stack showed:

Checkpoint Prefill KLD mean Decode JS mean
Luke NVFP4 0.068257 0.00000236
QuantTrio GLM-5.2 Int4-Int8Mix 0.070448 0.00000286
This hybrid, W8A16 + Luke NVFP4 experts 0.071182 0.00000264

Interpretation:

  • Luke NVFP4 remains the strongest of these practical-size checkpoints in the repeated local distribution test.
  • This hybrid is close to QuantTrio on prefill and slightly better on repeated decode JS in that run set, but the decode differences are small and overlap run-to-run variance.
  • Do not treat KLD alone as a final quality ranking. It is one signal.

Serving status

This checkpoint was prepared for the local vLLM/B12X GLM-5.2 stack used by local-inference-lab/rtx6kpro. It is not claimed to be a generic drop-in model for every runtime.

Known working class of configuration:

  • vLLM with GLM-5.2 support
  • --quantization compressed-tensors
  • --kv-cache-dtype fp8
  • --attention-backend B12X_MLA_SPARSE
  • --moe-backend b12x
  • B12X A16 expert serving supported

Example shape used in local testing:

vllm serve /path/to/GLM-5.2-NVFP4-Int8Mix \
  --served-model-name GLM-5.2 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 1 \
  --quantization compressed-tensors \
  --attention-backend B12X_MLA_SPARSE \
  --moe-backend b12x \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45

For exact Docker images and launch recipes used in local benchmarking, see the GLM-5.2 v12 notes in:

https://github.com/local-inference-lab/rtx6kpro/blob/master/models/glm5.2_v12.md

File size

Approximate uploaded size: 409.33 GiB.

License

The model card inherits the MIT license metadata from the source GLM-5.2 release and source model cards. Check the upstream model cards for complete license and usage details.

Downloads last month
182
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for festr2/GLM-5.2-Int8Mix-NVFP4

Base model

zai-org/GLM-5.2
Quantized
(1)
this model