config.json ignore-list needs linear_attn patterns — model produces garbage otherwise

#4
by cghart123 - opened

Hi — thanks for publishing this quant. Running into a correctness issue I wanted to flag:

On vllm-project/vllm 0.20.0.dev + compressed-tensors, this model loads but produces degenerate output (every token collapses to ! at temperature 0). Startup logs show:

Parameter model.language_model.layers.N.linear_attn.in_proj_qkvz.weight not found in params_dict, skip loading
Parameter model.language_model.layers.N.linear_attn.in_proj_ba.weight not found in params_dict, skip loading

for most hybrid-attention layers. vLLM skips them without raising an error, so inference runs with effectively zero-valued linear-attention weights.
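Because the skip messages follow a fixed format, a startup log can be scanned for them before trusting any output. This is a minimal sketch that assumes the message text is exactly as quoted above; `skipped_parameters` and `summarize` are hypothetical helper names, not part of vLLM.

```python
import re
from collections import Counter

# Message format copied from the log lines quoted in this post.
SKIP_RE = re.compile(r"Parameter (\S+) not found in params_dict, skip loading")


def skipped_parameters(log_text: str) -> list[str]:
    """Return every parameter name that the log reports as skipped."""
    return SKIP_RE.findall(log_text)


def summarize(names: list[str]) -> Counter:
    """Collapse per-layer repeats by replacing the layer index with N."""
    return Counter(re.sub(r"\.layers\.\d+\.", ".layers.N.", n) for n in names)
```

A non-empty result from `skipped_parameters` on a captured startup log is a reason to stop and inspect the checkpoint before running any evaluation.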

Root cause: the quantization_config.ignore list in config.json uses the old split tensor names (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a) — but the checkpoint's safetensors actually contain the new combined names (in_proj_qkvz, in_proj_ba). vLLM has no fallback for this mismatch, so it treats the combined tensors as unknown and skips them.
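The mismatch can be checked offline by comparing the ignore list against the tensor names in the checkpoint. The sketch below makes a simplifying assumption about the matching rules: entries prefixed with `re:` are treated as regular expressions applied to module names, and everything else as an exact module name. `matches` and `audit_ignore` are hypothetical helpers, and the patterns used with them should be taken from the actual config.json rather than from this example.

```python
import re


def matches(pattern: str, module_name: str) -> bool:
    """Assumed semantics: 're:' entries are regexes, anything else is an exact name."""
    if pattern.startswith("re:"):
        return re.match(pattern[3:], module_name) is not None
    return pattern == module_name


def audit_ignore(ignore, tensor_names, family="linear_attn.in_proj"):
    """Return (dead patterns, uncovered modules) for one checkpoint.

    dead:      ignore entries that match no module in the checkpoint
    uncovered: modules containing `family` that no ignore entry matches
    """
    # Module name = tensor name without its final component (".weight", ".bias", ...).
    modules = sorted({name.rsplit(".", 1)[0] for name in tensor_names})
    dead = [p for p in ignore if not any(matches(p, m) for m in modules)]
    uncovered = [
        m for m in modules
        if family in m and not any(matches(p, m) for p in ignore)
    ]
    return dead, uncovered
```

For a sharded checkpoint, the tensor names can be read from the keys of `weight_map` in `model.safetensors.index.json`. Old-name patterns showing up as dead while the fused modules show up as uncovered is the signature of the problem described here.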

Adding these two patterns to quantization_config.ignore fixes it (the actual tensors are BF16 and not NVFP4-quantized, so ignore is the right semantics):

"re:.*linear_attn\\.in_proj_qkvz$",
"re:.*linear_attn\\.in_proj_ba$"

After this patch, the model produces coherent output (measured 51.9 tok/s on DGX Spark GB10 via bjk110's vLLM image).
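For anyone applying the patch to a local copy, here is a minimal sketch that appends the two patterns to `quantization_config.ignore` without duplicating them. `add_ignore_patterns` and `patch_file` are hypothetical helper names; the only assumption about the file is that the ignore list lives at `quantization_config.ignore`, as described above.

```python
import json
from pathlib import Path

# The two patterns from this post as Python raw strings.
# JSON serialization doubles the backslash, giving "\\." in config.json.
NEW_PATTERNS = [
    r"re:.*linear_attn\.in_proj_qkvz$",
    r"re:.*linear_attn\.in_proj_ba$",
]


def add_ignore_patterns(config: dict, patterns=NEW_PATTERNS) -> dict:
    """Return a copy of config with patterns appended to quantization_config.ignore."""
    patched = json.loads(json.dumps(config))  # deep copy via JSON round trip
    ignore = patched.setdefault("quantization_config", {}).setdefault("ignore", [])
    for pattern in patterns:
        if pattern not in ignore:  # keep the operation idempotent
            ignore.append(pattern)
    return patched


def patch_file(path: str) -> None:
    """Rewrite a local config.json in place with the extra ignore patterns."""
    cfg_path = Path(path)
    config = json.loads(cfg_path.read_text())
    cfg_path.write_text(json.dumps(add_ignore_patterns(config), indent=2) + "\n")
```

Calling `patch_file("config.json")` inside the downloaded model directory is enough; running it twice leaves the file unchanged after the first run, apart from formatting.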

The same pattern affects other community NVFP4 quants of Qwen3-Next-family models. It is also tracked against the inference engines: I filed a vLLM issue proposing loud-warn / auto-alias handling at https://github.com/vllm-project/vllm/issues/40252, and it was originally surfaced for sglang at https://github.com/sgl-project/sglang/issues/20973.

Would be a small but meaningful fix for downstream users — they currently see garbage output with no obvious error in the logs. Happy to help test a re-uploaded version if useful.

Reference / reproduction notes: https://gist.github.com/cghart/5374e7f749cb02e1ea96282893de64bd

Red Hat AI org

Hi @cghart123, as of today's vLLM nightly I am able to serve the model as-is without issue.
I am on vLLM 0.19.2rc1.dev55+g21b086d0a and don't see any of the warnings you listed in the issue.

Do you have a reference PR where this change was made?

Good catch. Re-tested on vllm==0.20.0.dev0+cu132 (image ghcr.io/bjk110/vllm-spark:v020-ngc2603, built 2026-04-17) with the unpatched HF config — coherent output at temp 0, no weight-skip warnings. https://github.com/vllm-project/vllm/pull/34697 (Feb 18) is the fix: it added "in_proj_qkvz": ["in_proj_qkv","in_proj_z"] and "in_proj_ba": ["in_proj_b","in_proj_a"] to packed_modules_mapping in qwen3_5.py, so the old-name ignore list now aliases onto the fused tensors at load time. Equivalent mapping for Qwen3-Next landed in https://github.com/vllm-project/vllm/pull/35413.
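To illustrate why the mapping makes the old-name ignore list work again, here is a toy model of the aliasing. It is not vLLM's code: the dictionary is copied from the mapping quoted above, while `is_ignored` and the rule "a fused module counts as ignored only when every one of its shards is ignored" are assumptions made for this sketch.

```python
import re

# Copied from the packed_modules_mapping entries quoted above.
PACKED_MODULES_MAPPING = {
    "in_proj_qkvz": ["in_proj_qkv", "in_proj_z"],
    "in_proj_ba": ["in_proj_b", "in_proj_a"],
}


def _matches(pattern: str, name: str) -> bool:
    # Assumed semantics: "re:" entries are regexes, others are exact names.
    if pattern.startswith("re:"):
        return re.match(pattern[3:], name) is not None
    return pattern == name


def is_ignored(module_name: str, ignore, mapping=PACKED_MODULES_MAPPING) -> bool:
    """Check a (possibly fused) module against an ignore list via its shard names."""
    prefix, _, leaf = module_name.rpartition(".")
    # A fused leaf expands to its split names; any other leaf stands for itself.
    shard_leaves = mapping.get(leaf, [leaf])
    shard_names = [f"{prefix}.{s}" if prefix else s for s in shard_leaves]
    return all(any(_matches(p, n) for p in ignore) for n in shard_names)
```

With an ignore list written against the split names, `is_ignored` returns True for the fused `in_proj_qkvz` and `in_proj_ba` modules when the mapping is present and False when it is empty, which mirrors the before and after behaviour reported in this thread.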

My 4/18 repro must have been on a pre-#34697 build. I'll close https://github.com/vllm-project/vllm/issues/40252 referencing those PRs. The HF repo still ships the old-name ignore list though, so anyone pinned to a pre-Feb-18 vLLM will still hit silent degradation — happy to submit a small PR updating config.json if you'd like belt-and-braces.

Red Hat AI org
edited Apr 21

Layer fusion is handled on the vLLM side, so this change belongs there. compressed-tensors does not fuse layers in any block; projections such as qkv or gate_up are typically fused by vLLM in its attention and MLP layers. If this failure is associated with a stable release, I would suggest starting a thread in the vLLM Slack to see if there's any interest in cherry-picking the fix, or updating your vLLM version.

dsikka changed discussion status to closed
