config.json ignore-list needs linear_attn patterns — model produces garbage otherwise
Hi — thanks for publishing this quant. Running into a correctness issue I wanted to flag:
On vllm-project/vllm 0.20.0.dev + compressed-tensors, this model loads but produces degenerate output (every token collapses to ! at temperature 0). Startup logs show:
Parameter model.language_model.layers.N.linear_attn.in_proj_qkvz.weight not found in params_dict, skip loading
Parameter model.language_model.layers.N.linear_attn.in_proj_ba.weight not found in params_dict, skip loading
These warnings appear for most hybrid-attention layers. vLLM skips the tensors without raising an error, so inference runs with effectively zero-valued linear-attention weights.
Root cause: the quantization_config.ignore list in config.json uses the old split tensor names (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a) — but the checkpoint's safetensors actually contain the new combined names (in_proj_qkvz, in_proj_ba). vLLM has no fallback for this mismatch, so it treats the combined tensors as unknown and skips them.
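A minimal sketch of how one might check for this mismatch before serving. It assumes that "re:" entries in the ignore list are regexes matched from the start of the module name and that all other entries are exact module names. The old-name patterns and the helper names are hypothetical, modeled on the split names listed above. For a sharded checkpoint the tensor names can typically be read from the weight_map keys of model.safetensors.index.json.

```python
import re

def is_ignored(module_name: str, ignore: list[str]) -> bool:
    """Return True if any ignore entry covers this module name."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Assumption: regex entries are matched from the start of the name.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False

def uncovered_linear_attn_modules(tensor_names, ignore):
    """List linear_attn in_proj_* modules present in the checkpoint
    that no ignore entry matches."""
    modules = sorted({
        name.rsplit(".", 1)[0]  # drop the trailing ".weight"
        for name in tensor_names
        if ".linear_attn.in_proj_" in name and name.endswith(".weight")
    })
    return [m for m in modules if not is_ignored(m, ignore)]

# Hypothetical old-style ignore list (split names only).
old_ignore = [
    r"re:.*linear_attn\.in_proj_qkv$",
    r"re:.*linear_attn\.in_proj_z$",
    r"re:.*linear_attn\.in_proj_b$",
    r"re:.*linear_attn\.in_proj_a$",
]

# Tensor names in the shape reported by the startup logs, for layer 0.
tensor_names = [
    "model.language_model.layers.0.linear_attn.in_proj_qkvz.weight",
    "model.language_model.layers.0.linear_attn.in_proj_ba.weight",
]

for module in uncovered_linear_attn_modules(tensor_names, old_ignore):
    print("not covered by ignore list:", module)
```

Any module this prints is one whose name in the checkpoint is not matched by the ignore list as written.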
Adding these two patterns to quantization_config.ignore fixes it (the actual tensors are BF16 and not NVFP4-quantized, so ignore is the right semantics):
"re:.*linear_attn\\.in_proj_qkvz$",
"re:.*linear_attn\\.in_proj_ba$"
After this patch, the model produces coherent output (measured 51.9 tok/s on DGX Spark GB10 via bjk110's vLLM image).
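For anyone who wants to apply the patch to a local copy, here is a minimal sketch. It assumes config.json has a top-level quantization_config object; the function names and the path in the usage note are hypothetical. The two patterns are the ones quoted above, written as Python raw strings, so json.dumps restores the doubled backslash on write.

```python
import json
from pathlib import Path

# The two patterns from the post. In JSON the backslash is doubled;
# in a Python raw string it appears once.
NEW_PATTERNS = [
    r"re:.*linear_attn\.in_proj_qkvz$",
    r"re:.*linear_attn\.in_proj_ba$",
]

def add_ignore_patterns(config: dict, patterns=NEW_PATTERNS) -> dict:
    """Append any missing patterns to quantization_config.ignore (in place)."""
    ignore = config["quantization_config"].setdefault("ignore", [])
    for pattern in patterns:
        if pattern not in ignore:  # keep the operation idempotent
            ignore.append(pattern)
    return config

def patch_config_file(path: str) -> None:
    """Read config.json, add the patterns, and write it back."""
    cfg_path = Path(path)
    config = json.loads(cfg_path.read_text())
    add_ignore_patterns(config)
    cfg_path.write_text(json.dumps(config, indent=2) + "\n")
```

Usage would be something like patch_config_file("/path/to/model/config.json") on the downloaded snapshot. Running it twice leaves the file unchanged after the first run.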
The same pattern affects other community NVFP4 quants of Qwen3-Next-family models. It is also tracked on the inference-engine side: I filed a vLLM issue proposing a loud warning or auto-alias handling at https://github.com/vllm-project/vllm/issues/40252, and it was originally surfaced for sglang at https://github.com/sgl-project/sglang/issues/20973.
This would be a small but meaningful fix for downstream users, who currently see garbage output with no obvious error in the logs. Happy to help test a re-uploaded version if useful.
Reference / reproduction notes: https://gist.github.com/cghart/5374e7f749cb02e1ea96282893de64bd
Hi @cghart123, as of today's vLLM nightly I am able to serve the model as-is without issue.
I am on 0.19.2rc1.dev55+g21b086d0a for vllm and don't see any of the warnings you listed in the issue.
Do you have a reference PR where this change was made?
Good catch. Re-tested on vllm==0.20.0.dev0+cu132 (image ghcr.io/bjk110/vllm-spark:v020-ngc2603, built 2026-04-17) with the unpatched HF config — coherent output at temp 0, no weight-skip warnings. https://github.com/vllm-project/vllm/pull/34697 (Feb 18) is the fix: it added "in_proj_qkvz": ["in_proj_qkv","in_proj_z"] and "in_proj_ba": ["in_proj_b","in_proj_a"] to packed_modules_mapping in qwen3_5.py, so the old-name ignore list now aliases onto the fused tensors at load time. Equivalent mapping for Qwen3-Next landed in https://github.com/vllm-project/vllm/pull/35413.
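To illustrate the aliasing idea, here is a simplified sketch. It is not vLLM's actual code: the function names are hypothetical, the ignore patterns are modeled on the old split names, and only the two mapping entries come from the PR description above. The idea is that a fused module counts as ignored when every split name it maps to is matched by the ignore list.

```python
import re

# The two entries quoted from the PR; everything else here is illustrative.
PACKED_MODULES_MAPPING = {
    "in_proj_qkvz": ["in_proj_qkv", "in_proj_z"],
    "in_proj_ba": ["in_proj_b", "in_proj_a"],
}

def matches(name: str, ignore: list[str]) -> bool:
    """Exact match, or regex match for entries prefixed with 're:'."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], name):
                return True
        elif entry == name:
            return True
    return False

def fused_layer_is_ignored(layer_name: str, ignore: list[str],
                           mapping=PACKED_MODULES_MAPPING) -> bool:
    """Treat a fused layer as ignored if it matches directly, or if all of
    the split names it was fused from match the ignore list."""
    if matches(layer_name, ignore):
        return True
    prefix, _, proj = layer_name.rpartition(".")
    shards = mapping.get(proj)
    if shards is None:
        return False  # not a fused module, and no direct match
    shard_names = [f"{prefix}.{s}" if prefix else s for s in shards]
    return all(matches(n, ignore) for n in shard_names)

# Hypothetical old-name ignore list (split names only).
old_ignore = [
    r"re:.*linear_attn\.in_proj_qkv$",
    r"re:.*linear_attn\.in_proj_z$",
    r"re:.*linear_attn\.in_proj_b$",
    r"re:.*linear_attn\.in_proj_a$",
]

fused = "model.language_model.layers.3.linear_attn.in_proj_qkvz"
print(fused_layer_is_ignored(fused, old_ignore))              # with the mapping
print(fused_layer_is_ignored(fused, old_ignore, mapping={}))  # without it
```

With the mapping, the old-name list covers the fused module. With an empty mapping it does not, which mirrors the pre-fix behavior described in this thread.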
My 4/18 repro must have been on a pre-#34697 build. I'll close https://github.com/vllm-project/vllm/issues/40252 referencing those PRs. The HF repo still ships the old-name ignore list though, so anyone pinned to a pre-Feb-18 vLLM will still hit silent degradation — happy to submit a small PR updating config.json if you'd like belt-and-braces.
Since this concerns fused layers, the change belongs on the vLLM side. compressed-tensors does not fuse layers in any block; projections such as qkv or gate_up are fused only inside vLLM's attention and MLP layers. If this failure affects a stable release, I would suggest starting a thread in the vLLM Slack to see whether there is interest in cherry-picking the fix, or updating your vLLM version.