Qwen3.5-9B NVFP4 (MLP-only, MSE calibration)

NVFP4-quantized variant of Qwen/Qwen3.5-9B, produced with NVIDIA Model Optimizer using the nvfp4_mlp_only_mse-fp8_cast_kv recipe from PR #1391.

Quantization details

Component Precision Notes
MLP weights (32 layers) NVFP4 (W4A4, block-16, e2m1 / e4m3 scale) quantized
self_attn QKVO (8 layers) BF16 preserved
linear_attn blocks (24 layers) BF16 preserved (Mamba-style hybrid layers)
embed, lm_head, norm, mtp, visual BF16 preserved
KV cache FP8 use_constant_amax: true
Calibration MSE + fp8_scale_sweep: true static MLP weight scales

Checkpoint size: 12.38 GB (vs 19.3 GB BF16, −36%).

Evaluation

Δ columns are absolute percentage-point differences vs Qwen/Qwen3.5-9B BF16.

Metric BF16 This model
MMLU-Pro pass@1 82.89 82.40 (−0.49)
AIME 2025 avg-of-64 67.34 65.36 (−1.98)
AIME 2025 majority@64 90.00 87.78 (−2.22)
LCB pass@3 66.08 68.72 (+2.64)
GPQA avg-of-8 81.06 80.68 (−0.38)
GPQA majority@8 83.84 83.59 (−0.25)
AA-LCR pass@1 [avg-of-3] 56.33 50.67 (−5.66)
AA-LCR pass@3 71.00 66.00 (−5.00)
τ²-bench-telecom pass@1 15.79 12.28 (−3.51)

Eval was run via nemo-evaluator-launcher using the nemo-skills:26.03 container. AA-LCR judge: Qwen3-235B-A22B-Instruct-2507 (Non-Reasoning).

Usage

vLLM

vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-MSE \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --reasoning-parser qwen3 \
  --max-model-len 131072 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --no-enable-prefix-caching

For tool-calling workloads (e.g. τ²-bench), also pass:

--enable-auto-tool-choice --tool-call-parser hermes

Container

Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.

License

Apache 2.0, inherited from Qwen/Qwen3.5-9B.

Acknowledgments

  • Base model: Qwen team
  • Quantization recipe: NVIDIA Model Optimizer PR #1391
Downloads last month
23
Safetensors
Model size
7B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for davidyu-nv/Qwen3.5-9B-NVFP4-MSE

Finetuned
Qwen/Qwen3.5-9B
Quantized
(286)
this model