Model Overview
This repository hosts an NVFP4-quantized version of the Qwen3.6-27B model. Quantization was performed with llm-compressor using a mixed-precision strategy: most linear layers are quantized to NVFP4 to reduce the memory footprint, while selected components stay in BF16 to preserve model quality.
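To give a rough sense of the footprint reduction, the arithmetic can be sketched as below. This is an estimate under stated assumptions, not a measurement of this checkpoint: it takes the nominal 27B parameter count from the model name, assumes the standard NVFP4 layout (4-bit values with one 8-bit scale per 16-element block), and pretends every parameter is quantized. Since several components stay in BF16, the real checkpoint is larger than the NVFP4 figure computed here.

```python
NOMINAL_PARAMS = 27e9  # assumption: taken from the "27B" in the model name


def bits_per_weight_nvfp4(block_size=16, scale_bits=8):
    """Effective bits per weight: 4-bit value plus the amortized block scale."""
    return 4 + scale_bits / block_size


def weight_gib(num_params, bits_per_weight):
    """Convert a parameter count and bit width to GiB of weight storage."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    bf16 = weight_gib(NOMINAL_PARAMS, 16)
    nvfp4 = weight_gib(NOMINAL_PARAMS, bits_per_weight_nvfp4())
    print(f"BF16 weights:  {bf16:.1f} GiB")
    print(f"NVFP4 weights: {nvfp4:.1f} GiB (lower bound, ignores BF16 parts)")
```

The per-tensor global scale that NVFP4 also stores is ignored because it is negligible at this scale.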
Deployment & Inference
This model is intended for local inference on high-end consumer hardware. Local testing and evaluation used the following environment:
- Hardware: 1x NVIDIA RTX 5090
- Inference Engine: vLLM
- KV Cache: FP8
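A launch command matching the environment above might be assembled as follows. This is a minimal sketch, not the exact command used for the evaluations: the helper name `build_vllm_serve_command` is hypothetical, and the optional `max_model_len` argument is an assumption added for illustration, since the context length used in testing is not stated here.

```python
import shlex

# Repository id of this model on the Hub.
MODEL_ID = "sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ"


def build_vllm_serve_command(model_id=MODEL_ID, kv_cache_dtype="fp8",
                             max_model_len=None):
    """Return the argv list for a `vllm serve` call with an FP8 KV cache."""
    argv = ["vllm", "serve", model_id, "--kv-cache-dtype", kv_cache_dtype]
    if max_model_len is not None:
        # Optional cap on context length (illustrative, not from the card).
        argv += ["--max-model-len", str(max_model_len)]
    return argv


if __name__ == "__main__":
    # Print a shell-ready command instead of launching the server.
    print(shlex.join(build_vllm_serve_command()))
```

Building the argument list separately from running it makes it easy to inspect the command before starting a server on the GPU.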
Quantization Details
Different parts of the architecture were treated differently during quantization:
- Quantized to NVFP4: Full attention layers, linear attention layers, and the MLP blocks.
- Retained in BF16 (Untouched): Vision components, MTP, LM head, and embeddings.
- Enhancements: SmoothQuant modifiers were used alongside GPTQ modifiers to improve post-quantization accuracy.
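The split between quantized and untouched components can be illustrated with an ignore list. The sketch below is self-contained and does not call llm-compressor; it only loosely mirrors the convention, used in llm-compressor recipes, of prefixing regular-expression entries with `re:`. The patterns and module names are hypothetical placeholders, not the ones used for this checkpoint.

```python
import re

# Hypothetical ignore list: entries starting with "re:" are regular expressions,
# everything else must match the module name exactly.
IGNORE = ["re:.*visual.*", "re:.*mtp.*", "re:.*embed_tokens$", "lm_head"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if the module should stay in BF16."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False


def precision_plan(module_names, ignore=IGNORE):
    """Map each module name to the precision it would end up in."""
    return {name: "BF16" if is_ignored(name, ignore) else "NVFP4"
            for name in module_names}


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    names = [
        "model.language_model.layers.0.self_attn.q_proj",
        "model.language_model.layers.1.linear_attn.in_proj",
        "model.language_model.layers.0.mlp.gate_proj",
        "model.visual.blocks.0.attn.qkv",
        "mtp.layers.0.mlp.up_proj",
        "model.language_model.embed_tokens",
        "lm_head",
    ]
    for name, precision in precision_plan(names).items():
        print(f"{precision:6s} {name}")
```

Keeping the ignore list explicit makes it straightforward to audit which components were left in BF16.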
Calibration Configuration
Calibration used a custom dataset of 512 samples, each with a sequence length of 8192 tokens.
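The calibration budget implied by these settings can be written down as a small sketch. The key names below follow the keyword arguments that llm-compressor's `oneshot` entry point is understood to accept (`num_calibration_samples`, `max_seq_length`); treat that naming as an assumption and check it against the installed version. The helper `truncate_sample` is hypothetical and only shows how a sample would be capped at the sequence length.

```python
# Calibration settings from this card; key names are assumed to match
# llm-compressor's oneshot keyword arguments.
CALIBRATION = {
    "num_calibration_samples": 512,
    "max_seq_length": 8192,
}


def max_calibration_tokens(cfg=CALIBRATION):
    """Upper bound on tokens seen during calibration (every sample at full length)."""
    return cfg["num_calibration_samples"] * cfg["max_seq_length"]


def truncate_sample(token_ids, cfg=CALIBRATION):
    """Cap one tokenized sample at the configured sequence length."""
    return token_ids[: cfg["max_seq_length"]]


if __name__ == "__main__":
    print(f"At most {max_calibration_tokens():,} calibration tokens")
```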
Evaluation & Benchmarks
| Benchmark | BF16 (Alibaba Cloud) | This Model (Local RTX 5090) | Delta |
|---|---|---|---|
| MMLU-Pro + GPQA Diamond ¹ | 84.5 | 85.0 | +0.5 |
| SWE-bench Verified (Easy) ² | 58.0 | 57.0 | -1.0 |
| MMMU Pro ³ | 80.0 | 84.0 | +4.0 |
- Environment Note: The baseline BF16 scores were obtained via the official online deployment on Alibaba Cloud. The NVFP4 scores were obtained locally using the vLLM setup described above.
- MMLU-Pro + GPQA Diamond: This score is based on a 200-question subset consisting of 160 MMLU-Pro problems and 40 GPQA Diamond problems.
- SWE-bench Verified (Easy): Evaluated on a 100-problem subset randomly sampled from instances whose estimated resolution time is under 15 minutes. Oracle retrieval was used for this benchmark.
- MMMU Pro: Evaluated on a subset of 100 random samples.
- Long Context Warning: The model's long-context capabilities remain uncertain and should be treated with caution. Because the sensitive linear attention layers were quantized to NVFP4, quality may degrade at long sequence lengths.
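Because the subsets are small, it helps to compare each delta against the sampling noise of the score itself. The sketch below is a rough check under assumptions that are not confirmed by this card: each score is treated as the percentage of correct answers on the stated subset size, questions are treated as independent, and a normal approximation gives the 95% margin.

```python
import math

# (benchmark, subset size, BF16 score, NVFP4 score), copied from the table and notes.
RESULTS = [
    ("MMLU-Pro + GPQA Diamond", 200, 84.5, 85.0),
    ("SWE-bench Verified (Easy)", 100, 58.0, 57.0),
    ("MMMU Pro", 100, 80.0, 84.0),
]


def margin_95(score_pct, n):
    """Approximate 95% margin (in points) for an accuracy measured on n items."""
    p = score_pct / 100.0
    return 1.96 * math.sqrt(p * (1.0 - p) / n) * 100.0


def summarize(results=RESULTS):
    """Return (name, delta, margin, within_noise) for each benchmark."""
    rows = []
    for name, n, bf16, nvfp4 in results:
        delta = nvfp4 - bf16
        margin = margin_95(nvfp4, n)
        rows.append((name, delta, margin, abs(delta) <= margin))
    return rows


if __name__ == "__main__":
    for name, delta, margin, within in summarize():
        print(f"{name}: delta {delta:+.1f}, margin +/-{margin:.1f}, "
              f"within noise: {within}")
```

Under these assumptions all three deltas fall inside the margin of a single score, so the table is best read as "no clear regression on these subsets" rather than as evidence that quantization improved any benchmark.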
05/14/26 update: added vision calibration; fixed the vLLM issue "[compressed_tensors_w4a4_nvfp4.py:97] In NVFP4 linear, the global scale for input or weight are different for parallel layers (e.g. q_proj, k_proj, v_proj)"; updated benchmarks.
Model tree for sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ
Base model
Qwen/Qwen3.6-27B