Model Overview

This repository hosts an NVFP4-quantized version of the Qwen3.6-27B model. Quantization was performed with llm-compressor using a mixed-precision strategy: most linear layers are stored in NVFP4 to reduce the memory footprint, while sensitive components remain in BF16 to preserve model quality.
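
As a rough illustration of the memory saving, the sketch below estimates weight storage under NVFP4, which stores 4-bit values in blocks of 16 with one FP8 scale per block. It assumes, for simplicity, that all 27B parameters are quantized. Some components of this model stay in BF16, so the real footprint is larger. The function and constant names are illustrative only.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given average bit width."""
    return n_params * bits_per_weight / 8 / 2**30


# NVFP4: 4-bit values plus one 8-bit (FP8 E4M3) scale per 16-element block.
# The per-tensor FP32 scale is negligible and is ignored here.
NVFP4_BITS = 4 + 8 / 16  # 4.5 bits per weight on average
BF16_BITS = 16

# Nominal parameter count taken from the model name (assumption).
N_PARAMS = 27e9

if __name__ == "__main__":
    print(f"BF16 : {weight_gib(N_PARAMS, BF16_BITS):.1f} GiB")
    print(f"NVFP4: {weight_gib(N_PARAMS, NVFP4_BITS):.1f} GiB (if every weight were quantized)")
```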

Deployment & Inference

This model is intended for local inference on high-end consumer hardware. Local testing and evaluation used the following environment:

Hardware: 1x NVIDIA RTX 5090
Inference Engine: vLLM
KV Cache: FP8
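
A minimal sketch of how this environment might be reproduced with vLLM's offline Python API. The `kv_cache_dtype="fp8"` setting mirrors the list above. The context length, sampling settings, and prompt are placeholders that this card does not specify, and `engine_kwargs` and `main` are illustrative names.

```python
MODEL_ID = "sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ"


def engine_kwargs(max_model_len: int = 32768) -> dict:
    """Keyword arguments for vllm.LLM that mirror the environment above."""
    return {
        "model": MODEL_ID,
        "kv_cache_dtype": "fp8",         # FP8 KV cache, as listed above
        "max_model_len": max_model_len,  # assumption: not stated in this card
        "tensor_parallel_size": 1,       # single GPU
    }


def main() -> None:
    # Imported lazily so the configuration can be inspected without a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs())
    params = SamplingParams(temperature=0.7, max_tokens=256)  # placeholder values
    outputs = llm.generate(["Explain NVFP4 quantization in one paragraph."], params)
    print(outputs[0].outputs[0].text)
```

Calling `main()` requires vLLM and a GPU with NVFP4 support. Lowering `max_model_len` reduces KV cache memory if the default placeholder does not fit.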

Quantization Details

Different parts of the architecture were handled differently, with SmoothQuant and GPTQ modifiers applied during quantization:

Quantized to NVFP4: Full attention layers, linear attention layers, and the MLP blocks.
Retained in BF16 (untouched): Vision components, MTP, lm_head, and embeddings.
Enhancements: SmoothQuant and GPTQ modifiers were used together to improve post-quantization accuracy.
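
The layer split above could be expressed as an llm-compressor recipe roughly as follows. This is a sketch under assumptions, not the recipe used for this repository: the module-name patterns in `IGNORE` are guesses that should be checked against `model.named_modules()`, the smoothing strength is a placeholder, and SmoothQuant may need architecture-specific mappings that are not shown. `is_ignored` is a simplified stand-in for the library's name matching, included only to make the patterns easy to check.

```python
import re

# Modules left in BF16. Patterns are assumptions about the checkpoint's naming.
# Embeddings are not Linear modules, so targets="Linear" already skips them.
IGNORE = [
    "lm_head",
    "re:.*visual.*",  # vision components
    "re:.*mtp.*",     # multi-token prediction (MTP) modules
]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Simplified matcher: exact name, or a regex when prefixed with 're:'."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


def build_recipe():
    """SmoothQuant followed by GPTQ with the NVFP4 preset scheme."""
    # Imported lazily so the patterns can be tested without llm-compressor.
    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

    return [
        SmoothQuantModifier(smoothing_strength=0.8),  # placeholder strength
        GPTQModifier(targets="Linear", scheme="NVFP4", ignore=IGNORE),
    ]
```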

Calibration Configuration

Calibration used a custom dataset of 512 samples, each with a sequence length of 8192 tokens.
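
A sketch of how these two settings might be passed to llm-compressor's `oneshot` entry point. The model, dataset, and recipe objects are supplied by the caller and are not described in this card. `run_calibration` and `max_calibration_tokens` are illustrative names.

```python
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 8192


def max_calibration_tokens() -> int:
    """Upper bound on tokens seen during calibration (every sample at full length)."""
    return NUM_CALIBRATION_SAMPLES * MAX_SEQUENCE_LENGTH


def run_calibration(model, dataset, recipe) -> None:
    """Run one-shot calibration with the sample count and length stated above."""
    # Imported lazily so the constants can be inspected without llm-compressor.
    from llmcompressor import oneshot

    oneshot(
        model=model,
        dataset=dataset,  # caller-provided custom calibration set
        recipe=recipe,    # caller-provided list of modifiers
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )
```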

Evaluation & Benchmarks

Benchmark	BF16 (Alibaba Cloud)	This Model (Local RTX 5090)	Delta
MMLU-Pro + GPQA Diamond ¹	84.5	85.0	+0.5
SWE-bench Verified (Easy) ²	58.0	57.0	-1.0
MMMU Pro ³	80.0	84.0	+4.0

Environment Note: The baseline BF16 scores were obtained via the official online deployment on Alibaba Cloud. The NVFP4 scores were obtained locally using the vLLM setup described above.

¹ MMLU-Pro + GPQA Diamond: Scored on a 200-question subset consisting of 160 MMLU-Pro problems and 40 GPQA Diamond problems.
² SWE-bench Verified (Easy): Evaluated on a 100-problem subset chosen at random from instances whose difficulty annotation (estimated resolution time) is under 15 minutes. Oracle retrieval was used for this benchmark.
³ MMMU Pro: Evaluated on a random subset of 100 samples.
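
Because these subsets are small, it can help to gauge the sampling noise in each score. The sketch below computes a binomial standard error for the NVFP4 scores in the table, assuming each score is the percentage of independently sampled problems answered correctly. That is a simplification, and `standard_error_pts` is an illustrative helper, not part of any evaluation harness.

```python
import math


def standard_error_pts(score_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# (NVFP4 score from the table, subset size from the notes above)
RESULTS = {
    "MMLU-Pro + GPQA Diamond": (85.0, 200),
    "SWE-bench Verified (Easy)": (57.0, 100),
    "MMMU Pro": (84.0, 100),
}

if __name__ == "__main__":
    for name, (score, n) in RESULTS.items():
        se = standard_error_pts(score, n)
        print(f"{name}: {score:.1f} +/- {se:.1f} pts (1 s.e., n={n})")
```

Under this assumption, one standard error is a few points for each benchmark, so the deltas in the table are best read with that margin in mind.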

Long Context Warning: The model's long-context capabilities remain uncertain and should be treated with caution. Because the sensitive linear attention layers were quantized to NVFP4, users may see degraded quality at long sequence lengths.

05/14/26 update: added vision calibration; fixed the vLLM issue "[compressed_tensors_w4a4_nvfp4.py:97] In NVFP4 linear, the global scale for input or weight are different for parallel layers (e.g. q_proj, k_proj, v_proj)"; updated benchmarks.
