Model Overview

This repository hosts an NVFP4-quantized version of the Qwen3.6-27B model. Quantization was performed with llm-compressor using a mixed-precision strategy that reduces the memory footprint while aiming to preserve model quality.
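
As a rough illustration of where the memory saving comes from, the sketch below compares bits per weight. It assumes the standard NVFP4 layout (4-bit values with one 8-bit scale per block of 16 weights) and ignores per-tensor global scales. The 27e9 parameter count is a nominal figure taken from the model name, and the helper names are hypothetical. Since several components stay in BF16, the real checkpoint size falls between the two printed numbers.

```python
def nvfp4_bits_per_weight(block_size: int = 16, scale_bits: int = 8) -> float:
    """Effective bits per weight: 4-bit value plus the amortized block scale."""
    return 4 + scale_bits / block_size


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes for a given parameter count."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count read off the model name, not measured.
NOMINAL_PARAMS = 27e9

print(f"BF16:  {weight_gb(NOMINAL_PARAMS, 16):.1f} GB")
print(
    f"NVFP4: {weight_gb(NOMINAL_PARAMS, nvfp4_bits_per_weight()):.1f} GB "
    "(lower bound, if every weight were quantized)"
)
```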

Deployment & Inference

This model is intended for local inference on high-end consumer hardware. Local testing and evaluation used the following environment:

  • Hardware: 1x NVIDIA RTX 5090
  • Inference Engine: vLLM
  • KV Cache: FP8
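
A minimal sketch of loading the model with vLLM's offline Python API under the environment above. The repository ID is assumed to be sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ, and the `max_model_len`, `gpu_memory_utilization`, and sampling values are illustrative placeholders, not the settings used for the evaluations. `build_engine_kwargs` and `main` are hypothetical helper names.

```python
# Assumption: repository ID of this model on the Hugging Face Hub.
MODEL_ID = "sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ"


def build_engine_kwargs(
    max_model_len: int = 32768,            # placeholder, tune to your VRAM
    gpu_memory_utilization: float = 0.90,  # placeholder
) -> dict:
    """Collect vLLM engine arguments, including the FP8 KV cache."""
    return {
        "model": MODEL_ID,
        "kv_cache_dtype": "fp8",
        "max_model_len": max_model_len,
        "gpu_memory_utilization": gpu_memory_utilization,
    }


def main() -> None:
    # Imported lazily so the helper above can be inspected without a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs())
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Explain NVFP4 quantization in two sentences."], params)
    print(outputs[0].outputs[0].text)
```

Call `main()` on a machine with vLLM installed and a GPU that supports NVFP4 kernels.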

Quantization Details

Quantization was applied selectively across the model's architecture, with llm-compressor modifiers used to limit accuracy loss:

  • Quantized to NVFP4: Full attention layers, linear attention layers, and the MLP blocks.
  • Retained in BF16 (Untouched): Vision components, MTP, LM head, and embeddings.
  • Enhancements: SmoothQuant modifiers were used alongside GPTQ modifiers to improve post-quantization accuracy.
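
The split above could be expressed as an llm-compressor recipe along the following lines. This is a sketch under assumptions, not the recipe used for this repository: the ignore patterns and the module names they target are hypothetical guesses at the model's layout, the smoothing strength is illustrative, and it assumes the installed llm-compressor version accepts the NVFP4 preset in `GPTQModifier`. `stays_bf16` is a hypothetical helper that only mimics how exact names and `re:` patterns select modules.

```python
import re

# Hypothetical patterns for the components kept in BF16.
IGNORE_PATTERNS = ["lm_head", "re:.*embed_tokens.*", "re:.*visual.*", "re:.*mtp.*"]


def stays_bf16(module_name: str, patterns=IGNORE_PATTERNS) -> bool:
    """Return True if a module name matches an exact name or a 're:' pattern."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


def build_recipe():
    """SmoothQuant followed by GPTQ with the NVFP4 preset (illustrative values)."""
    # Imported lazily so the pattern check above runs without llm-compressor.
    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

    return [
        SmoothQuantModifier(smoothing_strength=0.8),  # illustrative strength
        GPTQModifier(targets="Linear", scheme="NVFP4", ignore=IGNORE_PATTERNS),
    ]
```

Checking a few module names with `stays_bf16` before a long calibration run is a cheap way to confirm the patterns cover what you intend to leave untouched.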

Calibration Configuration

Calibration used a custom dataset of 512 samples, each with a sequence length of 8192 tokens.
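
A minimal sketch of how these two numbers could be passed to llm-compressor's `oneshot` entry point. The `model`, `dataset`, and `recipe` arguments are placeholders you would supply, the helper names are hypothetical, and the import path assumes a recent llm-compressor release.

```python
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 8192


def build_oneshot_kwargs(model, dataset, recipe) -> dict:
    """Bundle the calibration settings with caller-supplied objects."""
    return {
        "model": model,
        "dataset": dataset,
        "recipe": recipe,
        "max_seq_length": MAX_SEQUENCE_LENGTH,
        "num_calibration_samples": NUM_CALIBRATION_SAMPLES,
    }


def max_calibration_tokens() -> int:
    """Upper bound on tokens seen during calibration (every sample at full length)."""
    return NUM_CALIBRATION_SAMPLES * MAX_SEQUENCE_LENGTH


def run_calibration(model, dataset, recipe):
    # Imported lazily; requires llm-compressor to be installed.
    from llmcompressor import oneshot

    return oneshot(**build_oneshot_kwargs(model, dataset, recipe))
```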

Evaluation & Benchmarks

| Benchmark | BF16 (Alibaba Cloud) | This Model (Local RTX 5090) | Delta |
|---|---|---|---|
| MMLU-Pro + GPQA Diamond ¹ | 84.5 | 85.0 | +0.5 |
| SWE-bench Verified (Easy) | 58.0 | 57.0 | -1.0 |
| MMMU Pro | 80.0 | 84.0 | +4.0 |

  • Environment Note: The baseline BF16 scores were obtained via the official online deployment on Alibaba Cloud. The NVFP4 scores were obtained locally using the vLLM setup described above.
  1. MMLU-Pro + GPQA Diamond: This score is based on a 200-question subset consisting of 160 MMLU-Pro problems and 40 GPQA Diamond problems.
  2. SWE-bench Verified (Easy): Evaluated on a 100-problem subset chosen at random from instances whose estimated resolution time is under 15 minutes, using oracle retrieval.
  3. MMMU Pro: Evaluated on a subset of 100 random samples.
  • Long Context Warning: The model's long-context capabilities have not been verified and should be treated with caution. Because the quantization-sensitive linear attention layers were quantized to NVFP4, users may see degradation at long sequence lengths.
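
For context on the subset sizes in the benchmark notes above, the sketch below estimates the sampling noise of each score. It assumes every question is scored pass/fail and that questions are independent, which is a simplification. The scores and subset sizes are copied from the table and notes, and `standard_error_points` is a hypothetical helper name.

```python
import math


def standard_error_points(score_percent: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


# benchmark: (BF16 score, NVFP4 score, subset size)
RESULTS = {
    "MMLU-Pro + GPQA Diamond": (84.5, 85.0, 200),
    "SWE-bench Verified (Easy)": (58.0, 57.0, 100),
    "MMMU Pro": (80.0, 84.0, 100),
}

for name, (bf16, nvfp4, n) in RESULTS.items():
    delta = nvfp4 - bf16
    se = standard_error_points(bf16, n)
    print(f"{name}: delta {delta:+.1f}, about {se:.1f} points standard error at n={n}")
```

Differences of a few points on subsets of this size may fall within sampling noise, so the deltas are best read as rough parity and not as a precise ranking.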

05/14/26 update: added vision calibration; fixed the vLLM issue "[compressed_tensors_w4a4_nvfp4.py:97] In NVFP4 linear, the global scale for input or weight are different for parallel layers (e.g. q_proj, k_proj, v_proj)"; updated benchmarks.
