Nvidia-Qwen3.6-27B-NVFP4 - GGUF

Quantized GGUF versions of nvidia/Qwen3.6-27B-NVFP4. These were generated using llama.cpp's convert_hf_to_gguf.py (b9859).

  • Nvidia-Qwen3.6-27B-NVFP4-A.gguf - All layers are NVFP4 quantized. This required modifying convert_hf_to_gguf.py, and needs cleaning up before possible upstreaming.
  • Nvidia-Qwen3.6-27B-NVFP4-BF16-Attn.gguf: NVFP4 FFN layers are preserved, while FP8 attention layers are upcasted to BF16. This is the default conversion for BF16 because GGUF files do not support FP8.

Quantizations provided

File Quantization Size
Nvidia-Qwen3.6-27B-NVFP4-A.gguf NVFP4 17.9 GB
Nvidia-Qwen3.6-27B-NVFP4-BF16-Attn.gguf NVFP4 FFN, BF16 attention 28.2 GB

Perplexity test

I tested perplexity using llama-perplexity and Salesforce's wikitext-2-raw-v1.

File Ctx PPL
Nvidia-Qwen3.6-27B-NVFP4-A.gguf 512 7.7540 ± 0.05396
Nvidia-Qwen3.6-27B-NVFP4-BF16-Attn.gguf 512 7.4814 ± 0.05157

Evaluation

The following models were evaluated for a fair comparison of capability, size and speed.

Model Quantization Size Reason
unsloth/Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL 17.9 GB Closest non-NVFP4 in size to NVFP4.
unsloth/Qwen3.6-27B-MTP-GGUF UD-Q6_K_XL 26 GB Closest non-NVFP4 in size to BF16-Attn.
unsloth/Qwen3.6-27B-NVFP4 NVFP41 25.4 GB Alternative NVFP4 quant.

1: unsloth/Qwen3.6-27B-NVFP4 does not provide a GGUF. I used llama.cpp's conversion which passes through Unsloth's NVFP4 tensors.

CodeFault
NVFP4
CodeFault
BF16-Attn
Unsloth
NVFP4
Unsloth
UD-Q4_K_XL
Unsloth
UD-Q6_K_XL
Coding
HumanEval 0.8415 ± 0.0286 0.8354 ± 0.029 0.811 ± 0.0307 0.8354 ± 0.029 0.8537 ± 0.0277
HumanEval+ 0.7866 ± 0.0321 0.7927 ± 0.0318 0.7744 ± 0.0327 0.7805 ± 0.0324 0.7805 ± 0.0324
MBPP 0.006 ± 0.0035!! 0.754 ± 0.0193 0.742 ± 0.0196 0.756 ± 0.0192 0.754 ± 0.0193
MBPP+ 0.0106 ± 0.0053!! 0.8836 ± 0.0165 0.8995 ± 0.0155 0.8968 ± 0.0157 0.8836 ± 0.0165
Instruction
IFEval 0.8447 ± 0.0156 0.841 ± 0.0157 0.8447 ± 0.0156
Knowledge
ARC-Challenge 0.9659 ± 0.0053 0.971 ± 0.0049 0.971 ± 0.0049 0.971 ± 0.0049 0.971 ± 0.0049
MMLU-Pro 0.835 ± 0.0033
STEM & Reasoning
BIG-Bench Hard 0.926 ± 0.003
GPQA Diamond
GSM8K 0.9098 ± 0.0079 0.9083 ± 0.008 0.9158 ± 0.0076
Hendrycks Math

NOTICE: These tests are actively running.

!!: Such a drastic failure suggests something is wrong with the harness, not the model. I still need to investigate.

These evaluations were run using lm_eval. The models were run in instruct (non-thinking) mode with the following parameters in llama-server (b9775):

ctx-size = 32768
cache-type-k = q8_0
cache-type-v = q8_0
top-k = 20
top-p = 0.8
min-p = 0
presence-penalty = 1.5
spec_type = draft-mtp
spec_draft_n_max = 2
chat-template-kwargs = {"enable_thinking":false}

Benchmarks

Benchmarks: Coming after evaluations.

Serving with llama.cpp

It has a max context size of 262,114. This can be served using:

llama-server \
-hf CodeFault/Nvidia-Qwen3.6-27B-NVFP4-GGUF:NVFP4 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--repeat-penalty 1.1 \
--spec-type draft-mtp \
--spec-draft-n-max 2
Downloads last month
920
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for CodeFault/Nvidia-Qwen3.6-27B-NVFP4-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(3)
this model