Qwopus3.6-27B-Coder-NVFP4-MTP-GGUF

NVFP4 (Blackwell native FP4) quantized GGUF of Jackrong/Qwopus3.6-27B-Coder-MTP for llama.cpp with Multi-Token Prediction (MTP) speculative decoding support.

Quantization Details

Attribute	Value
Source	Q8_0 GGUF (28 GB, 8.50 BPW)
Output	Mixed-precision NVFP4 (15 GB, 4.60 BPW)
Size reduction	46% (28 GB → 15 GB)
NVFP4 tensors	311 (attn_q, attn_k, attn_v, attn_qkv, attn_output, ffn_down, ffn_gate, ffn_up)
Q4_K tensors	194 (attn_gate, ssm_alpha, ssm_beta, ssm_out, nextn.eh_proj, token_embd)
Q4_K_S tensors	1 (output.weight)
F32 tensors	360 (norms, biases, SSM state)
Total tensors	866

Tensor Mapping Strategy

Following Unsloth's NVFP4 approach for Qwen3.6-27B hybrid Mamba2-Transformer models:

NVFP4 → 8 large weight tensor patterns that dominate model size and bandwidth (attention projections + FFN weights)
Q4_K → Smaller weights (SSM parameters, MTP head projection, token embeddings) — preserves quality where tensor dimensions are small
F32 → Norms, biases, and SSM state — must remain full precision for numerical stability

This mapping matches the reference Qwen3.6-27B-NVFP4-MTP quantization exactly.

Performance

NVIDIA DGX Spark (GB10, ARM64, 128 GB unified memory)

Config	Decode Speed	Prefill	Draft Acceptance
Baseline (no spec)	14.4 tok/s	166 tok/s	—
MTP nmax=4	25.4 tok/s	150 tok/s	47%

NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)

Expected performance based on the identical Qwopus3.6-27B-v2 NVFP4 architecture:

Config	Decode Speed	vs Q8_0
Baseline (no spec)	~79 tok/s	1.72× faster
MTP nmax=4	~136 tok/s	1.91× faster

Usage

llama-server (recommended)

llama-server \
  --model Qwopus3.6-27B-Coder-MTP-NVFP4.gguf \
  --mmproj mmproj-F32.gguf \
  --chat-template-file prompt.jinja \
  --host 0.0.0.0 --port 8080 \
  -c 262144 -b 512 -ub 512 \
  --flash-attn on \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  --reasoning-budget 0 \
  --jinja

Key flags

--mmproj mmproj-F32.gguf — Required for vision/multimodal support
--spec-type draft-mtp --spec-draft-n-max 4 — Enable MTP speculative decoding (2× speedup)
--flash-attn on — Required for NVFP4 on Blackwell
--reasoning-budget 0 — Disable thinking mode for agentic coding tasks
--chat-template-file prompt.jinja — Use the Qwen3.6 MTP chat template

Reproduction

# Create tensor-type-file
cat > nvfp4-tensor-types.txt << 'TYPES'
attn_q=nvfp4 attn_k=nvfp4 attn_v=nvfp4 attn_qkv=nvfp4 attn_output=nvfp4 ffn_down=nvfp4 ffn_gate=nvfp4 ffn_up=nvfp4
TYPES

# Convert Q8_0 → NVFP4
llama-quantize \
  --allow-requantize \
  --tensor-type-file nvfp4-tensor-types.txt \
  Qwopus3.6-27B-Coder-MTP-Q8_0.gguf \
  Qwopus3.6-27B-Coder-MTP-NVFP4.gguf \
  Q4_K

Model Architecture

Qwopus3.6-27B-Coder is a LoRA/SFT fine-tune of Qwopus3.6-27B-v2 (itself built on Qwen3.6-27B), specialized for agentic coding with tool calling, debugging, and repository-level tasks. It retains the hybrid Mamba2-Transformer architecture with SSM layers interleaved with full attention layers every 4 blocks, plus an MTP head at the final layer.

Architecture: Hybrid Mamba2-Transformer (qwen35)
Parameters: 27B dense
Layers: 65 (64 base + 1 MTP)
Context: 262,144 tokens native
Vision: Yes (mmproj included)

Credits

Jackrong — Original Qwopus3.6-27B-Coder model and GGUF quantizations
Alibaba/Qwen — Qwen3.6-27B base model
Unsloth — Fine-tuning framework and reference NVFP4 quantization approach
llama.cpp — Inference engine with NVFP4 support

License

Apache-2.0 — same as the original model.

Downloads last month: 10,854

GGUF

Model size

0.5B params

Architecture

clip

Hardware compatibility

4-bit

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support