Qwopus3.6-27B-Coder-NVFP4-MTP-GGUF

NVFP4 (Blackwell native FP4) quantized GGUF of Jackrong/Qwopus3.6-27B-Coder-MTP for llama.cpp with Multi-Token Prediction (MTP) speculative decoding support.

Quantization Details

Attribute Value
Source Q8_0 GGUF (28 GB, 8.50 BPW)
Output Mixed-precision NVFP4 (15 GB, 4.60 BPW)
Size reduction 46% (28 GB β†’ 15 GB)
NVFP4 tensors 311 (attn_q, attn_k, attn_v, attn_qkv, attn_output, ffn_down, ffn_gate, ffn_up)
Q4_K tensors 194 (attn_gate, ssm_alpha, ssm_beta, ssm_out, nextn.eh_proj, token_embd)
Q4_K_S tensors 1 (output.weight)
F32 tensors 360 (norms, biases, SSM state)
Total tensors 866

Tensor Mapping Strategy

Following Unsloth's NVFP4 approach for Qwen3.6-27B hybrid Mamba2-Transformer models:

  • NVFP4 β†’ 8 large weight tensor patterns that dominate model size and bandwidth (attention projections + FFN weights)
  • Q4_K β†’ Smaller weights (SSM parameters, MTP head projection, token embeddings) β€” preserves quality where tensor dimensions are small
  • F32 β†’ Norms, biases, and SSM state β€” must remain full precision for numerical stability

This mapping matches the reference Qwen3.6-27B-NVFP4-MTP quantization exactly.

Performance

NVIDIA DGX Spark (GB10, ARM64, 128 GB unified memory)

Config Decode Speed Prefill Draft Acceptance
Baseline (no spec) 14.4 tok/s 166 tok/s β€”
MTP nmax=4 25.4 tok/s 150 tok/s 47%

NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)

Expected performance based on the identical Qwopus3.6-27B-v2 NVFP4 architecture:

Config Decode Speed vs Q8_0
Baseline (no spec) ~79 tok/s 1.72Γ— faster
MTP nmax=4 ~136 tok/s 1.91Γ— faster

Usage

llama-server (recommended)

llama-server \
  --model Qwopus3.6-27B-Coder-MTP-NVFP4.gguf \
  --mmproj mmproj-F32.gguf \
  --chat-template-file prompt.jinja \
  --host 0.0.0.0 --port 8080 \
  -c 262144 -b 512 -ub 512 \
  --flash-attn on \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  --reasoning-budget 0 \
  --jinja

Key flags

  • --mmproj mmproj-F32.gguf β€” Required for vision/multimodal support
  • --spec-type draft-mtp --spec-draft-n-max 4 β€” Enable MTP speculative decoding (2Γ— speedup)
  • --flash-attn on β€” Required for NVFP4 on Blackwell
  • --reasoning-budget 0 β€” Disable thinking mode for agentic coding tasks
  • --chat-template-file prompt.jinja β€” Use the Qwen3.6 MTP chat template

Reproduction

# Create tensor-type-file
cat > nvfp4-tensor-types.txt << 'TYPES'
attn_q=nvfp4 attn_k=nvfp4 attn_v=nvfp4 attn_qkv=nvfp4 attn_output=nvfp4 ffn_down=nvfp4 ffn_gate=nvfp4 ffn_up=nvfp4
TYPES

# Convert Q8_0 β†’ NVFP4
llama-quantize \
  --allow-requantize \
  --tensor-type-file nvfp4-tensor-types.txt \
  Qwopus3.6-27B-Coder-MTP-Q8_0.gguf \
  Qwopus3.6-27B-Coder-MTP-NVFP4.gguf \
  Q4_K

Model Architecture

Qwopus3.6-27B-Coder is a LoRA/SFT fine-tune of Qwopus3.6-27B-v2 (itself built on Qwen3.6-27B), specialized for agentic coding with tool calling, debugging, and repository-level tasks. It retains the hybrid Mamba2-Transformer architecture with SSM layers interleaved with full attention layers every 4 blocks, plus an MTP head at the final layer.

  • Architecture: Hybrid Mamba2-Transformer (qwen35)
  • Parameters: 27B dense
  • Layers: 65 (64 base + 1 MTP)
  • Context: 262,144 tokens native
  • Vision: Yes (mmproj included)

Credits

  • Jackrong β€” Original Qwopus3.6-27B-Coder model and GGUF quantizations
  • Alibaba/Qwen β€” Qwen3.6-27B base model
  • Unsloth β€” Fine-tuning framework and reference NVFP4 quantization approach
  • llama.cpp β€” Inference engine with NVFP4 support

License

Apache-2.0 β€” same as the original model.

Downloads last month
10,854
GGUF
Model size
0.5B params
Architecture
clip
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support