Qwopus3.6-27B-Coder-NVFP4-MTP-GGUF
NVFP4 (Blackwell native FP4) quantized GGUF of Jackrong/Qwopus3.6-27B-Coder-MTP for llama.cpp with Multi-Token Prediction (MTP) speculative decoding support.
Quantization Details
| Attribute | Value |
|---|---|
| Source | Q8_0 GGUF (28 GB, 8.50 BPW) |
| Output | Mixed-precision NVFP4 (15 GB, 4.60 BPW) |
| Size reduction | 46% (28 GB β 15 GB) |
| NVFP4 tensors | 311 (attn_q, attn_k, attn_v, attn_qkv, attn_output, ffn_down, ffn_gate, ffn_up) |
| Q4_K tensors | 194 (attn_gate, ssm_alpha, ssm_beta, ssm_out, nextn.eh_proj, token_embd) |
| Q4_K_S tensors | 1 (output.weight) |
| F32 tensors | 360 (norms, biases, SSM state) |
| Total tensors | 866 |
Tensor Mapping Strategy
Following Unsloth's NVFP4 approach for Qwen3.6-27B hybrid Mamba2-Transformer models:
- NVFP4 β 8 large weight tensor patterns that dominate model size and bandwidth (attention projections + FFN weights)
- Q4_K β Smaller weights (SSM parameters, MTP head projection, token embeddings) β preserves quality where tensor dimensions are small
- F32 β Norms, biases, and SSM state β must remain full precision for numerical stability
This mapping matches the reference Qwen3.6-27B-NVFP4-MTP quantization exactly.
Performance
NVIDIA DGX Spark (GB10, ARM64, 128 GB unified memory)
| Config | Decode Speed | Prefill | Draft Acceptance |
|---|---|---|---|
| Baseline (no spec) | 14.4 tok/s | 166 tok/s | β |
| MTP nmax=4 | 25.4 tok/s | 150 tok/s | 47% |
NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM)
Expected performance based on the identical Qwopus3.6-27B-v2 NVFP4 architecture:
| Config | Decode Speed | vs Q8_0 |
|---|---|---|
| Baseline (no spec) | ~79 tok/s | 1.72Γ faster |
| MTP nmax=4 | ~136 tok/s | 1.91Γ faster |
Usage
llama-server (recommended)
llama-server \
--model Qwopus3.6-27B-Coder-MTP-NVFP4.gguf \
--mmproj mmproj-F32.gguf \
--chat-template-file prompt.jinja \
--host 0.0.0.0 --port 8080 \
-c 262144 -b 512 -ub 512 \
--flash-attn on \
--spec-type draft-mtp --spec-draft-n-max 4 \
--reasoning-budget 0 \
--jinja
Key flags
--mmproj mmproj-F32.ggufβ Required for vision/multimodal support--spec-type draft-mtp --spec-draft-n-max 4β Enable MTP speculative decoding (2Γ speedup)--flash-attn onβ Required for NVFP4 on Blackwell--reasoning-budget 0β Disable thinking mode for agentic coding tasks--chat-template-file prompt.jinjaβ Use the Qwen3.6 MTP chat template
Reproduction
# Create tensor-type-file
cat > nvfp4-tensor-types.txt << 'TYPES'
attn_q=nvfp4 attn_k=nvfp4 attn_v=nvfp4 attn_qkv=nvfp4 attn_output=nvfp4 ffn_down=nvfp4 ffn_gate=nvfp4 ffn_up=nvfp4
TYPES
# Convert Q8_0 β NVFP4
llama-quantize \
--allow-requantize \
--tensor-type-file nvfp4-tensor-types.txt \
Qwopus3.6-27B-Coder-MTP-Q8_0.gguf \
Qwopus3.6-27B-Coder-MTP-NVFP4.gguf \
Q4_K
Model Architecture
Qwopus3.6-27B-Coder is a LoRA/SFT fine-tune of Qwopus3.6-27B-v2 (itself built on Qwen3.6-27B), specialized for agentic coding with tool calling, debugging, and repository-level tasks. It retains the hybrid Mamba2-Transformer architecture with SSM layers interleaved with full attention layers every 4 blocks, plus an MTP head at the final layer.
- Architecture: Hybrid Mamba2-Transformer (qwen35)
- Parameters: 27B dense
- Layers: 65 (64 base + 1 MTP)
- Context: 262,144 tokens native
- Vision: Yes (mmproj included)
Credits
- Jackrong β Original Qwopus3.6-27B-Coder model and GGUF quantizations
- Alibaba/Qwen β Qwen3.6-27B base model
- Unsloth β Fine-tuning framework and reference NVFP4 quantization approach
- llama.cpp β Inference engine with NVFP4 support
License
Apache-2.0 β same as the original model.
- Downloads last month
- 10,854
4-bit