Qwen3.6-27B-MTP ROCmFP4_FAST — GGUF

ROCmFP4_FAST quant of Qwen/Qwen3.6-27B (Apache 2.0), produced via charlie12345/ROCmFPX. Benchmarked against the mesh's agent-production regression suite on an RDNA4 RX 9060 XT (16 GB). Context ceiling: 131k at q4_0 KV with flat ~17.2 t/s throughput. Cross-format parity with TQ3_4S (Blackwell) on IFEval: both within 2pp.

MTP speculative decoding is a massive win on this card class: +91% throughput (32.93 t/s) at 64k q4_0 KV. The full model + MTP draft head fits in VRAM at 64k; no PCIe spill.

File

File Size Quant BPW
Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf 14.5 GB Q4_0_ROCMFP4_FAST ~4.25 bpw

NOT a stock llama.cpp quant

ROCmFP4_FAST is a custom weight format from charlie12345/ROCmFPX (Q4_0_ROCMFP4 preset). Stock llama.cpp will exit with unknown quantization at load time. The system_fingerprint of a correctly-served ROCmFPX GGUF is b1-11d76c2 — a different fingerprint means the wrong binary loaded the file.

Scope of these benchmarks — read this first

These numbers are a light baseline, not a thorough quant evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

  • Harness scope is bounded. The numbers below come from llama-bench (KV context ladder), lm-eval-harness IFEval n=50, and Hermes ctx_scaling_bench (KV precision sweep). That's a regression suite, not a quality benchmark.
  • Sample sizes are small. Throughput numbers are single-rep on a single GPU. IFEval is n=50. None are powered for multi-seed significance.
  • No perplexity / wikitext / MMLU. Those are upstream's territory. For a rigorous view, see charlie12345/ROCmFPX's own validation ladder.
  • Single GPU class (RDNA4 16 GB). All measurements on an RX 9060 XT (gfx1201), ROCm 7.2.3. No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan. Cross-hardware generalization is NOT implied. The companion TQ3_4S quant for Blackwell is in a separate repo.
  • No human eval. "IFEval parity and flat throughput" is not a quality verdict on this specific quant for every use case.

What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

For a rigorous view, see Qwen/Qwen3.6-27B (parent model), charlie12345/ROCmFPX (quantizer), and the meshina-benches repo for the full raw bench reports.

What we measured

Context ceiling (q4_0 KV, AMD RDNA4 RX 9060 XT)

Throughput is flat across the entire tested range. KV precision affects VRAM, not token speed.

KV type Ctx TG tok/s PP tok/s VRAM (MiB) Status
f16 32768 17.7 102.7 16174 OK
q4_0 65536 17.26 100.54 15282 OK
q4_0 98304 17.22 100.11 15857 OK
q4_0 131072 17.23 99.93 16282 OK
q8_0 65536 17.36 101.18 16236 OK

Ceilings: f16 KV caps at 32k, q8_0 at 64k, q4_0 reaches 131k (OOM at 152k — <200 MiB headroom). Asymmetric q8-K + q4-V fails at all ctx sizes.

IFEval n=50 — ROCmFP4_FAST (RDNA4)

Metric ROCmFP4_FAST (AMD) TQ3_4S (Blackwell) Δ
prompt_level_loose 0.32 ± 0.067 0.34 ± 0.068 -0.02
inst_level_loose 0.487 0.474 +0.013
inst_level_strict 0.487 0.461 +0.026

Cross-format parity holds within 2pp on all metrics.

MTP speculative decoding — the RDNA4 win

MTP is a massive throughput win on AMD RDNA4. The full model + 3 GB MTP draft head fits in VRAM at 64k, enabling on-GPU draft verification with no PCIe spill.

Config Ctx TG t/s vs MTP-OFF
MTP-OFF 32k-131k 17.2 baseline
MTP-ON n_max=3 (recommended) 64k 32.93 +91%
MTP-ON n_max=6 32k 28.92 +68% (worse)

Recommendation for AMD RDNA4: use MTP-ON with n_max=3 for contexts ≤64k. For >64k, drop to MTP-OFF (q4_0 KV reaches 131k).

Critical: Do NOT set GGML_HIP_ENABLE_UNIFIED_MEMORY=1 on discrete AMD cards. Charlie's ROCmFPX scripts default to this flag (correct for Strix Halo), but on discrete RDNA4 it moves the ENTIRE model to system RAM (30× regression — 0.97 t/s). The ROCmFP4_FAST quant + draft head fits in 16 GB VRAM at 64k without unified memory.

GSM8K note

AMD ROCmFP4_FAST scores 0.02 strict (1/50) on GSM8K. This is a genuine quality floor on the 27B model at this bit depth on 16 GB VRAM — the model is too tight for sustained eval without prompt cache operations crashing. Do not use this quant for math reasoning on 16 GB cards. The Blackwell TQ3_4S companion achieves 0.955 GSM8K at the same bit depth.

Quick start

# Build charlie12345/ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
make -j$(nproc)

# Serve (MTP-OFF, direct decode)
./bin/llama-server \
  -m /path/to/Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf \
  --port 8081 \
  -ngl 99 \
  -c 32768 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --cache-ram 0 \
  --no-cache-prompt

# Serve (MTP-ON, n_max=3 — recommended for ≤64k ctx)
./bin/llama-server \
  -m /path/to/Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf \
  --port 8081 \
  -ngl 99 \
  -c 65536 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --speculative-model m \
  --spec-draft-n-max 3 \
  --cache-ram 0 \
  --no-cache-prompt

Do NOT set GGML_HIP_ENABLE_UNIFIED_MEMORY=1 in the environment on discrete AMD cards.

Reproduce the quant

# From unsloth/Qwen3.6-27B-MTP-GGUF BF16 source (SHA256-validated, single-step)
/path/to/llama-quantize \
  --allow-requantize \
  /path/to/Qwen3.6-27B-MTP-BF16.gguf \
  /path/to/Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf \
  Q4_0_ROCMFP4_FAST

Files in this repo

File Purpose
Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf The quantized model (LFS-tracked, 14.5 GB)
README.md This file

Full raw bench reports, summary markdowns, and reproduction scripts are at github.com/maczzgit/meshina-benches in raw/benchmarks/2026-07-04-context-push-and-parity-v2/.

What's NOT in this repo (caveats)

  • Stock llama.cpp will not load this file. ROCmFP4_FAST is a custom weight format unique to charlie12345/ROCmFPX. Use that fork's llama-server.
  • No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200). Vulkan path on RDNA4 has a known upstream regression in charlie12345/ROCmFPX — we did not test it.
  • 131k ctx is the practical ceiling on this hardware due to VRAM. 256K requires larger VRAM or smaller model.
  • GSM8K is near-zero (0.02). The 27B model at 4-bit is too tight on 16 GB RDNA4 for sustained math reasoning. Use the companion TQ3_4S Blackwell quant for math tasks.
  • No vision/multimodal test. This variant is text-only.
  • No quality benchmark (perplexity, MMLU). The quant passes IFEval parity; whether it's "the best ROCmFP4 quant" needs upstream validation.
  • 16 GB minimum VRAM. Does not fit on smaller cards. The mesh's 16 GB card runs it with ~150 MiB headroom at 131k.

Provenance

License

The model weights are derived from Qwen/Qwen3.6-27B (Apache 2.0). The ROCmFP4 quant format is provided by charlie12345/ROCmFPX (MIT). This repo is a derivative quant — the Apache 2.0 license of the parent applies to the model weights; the quantizer tooling is separately licensed.

Downloads last month
-
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(549)
this model

Collection including maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF