Qwen3.6-27B-MTP — ROCmFP4 STRIX (imatrix + f16 embeddings)

Experimental AMD Strix Halo (gfx1151) quant of Qwen3.6-27B (dense, with the built-in MTP / next-token-prediction head) in the custom ROCmFP4 4-bit format — tuned for high MTP draft acceptance and long-context, multi-turn coding use.

⚠️ Ignore HuggingFace's auto-detected quant badge ("F16" / 16-bit) — it's wrong. HF's parser only knows the standard GGUF quant types, so it can't read the custom ROCmFP4 format. It ends up "seeing" only the genuinely-f16 token embeddings and mislabels the whole file as 16-bit. These are ~4.8 bpw 4-bit ROCmFP4 files, not 16-bit. Pick a file by its name in the Files and versions tab (see the two-files table below).

Requires the ROCmFP4 fork (public) — not stock llama.cpp

This file uses the ROCmFP4 tensor types (q4_0_rocmfp4, q4_0_rocmfp4_fast). Stock llama.cpp, LM Studio, Ollama, Jan, koboldcpp, etc. cannot load it. Build and run it with the public fork charlie12345/rocmfp4-llama:

git clone https://github.com/charlie12345/rocmfp4-llama
cd rocmfp4-llama && git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

Two files in this repo (pick your trade-off)

File size output head best for
…-STRIX-imatrix-embF16.gguf 16.5 GB ROCmFP4 4-bit fastest — the original daily driver
…-STRIX-imatrix-embF16-headQ6.gguf 16.9 GB Q6_K a notch more faithful — trades ~5–7% decode for it

The two are identical except one tensor: same STRIX recipe, same f16 embeddings, same imatrix, same MTP head — they differ only in the output head (output.weight). Most of this card describes that shared recipe; the section right below is just about the one change.

The Q6-head variant — a step up (experimental)

The f16-embeddings note further down is the change I felt the most: full-precision token embeddings made the model follow instructions noticeably better. This variant does the same thing to the other end of the model — the output head that turns the final hidden state into the next-token choice — raising it from the 4-bit ROCmFP4 format to standard Q6_K, and leaving everything else untouched.

What I observed: a further step up in instruction-following — beyond what the f16 embeddings already gave. Subjectively it's more consistent at actually doing what it's told: reaching for the specific tool I asked for, and sticking to the rules/format of a task, more reliably than the f16-embeddings build alone. The embedding is the input side; the output head is the output side — sharpening both beats sharpening either.

How I checked it wasn't just a vibe. Two measurements, both on held-out text the model never trained or was calibrated on:

  • Perplexity — how well it predicts held-out text (lower is better). The Q6 head improved both code and prose, where the imatrix on its own only helped code:

    Test set daily (4-bit head) Q6 head
    held-out code 1.8596 1.8550
    held-out prose 5.7165 5.6761
  • KL divergence vs the original BF16 model — how closely its word-probabilities track the full-precision model it's a copy of (lower = more faithful). The Q6 head was closer to BF16 on every measure (mean ≈ 0.0369 → 0.0345, about 6% nearer the original). It still agrees with BF16's top word ~96% of the time either way — so the head mostly sharpens confidence on the same choice rather than flipping it, which is exactly what "follows the rules more consistently" feels like.

These are small but consistent gains — not night-and-day, but they move the right way across two different tests and two text types, which matches what I felt. Small internal checks, not formal benchmarks; reproduce before citing.

The cost. The Q6 head steps off the tuned 4-bit kernel for that one tensor, so decode is ~5–7% slower at short context (a couple tokens/sec on this hardware), and the gap shrinks at long context (the head is a fixed per-token cost that gets diluted as the KV cache grows). Size grows ~0.4 GB. For me the quality is worth it; if you want maximum speed, use the original file above.

Build it yourself — same as the daily driver, with one extra flag (--output-tensor-type q6_K):

llama-quantize \
  --imatrix qwen3.6-27b-code.imatrix \
  --token-embedding-type f16 \
  --output-tensor-type q6_K \
  Qwen3.6-27B-BF16-00001-of-00002.gguf \
  Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf \
  Q4_0_ROCMFP4_STRIX

Part 1 — The model

What this is

  • Base: unsloth/Qwen3.6-27B-MTP-GGUF BF16, pinned at revision 5cb35eb3dcbf52dbce5f87dbc64df6aaffadcace. It carries the nextn_predict_layers=1 MTP head, so self-speculative draft-MTP survives quantization.
  • Format: ROCmFP4 — a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block. Tensor-aware: sensitive attention K/V on the dual-scale q4_0_rocmfp4, the bulk (FFN, lm-head) on the faster single-scale q4_0_rocmfp4_fast.
  • This variant (STRIX-imatrix-embF16):
    • f16 token embeddings (full precision — it's a lookup, so ~zero decode cost).
    • code-calibrated importance matrix (imatrix) applied to all 496 quantizable tensors.
value
File Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf
Size / bpw 16.5 GB / 4.82 bpw
token_embd F16
attention K/V (+ fused QKV) q4_0_rocmfp4 (dual-scale)
FFN, lm-head, rest q4_0_rocmfp4_fast
MTP head preserved (blk.64.nextn.*)

How it was built (reproducible)

Calibration corpus (code_calibration.txt): a concatenation of three files from the froggeric/imatrix dataset — groups_merged.txt + code.txt + technical.txt (~646 KB total) — code-heavy but diverse enough to avoid domain overfitting. The resulting imatrix (qwen3.6-27b-code.imatrix, 339 chunks) is included in this repo, so you can reproduce the quant exactly without recomputing it.

# 1) importance matrix
llama-imatrix -m Qwen3.6-27B-BF16-00001-of-00002.gguf \
  -f code_calibration.txt -o qwen3.6-27b-code.imatrix \
  -dev Vulkan0 -ngl 999 -fa on -c 512

# 2) quantize: quality-biased STRIX preset + f16 embeddings + imatrix
llama-quantize \
  --imatrix qwen3.6-27b-code.imatrix \
  --token-embedding-type f16 \
  Qwen3.6-27B-BF16-00001-of-00002.gguf \
  Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf \
  Q4_0_ROCMFP4_STRIX

Quality (internal perplexity, directional only)

Held-out perplexity at n_ctx=512, vs the same quant without imatrix (embeddings f16 in both):

Test set no-imatrix this (imatrix)
held-out code 1.8631 1.8596
held-out prose 5.7109 5.7165

Tiny improvement on code (the calibration domain), neutral on prose — expected at this bit rate (at 4+ bpw the base quant is already close to the original, so imatrix is a polish, not a transformation). Small internal checks, not rigorous benchmarks; reproduce before citing.

Status & caveats

Experimental research build. Results are hardware-, driver-, model-, and prompt-sensitive, and tuned for AMD Strix Halo — they may not reproduce on other GPUs. This is not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.

Credits & license

  • Base model: Qwen3.6-27B (Qwen team) — a derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution/use.
  • BF16 GGUF source: unsloth/Qwen3.6-27B-MTP-GGUF @ 5cb35eb3dcbf52dbce5f87dbc64df6aaffadcace.
  • ROCmFP4 format & runtime: charlie12345/rocmfp4-llama (based on llama.cpp, MIT).

Part 2 — Making practical use of it

What I observed (the direction here)

These are hands-on observations from daily use on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0) — not benchmarks, but the direction I was exploring:

  • Raising the token-embedding layer to full precision (f16) made the model follow instructions noticeably better. It was the single change I felt the most — the embedding is the foundation every layer builds on, and the model has a very large vocab, so a faithful embedding pays off. It costs almost nothing on speed because the embedding is a lookup, not a matmul.
  • The code-calibrated imatrix is a free polish on top (same size and speed) — small, but in the right direction on code.
  • It's fast and genuinely usable day-to-day: MTP self-speculative decoding with full-precision KV gives ~0.87–0.90 draft acceptance, and it holds up at long context.
  • It pairs especially well with my OpenCode fork (below), which keeps the prompt cache intact across history compaction — so long coding sessions don't re-prefill every turn.

Run config (highest MTP acceptance on Strix Halo)

Full-precision (f16) KV is the dominant acceptance lever here — it raised draft acceptance to ~0.87–0.90 warm (vs ~0.70–0.76 with q8/q4 KV). 128 GB unified RAM affords it; on less memory drop to -ctk q8_0 -ctv q8_0 (lower acceptance).

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server -m Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf \
  --alias qwen3.6-27b-rocmfp4-mtp --host 0.0.0.0 --port 8080 \
  -dev Vulkan0 -ngl 999 -fa on \
  -c 262144 -b 2048 -ub 256 -t 16 -tb 16 \
  -ctk f16 -ctv f16 \
  -cpent 256 -ctxcp 32 --cache-reuse 256 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 \
  --spec-type draft-mtp --spec-draft-device Vulkan0 --spec-draft-ngl all \
  --spec-draft-type-k f16 --spec-draft-type-v f16 \
  --spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
  --reasoning on --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja --parallel 1 --metrics --no-mmap
Flag Why
-dev Vulkan0 Vulkan (KHR_coopmat) beats ROCm/HIP here — ~+1.7× prefill
-ub 256 prefill optimum on this APU; bigger ubatch is slower
-ctk f16 -ctv f16 full-precision main KV — the dominant MTP-acceptance lever
--spec-type draft-mtp + f16 draft KV use the model's built-in MTP head; f16 draft KV keeps acceptance high
--temp 0.6 ... Qwen3.6 "precise coding" sampling (temp 1.0 for general tasks)

Decode (this hardware): ~33 t/s short context, ~18 t/s at ~140K. It's a hybrid SSM + attention model (48 SSM + 17 attention blocks), so only the attention layers grow a KV cache — it degrades gracefully at long context.

Multi-turn prompt-cache reuse (the part that makes it usable)

Qwen3.6's recurrent (SSM) state can't be partially rewound, so multi-turn reuse needs a context checkpoint at/before the divergence point. Two defaults otherwise force a full re-prefill every turn; both are fixed by flags above:

  1. Checkpoint cadence. Default -cpent is 8192, so prompts under 8K never get a usable checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256 (checkpoint every 256 tokens, keep 32, reuse a matching prefix of ≥256 tokens). Verified: a shared 3,000-token prefix re-prefill dropped 12.4 s → ~0.1 s.
  2. Thinking text breaking the prefix match. --reasoning-format controls where <think> goes:
    • deepseek (used here) → clean content + reasoning_content, auto-paired with --chat-template-kwargs '{"preserve_thinking": true}' so the Jinja template keeps <think> for all turns. Reuse holds if the client echoes reasoning_content back — and with OpenCode the large stable leading context reuses via checkpoints regardless.
    • none → leaves <think> inline in content, so any content-echoing client gets reuse (raw tags show inline). deepseek-legacy/auto do not reuse.
  3. Vision projector kills reuse. Loading --mmproj disables cache reuse entirely; keep vision off for text/code.

--jinja is required so the chat template (and preserve_thinking) apply.

OpenCode + my fork

Point OpenCode at the server as an OpenAI-compatible provider. In single-model mode llama-server ignores the request's model field, so the client's model name is just a label (it does not have to match --alias). The provider below is named lmstudio only because it uses the generic OpenAI-compatible adapter — it points at this llama-server, not LM Studio.

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "local llama-server (ROCmFP4)",
      "options": { "baseURL": "http://<host>:8080/v1", "apiKey": "sk-local" },
      "models": {
        "qwen3.6-27b-mtp": {
          "name": "Qwen 3.6 27B",
          "limit": { "context": 262144, "output": 32768 }
        }
      }
    }
  },
  "model": "lmstudio/qwen3.6-27b-mtp",
  "compaction": { "auto": true, "reserved": 16384 }
}

Project-local opencode.json — disable the task tool so agents don't spawn subagents, keeping the whole session on one cache-friendly context:

{
  "$schema": "https://opencode.ai/config.json",
  "agent": {
    "build": { "tools": { "task": false } },
    "plan":  { "tools": { "task": false } }
  }
}

The fork: PlunderStruck/opencode. compaction.auto summarizes history when the context fills — which in stock OpenCode rewrites the leading prompt and invalidates the cache, forcing a full re-prefill. This fork compacts without breaking the cached prefix (plus a few other adjustments), so cache reuse survives compaction. Paired with the checkpoint flags above, long sessions stay fast and actually usable.

Downloads last month
371
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(2)
this model