Instructions to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF",
	filename="Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./llama-cli -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16

Use Docker

docker model run hf.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16

LM Studio
Jan
Ollama
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Ollama:
```
ollama run hf.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
```

Unsloth Studio

How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF to start chatting

How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16

Run Hermes

hermes

Docker Model Runner
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
```

Lemonade

How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16

Run and chat with the model

lemonade run user.Qwen3.6-27B-MTP-ROCmFP4-GGUF-BF16

List all available models

lemonade list

Qwen3.6-27B-MTP — ROCmFP4 STRIX (imatrix + f16 embeddings)

Experimental AMD Strix Halo (gfx1151) quant of Qwen3.6-27B (dense, with the built-in MTP / next-token-prediction head) in the custom ROCmFP4 4-bit format — tuned for high MTP draft acceptance and long-context, multi-turn coding use.

⚠️ Ignore HuggingFace's auto-detected quant badge ("F16" / 16-bit) — it's wrong. HF's parser only knows the standard GGUF quant types, so it can't read the custom ROCmFP4 format. It ends up "seeing" only the genuinely-f16 token embeddings and mislabels the whole file as 16-bit. These are ~4.8 bpw 4-bit ROCmFP4 files, not 16-bit. Pick a file by its name in the Files and versions tab (see the two-files table below).

Requires the ROCmFP4 fork (public) — not stock llama.cpp

This file uses the ROCmFP4 tensor types (q4_0_rocmfp4, q4_0_rocmfp4_fast). Stock llama.cpp, LM Studio, Ollama, Jan, koboldcpp, etc. cannot load it. Build and run it with the public fork charlie12345/rocmfp4-llama:
git clone https://github.com/charlie12345/rocmfp4-llama
cd rocmfp4-llama && git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

Two files in this repo (pick your trade-off)

File	size	output head	best for
`…-STRIX-imatrix-embF16.gguf`	16.5 GB	ROCmFP4 4-bit	fastest — the original daily driver
`…-STRIX-imatrix-embF16-headQ6.gguf`	16.9 GB	Q6_K	a notch more faithful — trades ~5–7% decode for it

The two are identical except one tensor: same STRIX recipe, same f16 embeddings, same imatrix, same MTP head — they differ only in the output head (output.weight). Most of this card describes that shared recipe; the section right below is just about the one change.

The Q6-head variant — a step up (experimental)

The f16-embeddings note further down is the change I felt the most: full-precision token embeddings made the model follow instructions noticeably better. This variant does the same thing to the other end of the model — the output head that turns the final hidden state into the next-token choice — raising it from the 4-bit ROCmFP4 format to standard Q6_K, and leaving everything else untouched.

What I observed: a further step up in instruction-following — beyond what the f16 embeddings already gave. Subjectively it's more consistent at actually doing what it's told: reaching for the specific tool I asked for, and sticking to the rules/format of a task, more reliably than the f16-embeddings build alone. The embedding is the input side; the output head is the output side — sharpening both beats sharpening either.

How I checked it wasn't just a vibe. Two measurements, both on held-out text the model never trained or was calibrated on:

Perplexity — how well it predicts held-out text (lower is better). The Q6 head improved both code and prose, where the imatrix on its own only helped code:

Test set daily (4-bit head) Q6 head

held-out code 1.8596 1.8550

held-out prose 5.7165 5.6761
KL divergence vs the original BF16 model — how closely its word-probabilities track the full-precision model it's a copy of (lower = more faithful). The Q6 head was closer to BF16 on every measure (mean ≈ 0.0369 → 0.0345, about 6% nearer the original). It still agrees with BF16's top word ~96% of the time either way — so the head mostly sharpens confidence on the same choice rather than flipping it, which is exactly what "follows the rules more consistently" feels like.

Test set	daily (4-bit head)	Q6 head
held-out code	1.8596	1.8550
held-out prose	5.7165	5.6761

These are small but consistent gains — not night-and-day, but they move the right way across two different tests and two text types, which matches what I felt. Small internal checks, not formal benchmarks; reproduce before citing.

The cost. The Q6 head steps off the tuned 4-bit kernel for that one tensor, so decode is ~5–7% slower at short context (a couple tokens/sec on this hardware), and the gap shrinks at long context (the head is a fixed per-token cost that gets diluted as the KV cache grows). Size grows ~0.4 GB. For me the quality is worth it; if you want maximum speed, use the original file above.

Build it yourself — same as the daily driver, with one extra flag (--output-tensor-type q6_K):

llama-quantize \
  --imatrix qwen3.6-27b-code.imatrix \
  --token-embedding-type f16 \
  --output-tensor-type q6_K \
  Qwen3.6-27B-BF16-00001-of-00002.gguf \
  Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf \
  Q4_0_ROCMFP4_STRIX

Part 1 — The model

What this is

Base: unsloth/Qwen3.6-27B-MTP-GGUF BF16, pinned at revision 5cb35eb3dcbf52dbce5f87dbc64df6aaffadcace. It carries the nextn_predict_layers=1 MTP head, so self-speculative draft-MTP survives quantization.
Format: ROCmFP4 — a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block. Tensor-aware: sensitive attention K/V on the dual-scale q4_0_rocmfp4, the bulk (FFN, lm-head) on the faster single-scale q4_0_rocmfp4_fast.
This variant (STRIX-imatrix-embF16):
- f16 token embeddings (full precision — it's a lookup, so ~zero decode cost).
- code-calibrated importance matrix (imatrix) applied to all 496 quantizable tensors.

	value
File	`Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf`
Size / bpw	16.5 GB / 4.82 bpw
token_embd	F16
attention K/V (+ fused QKV)	`q4_0_rocmfp4` (dual-scale)
FFN, lm-head, rest	`q4_0_rocmfp4_fast`
MTP head	preserved (`blk.64.nextn.*`)

How it was built (reproducible)

Calibration corpus (code_calibration.txt): a concatenation of three files from the froggeric/imatrix dataset — groups_merged.txt + code.txt + technical.txt (~646 KB total) — code-heavy but diverse enough to avoid domain overfitting. The resulting imatrix (qwen3.6-27b-code.imatrix, 339 chunks) is included in this repo, so you can reproduce the quant exactly without recomputing it.

# 1) importance matrix
llama-imatrix -m Qwen3.6-27B-BF16-00001-of-00002.gguf \
  -f code_calibration.txt -o qwen3.6-27b-code.imatrix \
  -dev Vulkan0 -ngl 999 -fa on -c 512

# 2) quantize: quality-biased STRIX preset + f16 embeddings + imatrix
llama-quantize \
  --imatrix qwen3.6-27b-code.imatrix \
  --token-embedding-type f16 \
  Qwen3.6-27B-BF16-00001-of-00002.gguf \
  Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf \
  Q4_0_ROCMFP4_STRIX

Quality (internal perplexity, directional only)

Held-out perplexity at n_ctx=512, vs the same quant without imatrix (embeddings f16 in both):

Test set	no-imatrix	this (imatrix)
held-out code	1.8631	1.8596
held-out prose	5.7109	5.7165

Tiny improvement on code (the calibration domain), neutral on prose — expected at this bit rate (at 4+ bpw the base quant is already close to the original, so imatrix is a polish, not a transformation). Small internal checks, not rigorous benchmarks; reproduce before citing.

Status & caveats

Experimental research build. Results are hardware-, driver-, model-, and prompt-sensitive, and tuned for AMD Strix Halo — they may not reproduce on other GPUs. This is not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.

Credits & license

Base model: Qwen3.6-27B (Qwen team) — a derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution/use.
BF16 GGUF source: unsloth/Qwen3.6-27B-MTP-GGUF @ 5cb35eb3dcbf52dbce5f87dbc64df6aaffadcace.
ROCmFP4 format & runtime: charlie12345/rocmfp4-llama (based on llama.cpp, MIT).

Part 2 — Making practical use of it

What I observed (the direction here)

These are hands-on observations from daily use on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0) — not benchmarks, but the direction I was exploring:

Raising the token-embedding layer to full precision (f16) made the model follow instructions noticeably better. It was the single change I felt the most — the embedding is the foundation every layer builds on, and the model has a very large vocab, so a faithful embedding pays off. It costs almost nothing on speed because the embedding is a lookup, not a matmul.
The code-calibrated imatrix is a free polish on top (same size and speed) — small, but in the right direction on code.
It's fast and genuinely usable day-to-day: MTP self-speculative decoding with full-precision KV gives ~0.87–0.90 draft acceptance, and it holds up at long context.
It pairs especially well with my OpenCode fork (below), which keeps the prompt cache intact across history compaction — so long coding sessions don't re-prefill every turn.

Run config (highest MTP acceptance on Strix Halo)

Full-precision (f16) KV is the dominant acceptance lever here — it raised draft acceptance to ~0.87–0.90 warm (vs ~0.70–0.76 with q8/q4 KV). 128 GB unified RAM affords it; on less memory drop to -ctk q8_0 -ctv q8_0 (lower acceptance).

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server -m Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf \
  --alias qwen3.6-27b-rocmfp4-mtp --host 0.0.0.0 --port 8080 \
  -dev Vulkan0 -ngl 999 -fa on \
  -c 262144 -b 2048 -ub 256 -t 16 -tb 16 \
  -ctk f16 -ctv f16 \
  -cpent 256 -ctxcp 32 --cache-reuse 256 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 \
  --spec-type draft-mtp --spec-draft-device Vulkan0 --spec-draft-ngl all \
  --spec-draft-type-k f16 --spec-draft-type-v f16 \
  --spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
  --reasoning on --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja --parallel 1 --metrics --no-mmap

Flag	Why
`-dev Vulkan0`	Vulkan (KHR_coopmat) beats ROCm/HIP here — ~+1.7× prefill
`-ub 256`	prefill optimum on this APU; bigger ubatch is slower
`-ctk f16 -ctv f16`	full-precision main KV — the dominant MTP-acceptance lever
`--spec-type draft-mtp` + f16 draft KV	use the model's built-in MTP head; f16 draft KV keeps acceptance high
`--temp 0.6 ...`	Qwen3.6 "precise coding" sampling (temp 1.0 for general tasks)

Decode (this hardware): ~33 t/s short context, ~18 t/s at ~140K. It's a hybrid SSM + attention model (48 SSM + 17 attention blocks), so only the attention layers grow a KV cache — it degrades gracefully at long context.

Multi-turn prompt-cache reuse (the part that makes it usable)

Qwen3.6's recurrent (SSM) state can't be partially rewound, so multi-turn reuse needs a context checkpoint at/before the divergence point. Two defaults otherwise force a full re-prefill every turn; both are fixed by flags above:

Checkpoint cadence. Default -cpent is 8192, so prompts under 8K never get a usable checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256 (checkpoint every 256 tokens, keep 32, reuse a matching prefix of ≥256 tokens). Verified: a shared 3,000-token prefix re-prefill dropped 12.4 s → ~0.1 s.
Thinking text breaking the prefix match. --reasoning-format controls where <think> goes:
- deepseek (used here) → clean content + reasoning_content, auto-paired with --chat-template-kwargs '{"preserve_thinking": true}' so the Jinja template keeps <think> for all turns. Reuse holds if the client echoes reasoning_content back — and with OpenCode the large stable leading context reuses via checkpoints regardless.
- none → leaves <think> inline in content, so any content-echoing client gets reuse (raw tags show inline). deepseek-legacy/auto do not reuse.
Vision projector kills reuse. Loading --mmproj disables cache reuse entirely; keep vision off for text/code.

--jinja is required so the chat template (and preserve_thinking) apply.

OpenCode + my fork

Point OpenCode at the server as an OpenAI-compatible provider. In single-model mode llama-server ignores the request's model field, so the client's model name is just a label (it does not have to match --alias). The provider below is named lmstudio only because it uses the generic OpenAI-compatible adapter — it points at this llama-server, not LM Studio.

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "local llama-server (ROCmFP4)",
      "options": { "baseURL": "http://<host>:8080/v1", "apiKey": "sk-local" },
      "models": {
        "qwen3.6-27b-mtp": {
          "name": "Qwen 3.6 27B",
          "limit": { "context": 262144, "output": 32768 }
        }
      }
    }
  },
  "model": "lmstudio/qwen3.6-27b-mtp",
  "compaction": { "auto": true, "reserved": 16384 }
}

Project-local opencode.json — disable the task tool so agents don't spawn subagents, keeping the whole session on one cache-friendly context:

{
  "$schema": "https://opencode.ai/config.json",
  "agent": {
    "build": { "tools": { "task": false } },
    "plan":  { "tools": { "task": false } }
  }
}

The fork: PlunderStruck/opencode. compaction.auto summarizes history when the context fills — which in stock OpenCode rewrites the leading prompt and invalidates the cache, forcing a full re-prefill. This fork compacts without breaking the cached prefix (plus a few other adjustments), so cache reuse survives compaction. Paired with the checkpoint flags above, long sessions stay fast and actually usable.

Downloads last month: 371

GGUF

Model size

27B params

Architecture

qwen35

Hardware compatibility

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF

Base model

Qwen/Qwen3.6-27B

Quantized

unsloth/Qwen3.6-27B-MTP-GGUF

Quantized

(2)

this model