PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWOPUS3.6-27B-CODER-MTP
4-BIT ROCmFP4 · MTP SELF-SPECULATIVE DECODE · VISION-CAPABLE · SINGLE AMD APU
FORMAT
ROCmFP4 4-BIT
PRECISION
~4.5 BPW
SIZE
16 GB
CONTEXT
262 K
DRAFT
MTP n-max 5
VISION
QWEN3-VL
BACKEND
VULKAN0
LICENSE
APACHE-2.0
⚠ REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.
NOTE // Ignore HuggingFace's auto-detected "F16" badge — its parser can't read ROCmFP4 and mislabels by the f16 embeddings. These are ~4.4–4.5 bpw 4-bit files; pick by filename.
01 · FILES
File Size Output head Pick if
…-STRIX-embF16-headQ6.gguf16.9 GBQ6_Kthe one build — best speed/quality balance: f16 embeddings + Q6 output head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for the Qwopus coder. It keeps the two quality levers that are actually felt — genuine f16 token embeddings (from BF16) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + the preserved MTP head, and ships no imatrix (deliberate — imatrix worsened this coder's code-PPL, see §05). Not the leanest-fastest possible (a Q5-embedding build squeezes out a few more tok/s, at a quality cost you'll notice), and not the most faithful possible (see the Jackrong fidelity link in §05) — it's the point where speed and quality meet best. Repo also bundles chat_template.jinja — froggeric's unified Qwen3.6 template (tool calls + inline <|think_off|> + vision).

02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwopus3.6-27B-Coder-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  --alias qwopus-coder \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 262144 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"enable_thinking": false, "preserve_thinking": true}' \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision (Qwen3-VL) — omit them for text-only. The mmproj-F32.gguf projector is bundled in this repo (Qwen3-VL, projection_dim 5120). --image-min-tokens 1024 is required whenever --mmproj is set — fewer image tokens and Qwen-VL misreads fine detail (see §04).

Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1allow use of the full 128 GB unified memory
-dev Vulkan0run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
-ngl 999 · -fa onoffload all layers · flash attention
-c 262144context length (256K)
-b 2048 · -ub 256 · -t/-tb 16prefill batch / micro-batch · CPU threads
-ctk f16 · -ctv f16f16 KV cache — how we run it; drop to q8_0/q4_0 to use less memory
-cpent · -ctxcp · --cache-reuse · --cache-ram 65536cross-turn KV checkpointing + 64 GB resident reuse cache
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0Qwen3.6 recommended sampling
--spec-type draft-mtp · --spec-draft-n-max 5built-in MTP head, self-speculative; draft depth 5 (measured optimum)
--spec-draft-device Vulkan0 · -ngl all · type-k/v f16draft head on Vulkan, fully offloaded, f16 KV
--chat-template-file chat_template.jinjabundled froggeric template (tool calls + think-toggle + vision)
--reasoning-format deepseek + kwargs {enable_thinking:false, preserve_thinking:true}thinking-off for agentic use, but keep cross-turn cache (~86% reuse)
--jinja --parallel 1 --metrics --no-mmapapply template · single slot · metrics · weights in RAM
--mmproj <Qwen3-VL projector>  (vision · optional)enable image input — any Qwen3-VL projector with projection_dim 5120 (the Qwen3.6-27B one, f16/f32); see §04
--image-min-tokens 1024  (vision)required whenever --mmproj is set — fewer image tokens and Qwen-VL misreads fine detail (e.g. OCR)
03 · CODING AGENT

Run thinking-off — it commits straight to the tool call instead of over-planning in <think>. Naive enable_thinking:false breaks cross-turn prompt-cache reuse (measured 0% → full re-prefill); add preserve_thinking:true and the cache stays (~86% reuse measured):

--reasoning-format deepseek --chat-template-kwargs '{"enable_thinking": false, "preserve_thinking": true}'

OpenCode — via my fork PlunderStruck/opencode (compaction doesn't rewrite the leading prompt → cache survives long sessions). The model must be tool_call: true, or OpenCode won't send tools natively and the model just narrates code instead of calling tools:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "strix": {
      "npm": "@ai-sdk/openai-compatible",
      "options": { "baseURL": "http://<server-ip>:8080/v1" },
      "models": { "qwopus-coder": { "tool_call": true, "limit": { "context": 262144, "output": 65536 } } }
    }
  },
  "model": "strix/qwopus-coder"
}
04 · VISION

Qwen3-VL lineage — vision works by adding the bundled mmproj-F32.gguf projector with --mmproj (same LLM GGUF, no separate vision model). It's the Qwen3-VL projector (projection_dim 5120), shipped in this repo:

  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024     # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail
NOTE // thinking model → for one-shot image Q&A use <|think_off|> or allow enough tokens, else the answer can come back empty. With --mmproj loaded the server disables the --cache-reuse feature (it logs "cache_reuse is not supported by multimodal"); whether ordinary cross-turn caching still helps with vision isn't something we've benchmarked.
05 · PERFORMANCE & QUALITY
DECODE · thinking-off~34–38 t/s (Vulkan / Strix Halo)
MTP DRAFT ACCEPTANCE · code~0.8
BIGCODEBENCH HARD · instruct · pass@146/148 (31.1%) · thinking-off, greedy
QUANTIZATIONnon-imatrix (measured better for code)
UPSTREAM BENCHMARK // Published by Jackrong for the base Qwopus-Coder — NOT re-measured on this ROCmFP4 quant: SWE-bench Verified 335/500 = 67.0%, run thinking-off on the Qwopus-3.6-27B-Coder Q5_K_M GGUF. Our quant is a different 4-bit ROCmFP4 build; we have not re-run SWE-bench.

Why no imatrix (we measured it): a code-weighted importance matrix improved fidelity-to-BF16 (median KL −15%, top-token +0.6 pp) but measurably worsened held-out-code perplexity (+2.6%, significant). For a coder, code-prediction is the task-relevant metric, so we shipped the non-imatrix quant. (On Qwen3-Coder-Next the imatrix was a clean win — it's model-dependent.)

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. We swept the same rocmfp4 levers we mapped in detail on the base Qwen3.6-27B (embedding precision, output-head precision, fast single-scale vs all-dual-scale body) and the frontier landed in the same place for the coder: an all-dual-scale body trims worst-case token divergences only inside the measurement noise while costing decode speed, and top-token agreement is tied — so greedy output is effectively identical and the fast single-scale body is the right point. A leaner Q5-embedding build is a few tok/s faster but degrades the one quality lever that's actually felt; we keep full f16 embeddings.

So the recipe is the same one the base-model sweep settled on — fast single-scale body + f16 embeddings + Q6 output head — applied here over Jackrong's tuned coder, and shipped non-imatrix (above) because that's what wins on code-prediction. See the base 27B card's §05 for the full lever-by-lever sweep, KL/decode frontier table, and the format-limit discussion; we don't re-print its numbers here.

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Jackrong's higher-bit Qwopus3.6-27B-Coder-MTP-GGUF (standard K-quants) run on this same fork — roughly ~2× lower KL divergence vs BF16, at slower decode, and MTP still works. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.
06 · BUILD (REPRODUCIBLE)
# from Jackrong's BF16+MTP GGUF -> ROCmFP4, genuine f16 embeddings, no imatrix
llama-quantize --token-embedding-type f16 \
  Qwopus3.6-27B-Coder-MTP-BF16.gguf \
  Qwopus3.6-27B-Coder-MTP-ROCmFP4-STRIX-embF16.gguf  Q4_0_ROCMFP4_STRIX

# headQ6 variant adds the Q6_K output head
llama-quantize --token-embedding-type f16 --output-tensor-type q6_K \
  Qwopus3.6-27B-Coder-MTP-BF16.gguf \
  Qwopus3.6-27B-Coder-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

07 · LINEAGE & CREDITS
BASE MODELJackrong/Qwopus3.6-27B-Coder-MTP (Apache-2.0) · from Qwen3.6-27B (Qwen)
FORMAT + RUNTIMEcharlie12345/rocmfp4-llama (llama.cpp, MIT)
CHAT TEMPLATEfroggeric/Qwen-Fixed-Chat-Templates

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month
6,757
GGUF
Model size
0.5B params
Architecture
clip
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including plunderstruck/Qwopus3.6-27B-Coder-MTP-ROCmFP4-GGUF