plunderstruck/Qwopus3.6-27B-Coder-MTP-ROCmFP4-GGUF

PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWOPUS3.6-27B-CODER-MTP
4-BIT ROCmFP4 · MTP SELF-SPECULATIVE DECODE · VISION-CAPABLE · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
~4.5 BPW

      SIZE
16 GB

      CONTEXT
262 K

    

      DRAFT
MTP n-max 5

      VISION
QWEN3-VL

      BACKEND
VULKAN0

      LICENSE
APACHE-2.0

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.

NOTE // Ignore HuggingFace's auto-detected "F16" badge — its parser can't read ROCmFP4 and mislabels by the f16 embeddings. These are ~4.4–4.5 bpw 4-bit files; pick by filename.

01 · FILES

File	Size	Output head	Pick if
`…-STRIX-embF16-headQ6.gguf` ★	16.9 GB	Q6_K	the one build — best speed/quality balance: f16 embeddings + Q6 output head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for the Qwopus coder. It keeps the two quality levers that are actually felt — genuine f16 token embeddings (from BF16) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + the preserved MTP head, and ships no imatrix (deliberate — imatrix worsened this coder's code-PPL, see §05). Not the leanest-fastest possible (a Q5-embedding build squeezes out a few more tok/s, at a quality cost you'll notice), and not the most faithful possible (see the Jackrong fidelity link in §05) — it's the point where speed and quality meet best. Repo also bundles chat_template.jinja — froggeric's unified Qwen3.6 template (tool calls + inline <|think_off|> + vision).

02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwopus3.6-27B-Coder-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  --alias qwopus-coder \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 262144 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"enable_thinking": false, "preserve_thinking": true}' \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision (Qwen3-VL) — omit them for text-only. The mmproj-F32.gguf projector is bundled in this repo (Qwen3-VL, projection_dim 5120). --image-min-tokens 1024 is required whenever --mmproj is set — fewer image tokens and Qwen-VL misreads fine detail (see §04).

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 262144`	context length (256K)
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch · CPU threads
`-ctk f16 · -ctv f16`	f16 KV cache — how we run it; drop to `q8_0`/`q4_0` to use less memory
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing + 64 GB resident reuse cache
`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0`	Qwen3.6 recommended sampling
`--spec-type draft-mtp · --spec-draft-n-max 5`	built-in MTP head, self-speculative; draft depth 5 (measured optimum)
`--spec-draft-device Vulkan0 · -ngl all · type-k/v f16`	draft head on Vulkan, fully offloaded, f16 KV
`--chat-template-file chat_template.jinja`	bundled froggeric template (tool calls + think-toggle + vision)
`--reasoning-format deepseek + kwargs {enable_thinking:false, preserve_thinking:true}`	thinking-off for agentic use, but keep cross-turn cache (~86% reuse)
`--jinja --parallel 1 --metrics --no-mmap`	apply template · single slot · metrics · weights in RAM
`--mmproj <Qwen3-VL projector>` (vision · optional)	enable image input — any Qwen3-VL projector with `projection_dim 5120` (the Qwen3.6-27B one, f16/f32); see §04
`--image-min-tokens 1024` (vision)	required whenever `--mmproj` is set — fewer image tokens and Qwen-VL misreads fine detail (e.g. OCR)

03 · CODING AGENT

Run thinking-off — it commits straight to the tool call instead of over-planning in <think>. Naive enable_thinking:false breaks cross-turn prompt-cache reuse (measured 0% → full re-prefill); add preserve_thinking:true and the cache stays (~86% reuse measured):

--reasoning-format deepseek --chat-template-kwargs '{"enable_thinking": false, "preserve_thinking": true}'

OpenCode — via my fork PlunderStruck/opencode (compaction doesn't rewrite the leading prompt → cache survives long sessions). The model must be tool_call: true, or OpenCode won't send tools natively and the model just narrates code instead of calling tools:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "strix": {
      "npm": "@ai-sdk/openai-compatible",
      "options": { "baseURL": "http://<server-ip>:8080/v1" },
      "models": { "qwopus-coder": { "tool_call": true, "limit": { "context": 262144, "output": 65536 } } }
    }
  },
  "model": "strix/qwopus-coder"
}

04 · VISION

Qwen3-VL lineage — vision works by adding the bundled mmproj-F32.gguf projector with --mmproj (same LLM GGUF, no separate vision model). It's the Qwen3-VL projector (projection_dim 5120), shipped in this repo:

  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024     # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail

NOTE // thinking model → for one-shot image Q&A use <|think_off|> or allow enough tokens, else the answer can come back empty. With --mmproj loaded the server disables the --cache-reuse feature (it logs "cache_reuse is not supported by multimodal"); whether ordinary cross-turn caching still helps with vision isn't something we've benchmarked.

05 · PERFORMANCE & QUALITY

DECODE · thinking-off	~34–38 t/s (Vulkan / Strix Halo)
MTP DRAFT ACCEPTANCE · code	~0.8
BIGCODEBENCH HARD · instruct · pass@1	46/148 (31.1%) · thinking-off, greedy
QUANTIZATION	non-imatrix (measured better for code)

UPSTREAM BENCHMARK // Published by Jackrong for the base Qwopus-Coder — NOT re-measured on this ROCmFP4 quant: SWE-bench Verified 335/500 = 67.0%, run thinking-off on the Qwopus-3.6-27B-Coder Q5_K_M GGUF. Our quant is a different 4-bit ROCmFP4 build; we have not re-run SWE-bench.

Why no imatrix (we measured it): a code-weighted importance matrix improved fidelity-to-BF16 (median KL −15%, top-token +0.6 pp) but measurably worsened held-out-code perplexity (+2.6%, significant). For a coder, code-prediction is the task-relevant metric, so we shipped the non-imatrix quant. (On Qwen3-Coder-Next the imatrix was a clean win — it's model-dependent.)

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. We swept the same rocmfp4 levers we mapped in detail on the base Qwen3.6-27B (embedding precision, output-head precision, fast single-scale vs all-dual-scale body) and the frontier landed in the same place for the coder: an all-dual-scale body trims worst-case token divergences only inside the measurement noise while costing decode speed, and top-token agreement is tied — so greedy output is effectively identical and the fast single-scale body is the right point. A leaner Q5-embedding build is a few tok/s faster but degrades the one quality lever that's actually felt; we keep full f16 embeddings.

So the recipe is the same one the base-model sweep settled on — fast single-scale body + f16 embeddings + Q6 output head — applied here over Jackrong's tuned coder, and shipped non-imatrix (above) because that's what wins on code-prediction. See the base 27B card's §05 for the full lever-by-lever sweep, KL/decode frontier table, and the format-limit discussion; we don't re-print its numbers here.

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Jackrong's higher-bit Qwopus3.6-27B-Coder-MTP-GGUF (standard K-quants) run on this same fork — roughly ~2× lower KL divergence vs BF16, at slower decode, and MTP still works. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.

06 · BUILD (REPRODUCIBLE)

# from Jackrong's BF16+MTP GGUF -> ROCmFP4, genuine f16 embeddings, no imatrix
llama-quantize --token-embedding-type f16 \
  Qwopus3.6-27B-Coder-MTP-BF16.gguf \
  Qwopus3.6-27B-Coder-MTP-ROCmFP4-STRIX-embF16.gguf  Q4_0_ROCMFP4_STRIX

# headQ6 variant adds the Q6_K output head
llama-quantize --token-embedding-type f16 --output-tensor-type q6_K \
  Qwopus3.6-27B-Coder-MTP-BF16.gguf \
  Qwopus3.6-27B-Coder-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

07 · LINEAGE & CREDITS

BASE MODEL	Jackrong/Qwopus3.6-27B-Coder-MTP (Apache-2.0) · from Qwen3.6-27B (Qwen)
FORMAT + RUNTIME	charlie12345/rocmfp4-llama (llama.cpp, MIT)
CHAT TEMPLATE	froggeric/Qwen-Fixed-Chat-Templates

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month: 6,757

GGUF

Model size

0.5B params

Architecture

clip

Hardware compatibility

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including plunderstruck/Qwopus3.6-27B-Coder-MTP-ROCmFP4-GGUF

ROCmFP4 MTP · Strix Halo

Collection

Self-speculative MTP quants in custom ROCmFP4 4-bit for AMD Strix Halo (gfx1151). Needs the charlie12345/rocmfp4-llama fork. • 5 items • Updated 2 days ago • 2

FORMAT ROCmFP4 4-BIT	PRECISION ~4.5 BPW	SIZE 16 GB	CONTEXT 262 K
DRAFT MTP n-max 5	VISION QWEN3-VL	BACKEND VULKAN0	LICENSE APACHE-2.0