▗▇▇▇▇▇▇▇▖
▗█▘▝██████▖
▗▛ ▝██████▆▆▆▆▆▆▆▆▆▆▅
▟▛ ▗█████████████████▙▖
▄▄▄▄▄▟▛ ▟████████████████████▖
▗██▌ ▚▖ ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘
▗████▖ ▜▖ ▗█▘
▜█████▙ ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙
▜█████▙ ▝████████████▛ ▜▙
▜█████▙ ▝██████████▛ ▃ ▜▙
▀█████▙▖ ▝████████▘ ▟█▙ ▀▙
▝██████▖ ▝▜█████▘ ▟███▙▂▂▂▂▐█
▟███████▖ ▜███▘ ▗███████████▛
▟█████████▄ ▜▛ ▗███████████▀
▝█████▀ ▗▛ ▗██████▀▀▀▀▀▘
▜██▘ ▗▛ ▟█████▛▘
▜█▇▇▇▇▇▇▇▇▇█▖ ▟█████▛
▝█▖ ▟█████▛
▝███████▀
FORMAT ROCmFP4 4-BIT |
PRECISION ~4.5 BPW |
SIZE 16 GB |
CONTEXT 262 K |
DRAFT MTP n-max 5 |
VISION QWEN3-VL |
BACKEND VULKAN0 |
LICENSE APACHE-2.0 |
The custom
q4_0_rocmfp4 tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.
One file — the best speed/quality balance in ROCmFP4 for the Qwopus coder. It keeps the two quality levers that are actually felt — genuine f16 token embeddings (from BF16) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + the preserved MTP head, and ships no imatrix (deliberate — imatrix worsened this coder's code-PPL, see §05). Not the leanest-fastest possible (a Q5-embedding build squeezes out a few more tok/s, at a quality cost you'll notice), and not the most faithful possible (see the Jackrong fidelity link in §05) — it's the point where speed and quality meet best. Repo also bundles chat_template.jinja — froggeric's unified Qwen3.6 template (tool calls + inline <|think_off|> + vision).
Run from the folder holding the .gguf + chat_template.jinja:
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
-m Qwopus3.6-27B-Coder-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
--alias qwopus-coder \
--host 0.0.0.0 \
--port 8080 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-c 262144 \
-b 2048 \
-ub 256 \
-t 16 \
-tb 16 \
-ctk f16 \
-ctv f16 \
-cpent 256 \
-ctxcp 32 \
--cache-reuse 256 \
--cache-ram 65536 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--spec-type draft-mtp \
--spec-draft-device Vulkan0 \
--spec-draft-ngl all \
--spec-draft-type-k f16 \
--spec-draft-type-v f16 \
--spec-draft-n-max 5 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.0 \
--spec-draft-p-split 0.10 \
--chat-template-file chat_template.jinja \
--reasoning-format deepseek \
--chat-template-kwargs '{"enable_thinking": false, "preserve_thinking": true}' \
--jinja \
--parallel 1 \
--metrics \
--no-mmap \
--mmproj mmproj-F32.gguf \
--image-min-tokens 1024
The last two lines enable vision (Qwen3-VL) — omit them for text-only. The mmproj-F32.gguf projector is bundled in this repo (Qwen3-VL, projection_dim 5120). --image-min-tokens 1024 is required whenever --mmproj is set — fewer image tokens and Qwen-VL misreads fine detail (see §04).
Run thinking-off — it commits straight to the tool call instead of over-planning in <think>. Naive enable_thinking:false breaks cross-turn prompt-cache reuse (measured 0% → full re-prefill); add preserve_thinking:true and the cache stays (~86% reuse measured):
--reasoning-format deepseek --chat-template-kwargs '{"enable_thinking": false, "preserve_thinking": true}'
OpenCode — via my fork PlunderStruck/opencode (compaction doesn't rewrite the leading prompt → cache survives long sessions). The model must be tool_call: true, or OpenCode won't send tools natively and the model just narrates code instead of calling tools:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"strix": {
"npm": "@ai-sdk/openai-compatible",
"options": { "baseURL": "http://<server-ip>:8080/v1" },
"models": { "qwopus-coder": { "tool_call": true, "limit": { "context": 262144, "output": 65536 } } }
}
},
"model": "strix/qwopus-coder"
}
Qwen3-VL lineage — vision works by adding the bundled mmproj-F32.gguf projector with --mmproj (same LLM GGUF, no separate vision model). It's the Qwen3-VL projector (projection_dim 5120), shipped in this repo:
--mmproj mmproj-F32.gguf \
--image-min-tokens 1024 # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail
<|think_off|> or allow enough tokens, else the answer can come back empty. With --mmproj loaded the server disables the --cache-reuse feature (it logs "cache_reuse is not supported by multimodal"); whether ordinary cross-turn caching still helps with vision isn't something we've benchmarked.
Why no imatrix (we measured it): a code-weighted importance matrix improved fidelity-to-BF16 (median KL −15%, top-token +0.6 pp) but measurably worsened held-out-code perplexity (+2.6%, significant). For a coder, code-prediction is the task-relevant metric, so we shipped the non-imatrix quant. (On Qwen3-Coder-Next the imatrix was a clean win — it's model-dependent.)
This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. We swept the same rocmfp4 levers we mapped in detail on the base Qwen3.6-27B (embedding precision, output-head precision, fast single-scale vs all-dual-scale body) and the frontier landed in the same place for the coder: an all-dual-scale body trims worst-case token divergences only inside the measurement noise while costing decode speed, and top-token agreement is tied — so greedy output is effectively identical and the fast single-scale body is the right point. A leaner Q5-embedding build is a few tok/s faster but degrades the one quality lever that's actually felt; we keep full f16 embeddings.
So the recipe is the same one the base-model sweep settled on — fast single-scale body + f16 embeddings + Q6 output head — applied here over Jackrong's tuned coder, and shipped non-imatrix (above) because that's what wins on code-prediction. See the base 27B card's §05 for the full lever-by-lever sweep, KL/decode frontier table, and the format-limit discussion; we don't re-print its numbers here.
# from Jackrong's BF16+MTP GGUF -> ROCmFP4, genuine f16 embeddings, no imatrix
llama-quantize --token-embedding-type f16 \
Qwopus3.6-27B-Coder-MTP-BF16.gguf \
Qwopus3.6-27B-Coder-MTP-ROCmFP4-STRIX-embF16.gguf Q4_0_ROCMFP4_STRIX
# headQ6 variant adds the Q6_K output head
llama-quantize --token-embedding-type f16 --output-tensor-type q6_K \
Qwopus3.6-27B-Coder-MTP-BF16.gguf \
Qwopus3.6-27B-Coder-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf Q4_0_ROCMFP4_STRIX
Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.
Derivative quantization — verify the base model's license before redistribution / use.
- Downloads last month
- 6,757
16-bit