▗▇▇▇▇▇▇▇▖
▗█▘▝██████▖
▗▛ ▝██████▆▆▆▆▆▆▆▆▆▆▅
▟▛ ▗█████████████████▙▖
▄▄▄▄▄▟▛ ▟████████████████████▖
▗██▌ ▚▖ ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘
▗████▖ ▜▖ ▗█▘
▜█████▙ ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙
▜█████▙ ▝████████████▛ ▜▙
▜█████▙ ▝██████████▛ ▃ ▜▙
▀█████▙▖ ▝████████▘ ▟█▙ ▀▙
▝██████▖ ▝▜█████▘ ▟███▙▂▂▂▂▐█
▟███████▖ ▜███▘ ▗███████████▛
▟█████████▄ ▜▛ ▗███████████▀
▝█████▀ ▗▛ ▗██████▀▀▀▀▀▘
▜██▘ ▗▛ ▟█████▛▘
▜█▇▇▇▇▇▇▇▇▇█▖ ▟█████▛
▝█▖ ▟█████▛
▝███████▀
FORMAT ROCmFP4 4-BIT |
PRECISION ~4.8 BPW |
SIZE ~15 GB |
CONTEXT 262 K |
DRAFT MTP n-max 5 (GRAFTED) |
VISION QWEN3-VL |
BACKEND VULKAN0 |
LICENSE APACHE-2.0 |
The custom
q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.
One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt — genuine f16 token embeddings (from BF16) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + a general+code-calibrated imatrix + the grafted MTP draft head. Repo also bundles the mmproj-F32.gguf Qwen3-VL vision projector and chat_template.jinja (froggeric's unified Qwen3.6 template — tool calls + inline <|think_off|>/<|think_on|> + vision).
Run from the folder holding the .gguf + chat_template.jinja:
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
-m Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf \
--alias obliterated-27b-mtp \
--host 0.0.0.0 \
--port 8080 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-c 262144 \
-b 2048 \
-ub 256 \
-t 16 \
-tb 16 \
-ctk f16 \
-ctv f16 \
-cpent 256 \
-ctxcp 32 \
--cache-reuse 256 \
--cache-ram 65536 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--spec-type draft-mtp \
--spec-draft-device Vulkan0 \
--spec-draft-ngl all \
--spec-draft-type-k f16 \
--spec-draft-type-v f16 \
--spec-draft-n-max 5 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.0 \
--spec-draft-p-split 0.10 \
--chat-template-file chat_template.jinja \
--reasoning on \
--reasoning-format deepseek \
--chat-template-kwargs '{"preserve_thinking": true}' \
--jinja \
--parallel 1 \
--metrics \
--no-mmap \
--mmproj mmproj-F32.gguf \
--image-min-tokens 1024
The last two lines enable vision — the mmproj-F32.gguf Qwen3-VL projector is bundled in this repo (projection_dim 5120); omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set (see §03).
Qwen3-VL lineage — vision works via the bundled mmproj-F32.gguf projector with --mmproj (same LLM GGUF, no separate vision model). It's the Qwen3-VL projector (projection_dim 5120, matches this model's hidden size), shipped in this repo:
--mmproj mmproj-F32.gguf \
--image-min-tokens 1024 # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail
Without --image-min-tokens 1024 the server feeds too few image tokens and the model describes images incorrectly (right gist, wrong detail — the server even logs a warning at load). Verified on this model: a code label misread at default tokens read correctly once the flag was set.
<|think_off|> (the bundled chat_template.jinja) or allow enough tokens, else the answer can come back empty. With --mmproj loaded the server disables the --cache-reuse feature (it logs "cache_reuse is not supported by multimodal"); whether ordinary cross-turn caching still helps with vision isn't something we've benchmarked.
This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. The two quality levers that are actually felt — genuine f16 token embeddings and a Q6_K output head — sit on the fast single-scale q4_0_rocmfp4_fast body. The alternatives we'd otherwise reach for (an all-dual-scale body, selective higher-precision tensors) buy a KL improvement that sits inside the measurement noise while costing decode speed — so the fast body is the right point. A leaner Q5-embedding build is a few tok/s faster but degrades the one lever you notice; we keep full f16 embeddings.
We ran this exact sweep in full on the sibling Qwen3.6-27B base card — every rocmfp4 lever measured by KL divergence vs the BF16 reference plus llama-bench decode. The frontier there is the same shape it is here: the fast body + f16 emb + Q6 head is the balance point, all-dual-scale and selective higher-precision land inside the noise, and the dynamic K-quant is the fidelity ceiling that rocmfp4's FP4 can't out-allocate. See that card for the numbers and the full experiments table; OBLITERATED follows the same recipe.
Grafted MTP head — measured. OBLITERATED ships no MTP head, so we transplanted a nextn block from a Qwen3.6-27B-MTP BF16 donor (output-lossless — the draft head only affects speed, never the tokens). Because OBLITERATED is abliterated from that same base, the borrowed head tracks it closely. Measured on the ROCmFP4 fork (llama-server --spec-type draft-mtp, f16 KV on target and drafter, 4 prompt types, temp 0.6):
~2.1× faster decode (14.1 → 29.5 t/s) at 67.7% token acceptance — high on structured/predictable text, lower on freeform prose, the profile of a borrowed (not natively-trained) head. KV is f16 on both the main model and the draft; the draft head is kept at 4-bit ROCmFP4.
The imatrix WINS here — measured. Quantized with an importance matrix from a public general+code calibration mix (Kalomaze groups_merged + froggeric code/technical, via froggeric/imatrix). Measured by KL-divergence + perplexity vs the true BF16 on a held-out general slice, imatrix vs no-imatrix:
For OBLITERATED the imatrix is a clean win on every robust metric — it behaves like its base Qwen3.6-27B, not like the dense Qwopus-Coder (where the same recipe worsened code-PPL — but that was a code metric; this is a general model measured on general text). Always measure; we did.
# 0) convert the abliterated safetensors -> BF16 GGUF
python convert_hf_to_gguf.py OBLITERATED-27b/ --outtype bf16 --outfile OBLITERATED-BF16.gguf
# 1) ** GOTCHA ** OBLITERATED's config declares `mtp_num_hidden_layers: 1` but ships NO MTP weights,
# so the convert labels it block_count=65 with only 64 real layers -> "missing tensor blk.64.*" on load.
# The real model is 64 layers; the donor's nextn becomes the new blk.64.
# 2) graft a Qwen3.6-27B nextn head onto blk.64, then set block_count = 65
python inject_mtp_40b.py --target OBLITERATED-BF16.gguf --donor Qwen3.6-27B-MTP-BF16.gguf \
--output OBLITERATED-MTP-BF16.gguf --source-layer 64 --dest-layer 64
python gguf_set_metadata.py OBLITERATED-MTP-BF16.gguf qwen35.block_count 65 --force
# 3) imatrix on the grafted BF16, then quant -> ROCmFP4 with genuine f16 embeddings
llama-imatrix -m OBLITERATED-MTP-BF16.gguf -f general+code-calib.txt -o obliterated.imatrix -c 512 -ngl 999
llama-quantize --token-embedding-type f16 --imatrix obliterated.imatrix \
OBLITERATED-MTP-BF16.gguf Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-STRIX-embF16-imatrix.gguf Q4_0_ROCMFP4_STRIX
# headQ6 variant adds the Q6_K output head
llama-quantize --token-embedding-type f16 --output-tensor-type q6_K --imatrix obliterated.imatrix \
OBLITERATED-MTP-BF16.gguf Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf Q4_0_ROCMFP4_STRIX
Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.
this ROCmFP4 quant ──quantized──▶ OBLITERATUS/Qwen3.6-27B-OBLITERATED ──abliterated──▶ Qwen/Qwen3.6-27B
A 4-bit Strix-Halo quant of OBLITERATUS's abliterated 27B. The abliteration (refusal-direction removal + source interpolation) is upstream work; we add the ROCmFP4 quant, genuine f16 embeddings, and a grafted MTP head.
Sibling ROCmFP4 Strix Halo models — Qwen3.6-27B-MTP · Qwen3.6-35B-A3B-MTP · Qwopus3.6-27B-Coder-MTP · Qwen3.6-40B-Deckard-MTP · Qwen3-Coder-Next · Nex-N2-mini
Derivative quantization — verify the base model's license before redistribution / use.
- Downloads last month
- 4,408
16-bit