PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3.6-27B-OBLITERATED-MTP
4-BIT ROCmFP4 · ABLITERATED / UNCENSORED · GRAFTED MTP SELF-SPECULATIVE DECODE · VISION-CAPABLE · SINGLE AMD APU
FORMAT
ROCmFP4 4-BIT
PRECISION
~4.8 BPW
SIZE
~15 GB
CONTEXT
262 K
DRAFT
MTP n-max 5 (GRAFTED)
VISION
QWEN3-VL
BACKEND
VULKAN0
LICENSE
APACHE-2.0
⚠ REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.
NOTE // Ignore HuggingFace's auto-detected "F16"/16-bit badge — its parser can't read ROCmFP4 and mislabels by the f16 embeddings. These are ~4.8 bpw 4-bit files; pick by filename.
NOTE // Uncensored: this is an abliterated model — the refusal direction was removed (with source-weight interpolation to retain capability) upstream by OBLITERATUS. It will answer prompts a stock Qwen3.6 would refuse. Abliteration is upstream work; verify behavior before relying on it.
01 · FILES
File Size Output head Pick if
…-STRIX-embF16-imatrix-headQ6.gguf~15.5 GBQ6_Kthe one build — best speed/quality balance: f16 embeddings + Q6 output head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt — genuine f16 token embeddings (from BF16) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + a general+code-calibrated imatrix + the grafted MTP draft head. Repo also bundles the mmproj-F32.gguf Qwen3-VL vision projector and chat_template.jinja (froggeric's unified Qwen3.6 template — tool calls + inline <|think_off|>/<|think_on|> + vision).

02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf \
  --alias obliterated-27b-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 262144 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision — the mmproj-F32.gguf Qwen3-VL projector is bundled in this repo (projection_dim 5120); omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set (see §03).

Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1allow use of the full 128 GB unified memory
-dev Vulkan0run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
-ngl 999 · -fa onoffload all layers · flash attention
-c 262144context length (256K)
-b 2048 · -ub 256 · -t/-tb 16prefill batch / micro-batch · CPU threads
-ctk f16 · -ctv f16f16 KV cache — how we run it; drop to q8_0/q4_0 to use less memory
-cpent · -ctxcp · --cache-reuse · --cache-ram 65536cross-turn KV checkpointing + 64 GB resident reuse cache
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0Qwen3.6 recommended sampling
--spec-type draft-mtp · --spec-draft-n-max 5grafted MTP head, self-speculative; draft depth 5
--spec-draft-device Vulkan0 · -ngl all · type-k/v f16draft head on Vulkan, fully offloaded, f16 KV (matches the main model)
--chat-template-file chat_template.jinjabundled froggeric template (tool calls + think-toggle + vision)
--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}thinking enabled, deepseek-style parsing; keep cross-turn cache
--jinja --parallel 1 --metrics --no-mmapapply template · single slot · metrics · weights in RAM
03 · VISION

Qwen3-VL lineage — vision works via the bundled mmproj-F32.gguf projector with --mmproj (same LLM GGUF, no separate vision model). It's the Qwen3-VL projector (projection_dim 5120, matches this model's hidden size), shipped in this repo:

  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024     # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail

Without --image-min-tokens 1024 the server feeds too few image tokens and the model describes images incorrectly (right gist, wrong detail — the server even logs a warning at load). Verified on this model: a code label misread at default tokens read correctly once the flag was set.

NOTE // thinking model → for one-shot image Q&A use <|think_off|> (the bundled chat_template.jinja) or allow enough tokens, else the answer can come back empty. With --mmproj loaded the server disables the --cache-reuse feature (it logs "cache_reuse is not supported by multimodal"); whether ordinary cross-turn caching still helps with vision isn't something we've benchmarked.
04 · PERFORMANCE & QUALITY

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. The two quality levers that are actually felt — genuine f16 token embeddings and a Q6_K output head — sit on the fast single-scale q4_0_rocmfp4_fast body. The alternatives we'd otherwise reach for (an all-dual-scale body, selective higher-precision tensors) buy a KL improvement that sits inside the measurement noise while costing decode speed — so the fast body is the right point. A leaner Q5-embedding build is a few tok/s faster but degrades the one lever you notice; we keep full f16 embeddings.

We ran this exact sweep in full on the sibling Qwen3.6-27B base card — every rocmfp4 lever measured by KL divergence vs the BF16 reference plus llama-bench decode. The frontier there is the same shape it is here: the fast body + f16 emb + Q6 head is the balance point, all-dual-scale and selective higher-precision land inside the noise, and the dynamic K-quant is the fidelity ceiling that rocmfp4's FP4 can't out-allocate. See that card for the numbers and the full experiments table; OBLITERATED follows the same recipe.

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? There's no Unsloth UD-quant of this abliterated model, so for the last bit of fidelity grab a Q6_K / Q8 GGUF of the base abliterated model from OBLITERATUS/Qwen3.6-27B-OBLITERATED. Those higher-bit GGUFs run on this same fork — trading lower KL for slower decode. We optimize for throughput in ROCmFP4; if you want fidelity over speed, that's the one to grab.

Grafted MTP head — measured. OBLITERATED ships no MTP head, so we transplanted a nextn block from a Qwen3.6-27B-MTP BF16 donor (output-lossless — the draft head only affects speed, never the tokens). Because OBLITERATED is abliterated from that same base, the borrowed head tracks it closely. Measured on the ROCmFP4 fork (llama-server --spec-type draft-mtp, f16 KV on target and drafter, 4 prompt types, temp 0.6):

Content Decode t/s (MTP) t/s (no MTP) Draft acceptance
math / reasoning35.114.088.1%
technical29.514.169.1%
code28.814.167.5%
prose / creative24.614.052.2%
average29.514.167.7%

~2.1× faster decode (14.1 → 29.5 t/s) at 67.7% token acceptance — high on structured/predictable text, lower on freeform prose, the profile of a borrowed (not natively-trained) head. KV is f16 on both the main model and the draft; the draft head is kept at 4-bit ROCmFP4.

The imatrix WINS here — measured. Quantized with an importance matrix from a public general+code calibration mix (Kalomaze groups_merged + froggeric code/technical, via froggeric/imatrix). Measured by KL-divergence + perplexity vs the true BF16 on a held-out general slice, imatrix vs no-imatrix:

Metric (vs BF16, held-out general) No-imatrix Imatrix Change
Perplexity+3.08%+2.58%recovers ~16% of the 4-bit loss
Median KLD0.020700.01843−11%
Mean KLD0.042390.03961−6.6%
99th-pct KLD0.37290.3298−12%
RMS Δp6.37%6.13%−3.7%
Same top token as BF1690.59%91.06%+0.47 pp

For OBLITERATED the imatrix is a clean win on every robust metric — it behaves like its base Qwen3.6-27B, not like the dense Qwopus-Coder (where the same recipe worsened code-PPL — but that was a code metric; this is a general model measured on general text). Always measure; we did.

NOTE // Quality scope: the KL/PPL above is a fidelity-vs-BF16 measurement on ~20 K tokens of held-out general text, not an absolute benchmark. The MTP head is borrowed (not trained on this model) — output-lossless, affecting speed only. Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive. Not native FP4 tensor-core execution.
05 · BUILD (REPRODUCIBLE)
# 0) convert the abliterated safetensors -> BF16 GGUF
python convert_hf_to_gguf.py OBLITERATED-27b/ --outtype bf16 --outfile OBLITERATED-BF16.gguf

# 1) ** GOTCHA ** OBLITERATED's config declares `mtp_num_hidden_layers: 1` but ships NO MTP weights,
#    so the convert labels it block_count=65 with only 64 real layers -> "missing tensor blk.64.*" on load.
#    The real model is 64 layers; the donor's nextn becomes the new blk.64.

# 2) graft a Qwen3.6-27B nextn head onto blk.64, then set block_count = 65
python inject_mtp_40b.py --target OBLITERATED-BF16.gguf --donor Qwen3.6-27B-MTP-BF16.gguf \
  --output OBLITERATED-MTP-BF16.gguf --source-layer 64 --dest-layer 64
python gguf_set_metadata.py OBLITERATED-MTP-BF16.gguf qwen35.block_count 65 --force

# 3) imatrix on the grafted BF16, then quant -> ROCmFP4 with genuine f16 embeddings
llama-imatrix -m OBLITERATED-MTP-BF16.gguf -f general+code-calib.txt -o obliterated.imatrix -c 512 -ngl 999
llama-quantize --token-embedding-type f16 --imatrix obliterated.imatrix \
  OBLITERATED-MTP-BF16.gguf  Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-STRIX-embF16-imatrix.gguf  Q4_0_ROCMFP4_STRIX

# headQ6 variant adds the Q6_K output head
llama-quantize --token-embedding-type f16 --output-tensor-type q6_K --imatrix obliterated.imatrix \
  OBLITERATED-MTP-BF16.gguf  Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 · LINEAGE & CREDITS
this ROCmFP4 quant  ──quantized──▶  OBLITERATUS/Qwen3.6-27B-OBLITERATED  ──abliterated──▶  Qwen/Qwen3.6-27B

A 4-bit Strix-Halo quant of OBLITERATUS's abliterated 27B. The abliteration (refusal-direction removal + source interpolation) is upstream work; we add the ROCmFP4 quant, genuine f16 embeddings, and a grafted MTP head.

BASE MODELOBLITERATUS/Qwen3.6-27B-OBLITERATED (Apache-2.0) · abliterated from Qwen/Qwen3.6-27B (Qwen team)
MTP DONORa Qwen3.6-27B-MTP nextn head (graft is output-lossless)
CALIBRATIONKalomaze groups_merged + froggeric code/technical via froggeric/imatrix
CHAT TEMPLATEfroggeric/Qwen-Fixed-Chat-Templates
FORMAT + RUNTIMEcharlie12345/rocmfp4-llama (llama.cpp, MIT)

Sibling ROCmFP4 Strix Halo modelsQwen3.6-27B-MTP · Qwen3.6-35B-A3B-MTP · Qwopus3.6-27B-Coder-MTP · Qwen3.6-40B-Deckard-MTP · Qwen3-Coder-Next · Nex-N2-mini

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month
4,408
GGUF
Model size
0.5B params
Architecture
clip
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(4)
this model

Collection including plunderstruck/Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-GGUF