plunderstruck/Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-GGUF

PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3.6-27B-OBLITERATED-MTP
4-BIT ROCmFP4 · ABLITERATED / UNCENSORED · GRAFTED MTP SELF-SPECULATIVE DECODE · VISION-CAPABLE · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
~4.8 BPW

      SIZE
~15 GB

      CONTEXT
262 K

    

      DRAFT
MTP n-max 5 (GRAFTED)

      VISION
QWEN3-VL

      BACKEND
VULKAN0

      LICENSE
APACHE-2.0

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.

NOTE // Ignore HuggingFace's auto-detected "F16"/16-bit badge — its parser can't read ROCmFP4 and mislabels by the f16 embeddings. These are ~4.8 bpw 4-bit files; pick by filename.

NOTE // Uncensored: this is an abliterated model — the refusal direction was removed (with source-weight interpolation to retain capability) upstream by OBLITERATUS. It will answer prompts a stock Qwen3.6 would refuse. Abliteration is upstream work; verify behavior before relying on it.

01 · FILES

File	Size	Output head	Pick if
`…-STRIX-embF16-imatrix-headQ6.gguf` ★	~15.5 GB	Q6_K	the one build — best speed/quality balance: f16 embeddings + Q6 output head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt — genuine f16 token embeddings (from BF16) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + a general+code-calibrated imatrix + the grafted MTP draft head. Repo also bundles the mmproj-F32.gguf Qwen3-VL vision projector and chat_template.jinja (froggeric's unified Qwen3.6 template — tool calls + inline <|think_off|>/<|think_on|> + vision).

02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf \
  --alias obliterated-27b-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 262144 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision — the mmproj-F32.gguf Qwen3-VL projector is bundled in this repo (projection_dim 5120); omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set (see §03).

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 262144`	context length (256K)
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch · CPU threads
`-ctk f16 · -ctv f16`	f16 KV cache — how we run it; drop to `q8_0`/`q4_0` to use less memory
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing + 64 GB resident reuse cache
`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0`	Qwen3.6 recommended sampling
`--spec-type draft-mtp · --spec-draft-n-max 5`	grafted MTP head, self-speculative; draft depth 5
`--spec-draft-device Vulkan0 · -ngl all · type-k/v f16`	draft head on Vulkan, fully offloaded, f16 KV (matches the main model)
`--chat-template-file chat_template.jinja`	bundled froggeric template (tool calls + think-toggle + vision)
`--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}`	thinking enabled, deepseek-style parsing; keep cross-turn cache
`--jinja --parallel 1 --metrics --no-mmap`	apply template · single slot · metrics · weights in RAM

03 · VISION

Qwen3-VL lineage — vision works via the bundled mmproj-F32.gguf projector with --mmproj (same LLM GGUF, no separate vision model). It's the Qwen3-VL projector (projection_dim 5120, matches this model's hidden size), shipped in this repo:

  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024     # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail

Without --image-min-tokens 1024 the server feeds too few image tokens and the model describes images incorrectly (right gist, wrong detail — the server even logs a warning at load). Verified on this model: a code label misread at default tokens read correctly once the flag was set.

NOTE // thinking model → for one-shot image Q&A use <|think_off|> (the bundled chat_template.jinja) or allow enough tokens, else the answer can come back empty. With --mmproj loaded the server disables the --cache-reuse feature (it logs "cache_reuse is not supported by multimodal"); whether ordinary cross-turn caching still helps with vision isn't something we've benchmarked.

04 · PERFORMANCE & QUALITY

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. The two quality levers that are actually felt — genuine f16 token embeddings and a Q6_K output head — sit on the fast single-scale q4_0_rocmfp4_fast body. The alternatives we'd otherwise reach for (an all-dual-scale body, selective higher-precision tensors) buy a KL improvement that sits inside the measurement noise while costing decode speed — so the fast body is the right point. A leaner Q5-embedding build is a few tok/s faster but degrades the one lever you notice; we keep full f16 embeddings.

We ran this exact sweep in full on the sibling Qwen3.6-27B base card — every rocmfp4 lever measured by KL divergence vs the BF16 reference plus llama-bench decode. The frontier there is the same shape it is here: the fast body + f16 emb + Q6 head is the balance point, all-dual-scale and selective higher-precision land inside the noise, and the dynamic K-quant is the fidelity ceiling that rocmfp4's FP4 can't out-allocate. See that card for the numbers and the full experiments table; OBLITERATED follows the same recipe.

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? There's no Unsloth UD-quant of this abliterated model, so for the last bit of fidelity grab a Q6_K / Q8 GGUF of the base abliterated model from OBLITERATUS/Qwen3.6-27B-OBLITERATED. Those higher-bit GGUFs run on this same fork — trading lower KL for slower decode. We optimize for throughput in ROCmFP4; if you want fidelity over speed, that's the one to grab.

Grafted MTP head — measured. OBLITERATED ships no MTP head, so we transplanted a nextn block from a Qwen3.6-27B-MTP BF16 donor (output-lossless — the draft head only affects speed, never the tokens). Because OBLITERATED is abliterated from that same base, the borrowed head tracks it closely. Measured on the ROCmFP4 fork (llama-server --spec-type draft-mtp, f16 KV on target and drafter, 4 prompt types, temp 0.6):

Content	Decode t/s (MTP)	t/s (no MTP)	Draft acceptance
math / reasoning	35.1	14.0	88.1%
technical	29.5	14.1	69.1%
code	28.8	14.1	67.5%
prose / creative	24.6	14.0	52.2%
average	29.5	14.1	67.7%

~2.1× faster decode (14.1 → 29.5 t/s) at 67.7% token acceptance — high on structured/predictable text, lower on freeform prose, the profile of a borrowed (not natively-trained) head. KV is f16 on both the main model and the draft; the draft head is kept at 4-bit ROCmFP4.

The imatrix WINS here — measured. Quantized with an importance matrix from a public general+code calibration mix (Kalomaze groups_merged + froggeric code/technical, via froggeric/imatrix). Measured by KL-divergence + perplexity vs the true BF16 on a held-out general slice, imatrix vs no-imatrix:

Metric (vs BF16, held-out general)	No-imatrix	Imatrix	Change
Perplexity	+3.08%	+2.58%	recovers ~16% of the 4-bit loss
Median KLD	0.02070	0.01843	−11%
Mean KLD	0.04239	0.03961	−6.6%
99th-pct KLD	0.3729	0.3298	−12%
RMS Δp	6.37%	6.13%	−3.7%
Same top token as BF16	90.59%	91.06%	+0.47 pp

For OBLITERATED the imatrix is a clean win on every robust metric — it behaves like its base Qwen3.6-27B, not like the dense Qwopus-Coder (where the same recipe worsened code-PPL — but that was a code metric; this is a general model measured on general text). Always measure; we did.

NOTE // Quality scope: the KL/PPL above is a fidelity-vs-BF16 measurement on ~20 K tokens of held-out general text, not an absolute benchmark. The MTP head is borrowed (not trained on this model) — output-lossless, affecting speed only. Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive. Not native FP4 tensor-core execution.

05 · BUILD (REPRODUCIBLE)

# 0) convert the abliterated safetensors -> BF16 GGUF
python convert_hf_to_gguf.py OBLITERATED-27b/ --outtype bf16 --outfile OBLITERATED-BF16.gguf

# 1) ** GOTCHA ** OBLITERATED's config declares `mtp_num_hidden_layers: 1` but ships NO MTP weights,
#    so the convert labels it block_count=65 with only 64 real layers -> "missing tensor blk.64.*" on load.
#    The real model is 64 layers; the donor's nextn becomes the new blk.64.

# 2) graft a Qwen3.6-27B nextn head onto blk.64, then set block_count = 65
python inject_mtp_40b.py --target OBLITERATED-BF16.gguf --donor Qwen3.6-27B-MTP-BF16.gguf \
  --output OBLITERATED-MTP-BF16.gguf --source-layer 64 --dest-layer 64
python gguf_set_metadata.py OBLITERATED-MTP-BF16.gguf qwen35.block_count 65 --force

# 3) imatrix on the grafted BF16, then quant -> ROCmFP4 with genuine f16 embeddings
llama-imatrix -m OBLITERATED-MTP-BF16.gguf -f general+code-calib.txt -o obliterated.imatrix -c 512 -ngl 999
llama-quantize --token-embedding-type f16 --imatrix obliterated.imatrix \
  OBLITERATED-MTP-BF16.gguf  Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-STRIX-embF16-imatrix.gguf  Q4_0_ROCMFP4_STRIX

# headQ6 variant adds the Q6_K output head
llama-quantize --token-embedding-type f16 --output-tensor-type q6_K --imatrix obliterated.imatrix \
  OBLITERATED-MTP-BF16.gguf  Qwen3.6-27B-OBLITERATED-MTP-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 · LINEAGE & CREDITS

this ROCmFP4 quant  ──quantized──▶  OBLITERATUS/Qwen3.6-27B-OBLITERATED  ──abliterated──▶  Qwen/Qwen3.6-27B

A 4-bit Strix-Halo quant of OBLITERATUS's abliterated 27B. The abliteration (refusal-direction removal + source interpolation) is upstream work; we add the ROCmFP4 quant, genuine f16 embeddings, and a grafted MTP head.

BASE MODEL	OBLITERATUS/Qwen3.6-27B-OBLITERATED (Apache-2.0) · abliterated from Qwen/Qwen3.6-27B (Qwen team)
MTP DONOR	a Qwen3.6-27B-MTP `nextn` head (graft is output-lossless)
CALIBRATION	Kalomaze `groups_merged` + froggeric `code`/`technical` via froggeric/imatrix
CHAT TEMPLATE	froggeric/Qwen-Fixed-Chat-Templates
FORMAT + RUNTIME	charlie12345/rocmfp4-llama (llama.cpp, MIT)