plunderstruck/Qwen3.6-40B-Deckard-MTP-ROCmFP4-GGUF

PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3.6-40B-DECKARD-MTP
4-BIT ROCmFP4 · GRAFTED MTP SELF-SPECULATIVE DECODE · VISION-CAPABLE · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
~4.5 BPW

      SIZE
21.8–23.2 GB

      CONTEXT
256 K

    

      DRAFT
GRAFTED MTP n-max 5

      EMBEDDINGS
Q8_0

      VISION
QWEN3-VL 5120

      LICENSE
APACHE-2.0

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.

NOTE // Ignore HuggingFace's auto-detected "F16" / 16-bit badge — its parser can't read ROCmFP4 and mislabels by the Q8 embeddings. These are ~4.5 bpw 4-bit files; pick by filename in Files and versions.

NOTE // MULTI-HOP COMMUNITY DERIVATIVE. This is a frankenmerge with a grafted MTP head, then 4-bit quantized — several steps removed from any officially-released model. Functionally verified (loads, runs, MTP drafts, ~25 t/s, coherent code); the imatrix benefit is measured (KL/PPL below) but absolute coding/reasoning quality is not benchmarked and there is no comparison vs the full source model. Treat it as a "runs with working MTP on Strix Halo, with a measured imatrix improvement" artifact — reproduce / evaluate before relying on it.

01 · FILES

One file — the best speed/quality balance in ROCmFP4 for this grafted-MTP Deckard merge. It keeps the two quality levers that are actually felt — Q8 embeddings (matching the Q8 source exactly) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + a public-calibration imatrix + the grafted 27B MTP head:

File	Size	Output head	Body quant	Pick if
`…-STRIX-embQ8-imatrix-headQ6.gguf` ★	22.2 GB	Q6_K	imatrix · fast	the one build — best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body

Fast single-scale body (the ★ file): the body bulk uses the faster single-scale q4_0_rocmfp4_fast kernel — the best speed/quality balance point (see §04 for the sweep).
imatrix: the body + output head are quantized with an importance matrix (public general + code calibration — not DavidAU's private NEO dataset). It's a free quality bump: identical file size and decode speed, since an imatrix only changes how weights are rounded, not the format. See §05 for the full story + datasets, and §04 for the measured benefit.
Output head (output.weight): raised from 4-bit to Q6_K — a more faithful output head at a small decode cost.
Q8 embeddings (not f16): the source is Q8_0, so f16 would be fake-f16 bloat — Q8 matches the source precision exactly.

Repo also bundles the mmproj-F32.gguf Qwen3-VL vision projector (projection_dim 5120) and chat_template.jinja — froggeric's unified Qwen3.6 template (tool calls + inline <|think_off|>/<|think_on|> + vision).

02 · QUICK START

The base is an uncensored "…Thinking" merge with a code focus (NEO-CODE). Thinking is on by default; toggle it off per-request with "chat_template_kwargs": {"enable_thinking": false} (or --reasoning-format none) if you prefer. Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3.6-40B-Deckard-MTP-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
  --alias deckard-40b-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 262144 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision — the mmproj-F32.gguf Qwen3-VL projector is bundled in this repo (projection_dim 5120); omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set (see §03).

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 262144`	context length (256K — full trained context; the GGUF's `n_ctx_train` is 262144)
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch · CPU threads
`-ctk f16 · -ctv f16`	f16 KV cache — how we run it; drop to `q8_0`/`q4_0` to use less memory
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing + 64 GB resident reuse cache
`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0`	Qwen3.6 recommended sampling
`--spec-type draft-mtp · --spec-draft-n-max 5`	grafted MTP head, self-speculative; draft depth 5
`--spec-draft-device Vulkan0 · -ngl all · type-k/v f16`	draft head on Vulkan, fully offloaded, f16 KV
`--chat-template-file chat_template.jinja`	bundled froggeric template (tool calls + think-toggle + vision)
`--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}`	parse reasoning into the deepseek field while keeping cross-turn cache; flip `enable_thinking:false` per-request to run thinking-off
`--jinja --parallel 1 --metrics --no-mmap`	apply template · single slot · metrics · weights in RAM

03 · VISION

Qwen3-VL lineage — vision works via the bundled mmproj-F32.gguf projector with --mmproj (same LLM GGUF, no separate vision model). Its projection_dim is 5120, which matches this 40B's hidden size — the depth-merge keeps the 27B's hidden width despite the greater layer count, so the dense Qwen3.6 projector is dimensionally compatible. Shipped in this repo; add both lines to the launch command above:

  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024     # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail

NOTE // --image-min-tokens 1024 is the one flag that matters. Without it the server feeds too few image tokens and the model describes images incorrectly (right gist, wrong detail) — the server even logs a warning at load. Adding it fixes the misreads (verified on this quant: a label misread at default tokens read correctly with the flag). Also: it's a thinking model → for one-shot image Q&A use <|think_off|> or allow enough tokens, else the answer can come back empty.

NOTE // BUNDLED TEMPLATE. This repo includes chat_template.jinja (froggeric's unified Qwen3.6 template) — pass --chat-template-file chat_template.jinja for reliable tool calls, inline <|think_off|>/<|think_on|>, and vision in one template.

04 · PERFORMANCE & QUALITY

DECODE · short-context	~25 t/s (Vulkan / Strix Halo)
MTP DRAFT	works on every variant (grafted 27B head)
ABSOLUTE CODING / REASONING EVAL	not benchmarked · no comparison vs source model
QUANTIZATION	fast body + Q8 emb + Q6 head + imatrix (measured benefit below)

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. The recipe — fast single-scale body + Q8 embeddings + Q6 output head + imatrix — keeps the quality levers that are actually felt (faithful embeddings matched to the Q8 source, plus a sharper output head) on the fast body that decodes quickest. We ran the same body-kernel sweep here that we ran on the 27B: an all-dual-scale body trades decode speed for a KL improvement that sits inside the measurement noise, so the fast single-scale body is the right balance point and is the one build we ship.

We don't duplicate the full sweep numbers here — the methodology and the frontier table (KL-vs-reference by build, with the all-dual / selective-higher-precision / dynamic-K-quant comparisons) live on the 27B card (§05), and the conclusion carries over: within ROCmFP4, fast body + faithful embeddings + Q6 head is the optimal balance, and for the last bit of fidelity over throughput a higher-bit GGUF is the better grab (box below). The Deckard-specific reference here is the Q8 source, not BF16 — Deckard's only released full-precision weights are DavidAU's NEO-CODE Q8_0, so that Q8 is our near-lossless anchor.

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? DavidAU's NEO-CODE Q6_K / Q8_0 GGUFs (the source for this quant) run on this same fork — higher-bit, more faithful, slower. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, grab one of those.

Does the imatrix help? (measured KL/PPL). Measured on the ROCmFP4 fork (llama-perplexity) over a held-out wiki subset (~20 K tokens), using the Q8 source as the near-lossless reference (PPL 6.1495). Lower = closer to the reference = better:

Metric (vs Q8 ref)	`embQ8` (no imatrix)	`embQ8-imatrix`	Change
Perplexity	6.3535	6.3377	recovers ~8% of the 4-bit loss
Median KLD	0.02419	0.02166	−10.5%
99th-pct KLD	0.3512	0.3028	−14%
RMS Δp	6.124%	5.740%	−6.3%
Same top token as Q8	89.17%	89.94%	+0.78 pp

Every robust metric moves the same direction, and the KL improvement holds across the whole distribution (median −10%, tail −14%) — so it's a real, if modest, gain. Magnitude is small (it's 4-bit imatrix), but it's free (same size/speed). Honest caveats: this is a narrow imatrix-vs-non-imatrix comparison on a ~20 K-token corpus (not an absolute quality benchmark), and the MTP draft head is unaffected (it's dormant in a normal forward pass — speculation is output-lossless either way).

NOTE // f16 MTP HEAD = WASH. We also tested an f16 MTP draft head vs the 4-bit head — it measured as a wash (no acceptance gain), so that experimental variant was dropped to keep the repo focused.

05 · BUILD (REPRODUCIBLE)

# 0) build an imatrix on the Q8 source — public calibration (Kalomaze groups_merged + froggeric code/technical),
#    NOT DavidAU's private neo1-v2. Compute on the Q8; the result is portable to llama-quantize.
llama-imatrix -m Qwen3.6-40B-Deck-...-Q8_0.gguf -f calib-general+code+technical.txt -o deckard-40b.imatrix -c 512 -ngl 999

# 1) graft a 27B nextn head into the 40B Q8_0 (raw binary copy) via PiehSoft's inject_mtp_40b.py
python inject_mtp_40b.py \
  --target Qwen3.6-40B-Deck-...-Q8_0.gguf \
  --donor  Qwen3.6-27B-MTP-BF16.gguf \
  --output deckard-40b-mtp.gguf --source-layer 64 --dest-layer 96

# 2) ** GOTCHA ** the inject script adds nextn tensors but does NOT bump block_count.
#    An injected MTP model needs block_count += nextn_predict_layers, or the loader
#    looks for the head one slot too low (e.g. "missing tensor blk.95.nextn..."):
python gguf_set_metadata.py deckard-40b-mtp.gguf qwen35.block_count 97 --force

# 3) THE ONE BUILD (the ★ file): fast single-scale body + Q8 embeddings + Q6 head + imatrix
#    — the best speed/quality balance (§04). --output-tensor-type q6_K raises the lm-head to Q6_K.
llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix deckard-40b.imatrix \
  deckard-40b-mtp.gguf  Qwen3.6-40B-Deckard-MTP-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Kernels: ~631 tensors on the fast single-scale q4_0_rocmfp4_fast bulk, ~122 on dual-scale q4_0_rocmfp4 (incl. the grafted head).

Why these are re-imatrixed. The base we quantize from is DavidAU's NEO-CODE Q8_0, and yes, that Q8 is itself a NEO imatrix GGUF (built with his neo1-v2 NEO imatrix — it's right in the file's metadata as quantize.imatrix.dataset = neo1-v2.txt). But that NEO imatrix does not help a 4-bit requant, for a simple reason:

An imatrix only does real work at low bit-widths, where it decides which weights to protect when bits are scarce. At Q8_0 it is essentially inert: Q8 is near-lossless and barely uses importance weighting at all, so a "NEO Q8" and a plain Q8 are nearly the same weights.
So the NEO benefit does not carry through when you requantize that Q8 down to 4-bit. The place an imatrix actually helps is during the 4-bit step — and our original ROCmFP4 files were quantized from the Q8 with no imatrix at that step.

The -imatrix variants fix exactly that: a fresh importance matrix, computed and applied during the 4-bit ROCmFP4 quantization (the bit-width where it counts). The calibration is not DavidAU's NEO dataset — his neo1-v2 is private and we do not have it. We used a standard, public mix instead:

Kalomaze's groups_merged (general) + froggeric's code + technical sets, all from the froggeric/imatrix dataset.
This is the same lineage Bartowski-style imatrix quants are built on (Kalomaze's groups_merged is the root of Dampf's calibration_datav3), just weighted a little more toward code since Deckard is a coder.

So the honest one-liner: these are imatrix quants of DavidAU's NEO-CODE Deckard — his NEO weights as the source, a public-calibration imatrix applied by us at the 4-bit step. We make no claim to his NEO recipe; the -imatrix name reflects what we added, not his brand.

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 · LINEAGE & CREDITS

This file is a stack of other people's work — credit where due. The MTP head is not native to the Deckard merge — it's grafted; the draft predictions are "borrowed" from a 27B and happen to align well, they were not trained on this model.

BASE MODEL	DavidAU/Qwen3.6-40B-…-Deckard-Heretic-Uncensored-Thinking — a ~40B Qwen3.6 frankenmerge (Claude-Opus-distilled, uncensored thinking/coder). Source used here: DavidAU's NEO-CODE Q8_0, specifically `Qwen3.6-40B-Deck-Opus-NEO-CODE-HERE-2T-OT-HIGH-Q8_0.gguf` from his …-NEO-CODE-Di-IMatrix-MAX-GGUF repo (verified byte-identical: sha256 `4479dd10…a760de`). It carries his NEO imatrix at Q8 — see §05 for why that doesn't help a 4-bit requant.
MTP-INJECTION METHOD	PiehSoft's `inject_mtp_40b.py` (from PiehSoft/Qwen3.6-40B-Deckard-MTP-Q6_K), which raw-copies a 27B `nextn` head into the 40B.
MTP DONOR HEAD	the `nextn` block from a Qwen3.6-27B MTP model — this build uses a BF16 base-27B-MTP donor. (Donor choice is output-lossless — it only affects draft speed, and the candidate donors measured as a wash in sustained acceptance.)
FORMAT + RUNTIME	charlie12345/rocmfp4-llama (llama.cpp, MIT)
CHAT TEMPLATE	froggeric/Qwen-Fixed-Chat-Templates

Derivative of the Deckard merge (Qwen3.6 lineage, Apache-2.0). Inherits the base model's terms — verify them before redistribution / use. The MTP graft and quant add no new restrictions.