PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3.6-40B-DECKARD-MTP
4-BIT ROCmFP4 · GRAFTED MTP SELF-SPECULATIVE DECODE · VISION-CAPABLE · SINGLE AMD APU
FORMAT
ROCmFP4 4-BIT
PRECISION
~4.5 BPW
SIZE
21.8–23.2 GB
CONTEXT
256 K
DRAFT
GRAFTED MTP n-max 5
EMBEDDINGS
Q8_0
VISION
QWEN3-VL 5120
LICENSE
APACHE-2.0
⚠ REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.
NOTE // Ignore HuggingFace's auto-detected "F16" / 16-bit badge — its parser can't read ROCmFP4 and mislabels by the Q8 embeddings. These are ~4.5 bpw 4-bit files; pick by filename in Files and versions.
NOTE // MULTI-HOP COMMUNITY DERIVATIVE. This is a frankenmerge with a grafted MTP head, then 4-bit quantized — several steps removed from any officially-released model. Functionally verified (loads, runs, MTP drafts, ~25 t/s, coherent code); the imatrix benefit is measured (KL/PPL below) but absolute coding/reasoning quality is not benchmarked and there is no comparison vs the full source model. Treat it as a "runs with working MTP on Strix Halo, with a measured imatrix improvement" artifact — reproduce / evaluate before relying on it.
01 · FILES

One file — the best speed/quality balance in ROCmFP4 for this grafted-MTP Deckard merge. It keeps the two quality levers that are actually feltQ8 embeddings (matching the Q8 source exactly) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + a public-calibration imatrix + the grafted 27B MTP head:

File Size Output head Body quant Pick if
…-STRIX-embQ8-imatrix-headQ6.gguf22.2 GBQ6_Kimatrix · fastthe one build — best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body
  • Fast single-scale body (the ★ file): the body bulk uses the faster single-scale q4_0_rocmfp4_fast kernel — the best speed/quality balance point (see §04 for the sweep).
  • imatrix: the body + output head are quantized with an importance matrix (public general + code calibration — not DavidAU's private NEO dataset). It's a free quality bump: identical file size and decode speed, since an imatrix only changes how weights are rounded, not the format. See §05 for the full story + datasets, and §04 for the measured benefit.
  • Output head (output.weight): raised from 4-bit to Q6_K — a more faithful output head at a small decode cost.
  • Q8 embeddings (not f16): the source is Q8_0, so f16 would be fake-f16 bloat — Q8 matches the source precision exactly.

Repo also bundles the mmproj-F32.gguf Qwen3-VL vision projector (projection_dim 5120) and chat_template.jinja — froggeric's unified Qwen3.6 template (tool calls + inline <|think_off|>/<|think_on|> + vision).

02 · QUICK START

The base is an uncensored "…Thinking" merge with a code focus (NEO-CODE). Thinking is on by default; toggle it off per-request with "chat_template_kwargs": {"enable_thinking": false} (or --reasoning-format none) if you prefer. Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3.6-40B-Deckard-MTP-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
  --alias deckard-40b-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 262144 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision — the mmproj-F32.gguf Qwen3-VL projector is bundled in this repo (projection_dim 5120); omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set (see §03).

Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1allow use of the full 128 GB unified memory
-dev Vulkan0run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
-ngl 999 · -fa onoffload all layers · flash attention
-c 262144context length (256K — full trained context; the GGUF's n_ctx_train is 262144)
-b 2048 · -ub 256 · -t/-tb 16prefill batch / micro-batch · CPU threads
-ctk f16 · -ctv f16f16 KV cache — how we run it; drop to q8_0/q4_0 to use less memory
-cpent · -ctxcp · --cache-reuse · --cache-ram 65536cross-turn KV checkpointing + 64 GB resident reuse cache
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0Qwen3.6 recommended sampling
--spec-type draft-mtp · --spec-draft-n-max 5grafted MTP head, self-speculative; draft depth 5
--spec-draft-device Vulkan0 · -ngl all · type-k/v f16draft head on Vulkan, fully offloaded, f16 KV
--chat-template-file chat_template.jinjabundled froggeric template (tool calls + think-toggle + vision)
--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}parse reasoning into the deepseek field while keeping cross-turn cache; flip enable_thinking:false per-request to run thinking-off
--jinja --parallel 1 --metrics --no-mmapapply template · single slot · metrics · weights in RAM
03 · VISION

Qwen3-VL lineage — vision works via the bundled mmproj-F32.gguf projector with --mmproj (same LLM GGUF, no separate vision model). Its projection_dim is 5120, which matches this 40B's hidden size — the depth-merge keeps the 27B's hidden width despite the greater layer count, so the dense Qwen3.6 projector is dimensionally compatible. Shipped in this repo; add both lines to the launch command above:

  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024     # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail
NOTE // --image-min-tokens 1024 is the one flag that matters. Without it the server feeds too few image tokens and the model describes images incorrectly (right gist, wrong detail) — the server even logs a warning at load. Adding it fixes the misreads (verified on this quant: a label misread at default tokens read correctly with the flag). Also: it's a thinking model → for one-shot image Q&A use <|think_off|> or allow enough tokens, else the answer can come back empty.
NOTE // BUNDLED TEMPLATE. This repo includes chat_template.jinja (froggeric's unified Qwen3.6 template) — pass --chat-template-file chat_template.jinja for reliable tool calls, inline <|think_off|>/<|think_on|>, and vision in one template.
04 · PERFORMANCE & QUALITY
DECODE · short-context~25 t/s (Vulkan / Strix Halo)
MTP DRAFTworks on every variant (grafted 27B head)
ABSOLUTE CODING / REASONING EVALnot benchmarked · no comparison vs source model
QUANTIZATIONfast body + Q8 emb + Q6 head + imatrix (measured benefit below)

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. The recipe — fast single-scale body + Q8 embeddings + Q6 output head + imatrix — keeps the quality levers that are actually felt (faithful embeddings matched to the Q8 source, plus a sharper output head) on the fast body that decodes quickest. We ran the same body-kernel sweep here that we ran on the 27B: an all-dual-scale body trades decode speed for a KL improvement that sits inside the measurement noise, so the fast single-scale body is the right balance point and is the one build we ship.

We don't duplicate the full sweep numbers here — the methodology and the frontier table (KL-vs-reference by build, with the all-dual / selective-higher-precision / dynamic-K-quant comparisons) live on the 27B card (§05), and the conclusion carries over: within ROCmFP4, fast body + faithful embeddings + Q6 head is the optimal balance, and for the last bit of fidelity over throughput a higher-bit GGUF is the better grab (box below). The Deckard-specific reference here is the Q8 source, not BF16 — Deckard's only released full-precision weights are DavidAU's NEO-CODE Q8_0, so that Q8 is our near-lossless anchor.

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? DavidAU's NEO-CODE Q6_K / Q8_0 GGUFs (the source for this quant) run on this same fork — higher-bit, more faithful, slower. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, grab one of those.

Does the imatrix help? (measured KL/PPL). Measured on the ROCmFP4 fork (llama-perplexity) over a held-out wiki subset (~20 K tokens), using the Q8 source as the near-lossless reference (PPL 6.1495). Lower = closer to the reference = better:

Metric (vs Q8 ref) embQ8 (no imatrix) embQ8-imatrix Change
Perplexity6.35356.3377recovers ~8% of the 4-bit loss
Median KLD0.024190.02166−10.5%
99th-pct KLD0.35120.3028−14%
RMS Δp6.124%5.740%−6.3%
Same top token as Q889.17%89.94%+0.78 pp

Every robust metric moves the same direction, and the KL improvement holds across the whole distribution (median −10%, tail −14%) — so it's a real, if modest, gain. Magnitude is small (it's 4-bit imatrix), but it's free (same size/speed). Honest caveats: this is a narrow imatrix-vs-non-imatrix comparison on a ~20 K-token corpus (not an absolute quality benchmark), and the MTP draft head is unaffected (it's dormant in a normal forward pass — speculation is output-lossless either way).

NOTE // f16 MTP HEAD = WASH. We also tested an f16 MTP draft head vs the 4-bit head — it measured as a wash (no acceptance gain), so that experimental variant was dropped to keep the repo focused.
05 · BUILD (REPRODUCIBLE)
# 0) build an imatrix on the Q8 source — public calibration (Kalomaze groups_merged + froggeric code/technical),
#    NOT DavidAU's private neo1-v2. Compute on the Q8; the result is portable to llama-quantize.
llama-imatrix -m Qwen3.6-40B-Deck-...-Q8_0.gguf -f calib-general+code+technical.txt -o deckard-40b.imatrix -c 512 -ngl 999

# 1) graft a 27B nextn head into the 40B Q8_0 (raw binary copy) via PiehSoft's inject_mtp_40b.py
python inject_mtp_40b.py \
  --target Qwen3.6-40B-Deck-...-Q8_0.gguf \
  --donor  Qwen3.6-27B-MTP-BF16.gguf \
  --output deckard-40b-mtp.gguf --source-layer 64 --dest-layer 96

# 2) ** GOTCHA ** the inject script adds nextn tensors but does NOT bump block_count.
#    An injected MTP model needs block_count += nextn_predict_layers, or the loader
#    looks for the head one slot too low (e.g. "missing tensor blk.95.nextn..."):
python gguf_set_metadata.py deckard-40b-mtp.gguf qwen35.block_count 97 --force

# 3) THE ONE BUILD (the ★ file): fast single-scale body + Q8 embeddings + Q6 head + imatrix
#    — the best speed/quality balance (§04). --output-tensor-type q6_K raises the lm-head to Q6_K.
llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix deckard-40b.imatrix \
  deckard-40b-mtp.gguf  Qwen3.6-40B-Deckard-MTP-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Kernels: ~631 tensors on the fast single-scale q4_0_rocmfp4_fast bulk, ~122 on dual-scale q4_0_rocmfp4 (incl. the grafted head).

Why these are re-imatrixed. The base we quantize from is DavidAU's NEO-CODE Q8_0, and yes, that Q8 is itself a NEO imatrix GGUF (built with his neo1-v2 NEO imatrix — it's right in the file's metadata as quantize.imatrix.dataset = neo1-v2.txt). But that NEO imatrix does not help a 4-bit requant, for a simple reason:

  • An imatrix only does real work at low bit-widths, where it decides which weights to protect when bits are scarce. At Q8_0 it is essentially inert: Q8 is near-lossless and barely uses importance weighting at all, so a "NEO Q8" and a plain Q8 are nearly the same weights.
  • So the NEO benefit does not carry through when you requantize that Q8 down to 4-bit. The place an imatrix actually helps is during the 4-bit step — and our original ROCmFP4 files were quantized from the Q8 with no imatrix at that step.

The -imatrix variants fix exactly that: a fresh importance matrix, computed and applied during the 4-bit ROCmFP4 quantization (the bit-width where it counts). The calibration is not DavidAU's NEO dataset — his neo1-v2 is private and we do not have it. We used a standard, public mix instead:

  • Kalomaze's groups_merged (general) + froggeric's code + technical sets, all from the froggeric/imatrix dataset.
  • This is the same lineage Bartowski-style imatrix quants are built on (Kalomaze's groups_merged is the root of Dampf's calibration_datav3), just weighted a little more toward code since Deckard is a coder.

So the honest one-liner: these are imatrix quants of DavidAU's NEO-CODE Deckard — his NEO weights as the source, a public-calibration imatrix applied by us at the 4-bit step. We make no claim to his NEO recipe; the -imatrix name reflects what we added, not his brand.

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 · LINEAGE & CREDITS

This file is a stack of other people's work — credit where due. The MTP head is not native to the Deckard merge — it's grafted; the draft predictions are "borrowed" from a 27B and happen to align well, they were not trained on this model.

BASE MODELDavidAU/Qwen3.6-40B-…-Deckard-Heretic-Uncensored-Thinking — a ~40B Qwen3.6 frankenmerge (Claude-Opus-distilled, uncensored thinking/coder). Source used here: DavidAU's NEO-CODE Q8_0, specifically Qwen3.6-40B-Deck-Opus-NEO-CODE-HERE-2T-OT-HIGH-Q8_0.gguf from his …-NEO-CODE-Di-IMatrix-MAX-GGUF repo (verified byte-identical: sha256 4479dd10…a760de). It carries his NEO imatrix at Q8 — see §05 for why that doesn't help a 4-bit requant.
MTP-INJECTION METHODPiehSoft's inject_mtp_40b.py (from PiehSoft/Qwen3.6-40B-Deckard-MTP-Q6_K), which raw-copies a 27B nextn head into the 40B.
MTP DONOR HEADthe nextn block from a Qwen3.6-27B MTP model — this build uses a BF16 base-27B-MTP donor. (Donor choice is output-lossless — it only affects draft speed, and the candidate donors measured as a wash in sustained acceptance.)
FORMAT + RUNTIMEcharlie12345/rocmfp4-llama (llama.cpp, MIT)
CHAT TEMPLATEfroggeric/Qwen-Fixed-Chat-Templates

Derivative of the Deckard merge (Qwen3.6 lineage, Apache-2.0). Inherits the base model's terms — verify them before redistribution / use. The MTP graft and quant add no new restrictions.

Downloads last month
7,614
GGUF
Model size
0.5B params
Architecture
clip
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-40B-Deckard-MTP-ROCmFP4-GGUF

Collection including plunderstruck/Qwen3.6-40B-Deckard-MTP-ROCmFP4-GGUF