▗▇▇▇▇▇▇▇▖
▗█▘▝██████▖
▗▛ ▝██████▆▆▆▆▆▆▆▆▆▆▅
▟▛ ▗█████████████████▙▖
▄▄▄▄▄▟▛ ▟████████████████████▖
▗██▌ ▚▖ ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘
▗████▖ ▜▖ ▗█▘
▜█████▙ ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙
▜█████▙ ▝████████████▛ ▜▙
▜█████▙ ▝██████████▛ ▃ ▜▙
▀█████▙▖ ▝████████▘ ▟█▙ ▀▙
▝██████▖ ▝▜█████▘ ▟███▙▂▂▂▂▐█
▟███████▖ ▜███▘ ▗███████████▛
▟█████████▄ ▜▛ ▗███████████▀
▝█████▀ ▗▛ ▗██████▀▀▀▀▀▘
▜██▘ ▗▛ ▟█████▛▘
▜█▇▇▇▇▇▇▇▇▇█▖ ▟█████▛
▝█▖ ▟█████▛
▝███████▀
FORMAT ROCmFP4 4-BIT |
PRECISION ~4.5 BPW |
SIZE 21.8–23.2 GB |
CONTEXT 256 K |
DRAFT GRAFTED MTP n-max 5 |
EMBEDDINGS Q8_0 |
VISION QWEN3-VL 5120 |
LICENSE APACHE-2.0 |
The custom
q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.
One file — the best speed/quality balance in ROCmFP4 for this grafted-MTP Deckard merge. It keeps the two quality levers that are actually felt — Q8 embeddings (matching the Q8 source exactly) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + a public-calibration imatrix + the grafted 27B MTP head:
- Fast single-scale body (the ★ file): the body bulk uses the faster single-scale
q4_0_rocmfp4_fastkernel — the best speed/quality balance point (see §04 for the sweep). - imatrix: the body + output head are quantized with an importance matrix (public general + code calibration — not DavidAU's private NEO dataset). It's a free quality bump: identical file size and decode speed, since an imatrix only changes how weights are rounded, not the format. See §05 for the full story + datasets, and §04 for the measured benefit.
- Output head (
output.weight): raised from 4-bit toQ6_K— a more faithful output head at a small decode cost. - Q8 embeddings (not f16): the source is Q8_0, so f16 would be fake-f16 bloat — Q8 matches the source precision exactly.
Repo also bundles the mmproj-F32.gguf Qwen3-VL vision projector (projection_dim 5120) and chat_template.jinja — froggeric's unified Qwen3.6 template (tool calls + inline <|think_off|>/<|think_on|> + vision).
The base is an uncensored "…Thinking" merge with a code focus (NEO-CODE). Thinking is on by default; toggle it off per-request with "chat_template_kwargs": {"enable_thinking": false} (or --reasoning-format none) if you prefer. Run from the folder holding the .gguf + chat_template.jinja:
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
-m Qwen3.6-40B-Deckard-MTP-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
--alias deckard-40b-mtp \
--host 0.0.0.0 \
--port 8080 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-c 262144 \
-b 2048 \
-ub 256 \
-t 16 \
-tb 16 \
-ctk f16 \
-ctv f16 \
-cpent 256 \
-ctxcp 32 \
--cache-reuse 256 \
--cache-ram 65536 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--spec-type draft-mtp \
--spec-draft-device Vulkan0 \
--spec-draft-ngl all \
--spec-draft-type-k f16 \
--spec-draft-type-v f16 \
--spec-draft-n-max 5 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.0 \
--spec-draft-p-split 0.10 \
--chat-template-file chat_template.jinja \
--reasoning on \
--reasoning-format deepseek \
--chat-template-kwargs '{"preserve_thinking": true}' \
--jinja \
--parallel 1 \
--metrics \
--no-mmap \
--mmproj mmproj-F32.gguf \
--image-min-tokens 1024
The last two lines enable vision — the mmproj-F32.gguf Qwen3-VL projector is bundled in this repo (projection_dim 5120); omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set (see §03).
Qwen3-VL lineage — vision works via the bundled mmproj-F32.gguf projector with --mmproj (same LLM GGUF, no separate vision model). Its projection_dim is 5120, which matches this 40B's hidden size — the depth-merge keeps the 27B's hidden width despite the greater layer count, so the dense Qwen3.6 projector is dimensionally compatible. Shipped in this repo; add both lines to the launch command above:
--mmproj mmproj-F32.gguf \
--image-min-tokens 1024 # REQUIRED — Qwen-VL needs >=1024 image tokens or it misreads fine detail
--image-min-tokens 1024 is the one flag that matters. Without it the server feeds too few image tokens and the model describes images incorrectly (right gist, wrong detail) — the server even logs a warning at load. Adding it fixes the misreads (verified on this quant: a label misread at default tokens read correctly with the flag). Also: it's a thinking model → for one-shot image Q&A use <|think_off|> or allow enough tokens, else the answer can come back empty.
chat_template.jinja (froggeric's unified Qwen3.6 template) — pass --chat-template-file chat_template.jinja for reliable tool calls, inline <|think_off|>/<|think_on|>, and vision in one template.
This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. The recipe — fast single-scale body + Q8 embeddings + Q6 output head + imatrix — keeps the quality levers that are actually felt (faithful embeddings matched to the Q8 source, plus a sharper output head) on the fast body that decodes quickest. We ran the same body-kernel sweep here that we ran on the 27B: an all-dual-scale body trades decode speed for a KL improvement that sits inside the measurement noise, so the fast single-scale body is the right balance point and is the one build we ship.
We don't duplicate the full sweep numbers here — the methodology and the frontier table (KL-vs-reference by build, with the all-dual / selective-higher-precision / dynamic-K-quant comparisons) live on the 27B card (§05), and the conclusion carries over: within ROCmFP4, fast body + faithful embeddings + Q6 head is the optimal balance, and for the last bit of fidelity over throughput a higher-bit GGUF is the better grab (box below). The Deckard-specific reference here is the Q8 source, not BF16 — Deckard's only released full-precision weights are DavidAU's NEO-CODE Q8_0, so that Q8 is our near-lossless anchor.
Does the imatrix help? (measured KL/PPL). Measured on the ROCmFP4 fork (llama-perplexity) over a held-out wiki subset (~20 K tokens), using the Q8 source as the near-lossless reference (PPL 6.1495). Lower = closer to the reference = better:
Every robust metric moves the same direction, and the KL improvement holds across the whole distribution (median −10%, tail −14%) — so it's a real, if modest, gain. Magnitude is small (it's 4-bit imatrix), but it's free (same size/speed). Honest caveats: this is a narrow imatrix-vs-non-imatrix comparison on a ~20 K-token corpus (not an absolute quality benchmark), and the MTP draft head is unaffected (it's dormant in a normal forward pass — speculation is output-lossless either way).
# 0) build an imatrix on the Q8 source — public calibration (Kalomaze groups_merged + froggeric code/technical),
# NOT DavidAU's private neo1-v2. Compute on the Q8; the result is portable to llama-quantize.
llama-imatrix -m Qwen3.6-40B-Deck-...-Q8_0.gguf -f calib-general+code+technical.txt -o deckard-40b.imatrix -c 512 -ngl 999
# 1) graft a 27B nextn head into the 40B Q8_0 (raw binary copy) via PiehSoft's inject_mtp_40b.py
python inject_mtp_40b.py \
--target Qwen3.6-40B-Deck-...-Q8_0.gguf \
--donor Qwen3.6-27B-MTP-BF16.gguf \
--output deckard-40b-mtp.gguf --source-layer 64 --dest-layer 96
# 2) ** GOTCHA ** the inject script adds nextn tensors but does NOT bump block_count.
# An injected MTP model needs block_count += nextn_predict_layers, or the loader
# looks for the head one slot too low (e.g. "missing tensor blk.95.nextn..."):
python gguf_set_metadata.py deckard-40b-mtp.gguf qwen35.block_count 97 --force
# 3) THE ONE BUILD (the ★ file): fast single-scale body + Q8 embeddings + Q6 head + imatrix
# — the best speed/quality balance (§04). --output-tensor-type q6_K raises the lm-head to Q6_K.
llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix deckard-40b.imatrix \
deckard-40b-mtp.gguf Qwen3.6-40B-Deckard-MTP-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf Q4_0_ROCMFP4_STRIX
Kernels: ~631 tensors on the fast single-scale q4_0_rocmfp4_fast bulk, ~122 on dual-scale q4_0_rocmfp4 (incl. the grafted head).
Why these are re-imatrixed. The base we quantize from is DavidAU's NEO-CODE Q8_0, and yes, that Q8 is itself a NEO imatrix GGUF (built with his neo1-v2 NEO imatrix — it's right in the file's metadata as quantize.imatrix.dataset = neo1-v2.txt). But that NEO imatrix does not help a 4-bit requant, for a simple reason:
- An imatrix only does real work at low bit-widths, where it decides which weights to protect when bits are scarce. At Q8_0 it is essentially inert: Q8 is near-lossless and barely uses importance weighting at all, so a "NEO Q8" and a plain Q8 are nearly the same weights.
- So the NEO benefit does not carry through when you requantize that Q8 down to 4-bit. The place an imatrix actually helps is during the 4-bit step — and our original ROCmFP4 files were quantized from the Q8 with no imatrix at that step.
The -imatrix variants fix exactly that: a fresh importance matrix, computed and applied during the 4-bit ROCmFP4 quantization (the bit-width where it counts). The calibration is not DavidAU's NEO dataset — his neo1-v2 is private and we do not have it. We used a standard, public mix instead:
- Kalomaze's
groups_merged(general) + froggeric'scode+technicalsets, all from thefroggeric/imatrixdataset. - This is the same lineage Bartowski-style imatrix quants are built on (Kalomaze's
groups_mergedis the root of Dampf'scalibration_datav3), just weighted a little more toward code since Deckard is a coder.
So the honest one-liner: these are imatrix quants of DavidAU's NEO-CODE Deckard — his NEO weights as the source, a public-calibration imatrix applied by us at the 4-bit step. We make no claim to his NEO recipe; the -imatrix name reflects what we added, not his brand.
Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.
This file is a stack of other people's work — credit where due. The MTP head is not native to the Deckard merge — it's grafted; the draft predictions are "borrowed" from a 27B and happen to align well, they were not trained on this model.
Derivative of the Deckard merge (Qwen3.6 lineage, Apache-2.0). Inherits the base model's terms — verify them before redistribution / use. The MTP graft and quant add no new restrictions.
- Downloads last month
- 7,614
We're not able to determine the quantization variants.