PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWABLE-3.6-27B-MTP
4-BIT ROCmFP4 · imatrix + f16 EMBEDDINGS · Q6_K HEAD · MTP SELF-SPECULATIVE DECODE · SINGLE AMD APU
a ROCmFP4 quant of Mia-AiLab/Qwable-3.6-27b-MTP — a Fable-5 reasoning fine-tune of Qwen3.6-27B

    
      FORMAT
ROCmFP4 4-BIT · STRIX

      PRECISION
4.82 BPW

      SIZE
16.9 GB

      CONTEXT
262 K

    

      DRAFT
MTP n-max 5

      EMBEDDINGS
f16 · HEAD Q6_K

      BACKEND
Vulkan0 · ROCm

      CALIBRATION
imatrix (froggeric)

    

★ ORIGINAL MODEL — ALL CREDIT TO Mia-AiLab

This is a quantization. The model itself is Mia-AiLab/Qwable-3.6-27b-MTP — a full fine-tune of qwen/Qwen3.6-27B on a cleaned Fable-5 reasoning/instruction dataset, with native MTP. Please ⭐ and follow the author: @MiaAI_lab. License inherited: MIT.

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix:

git clone https://github.com/charlie12345/rocmfp4-llama

cd rocmfp4-llama && git checkout mtp-rocmfp4-strix

env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

NOTE // Ignore HuggingFace's auto-detected "F16" badge — its parser only knows standard GGUF quant types, can't read ROCmFP4, and "sees" only the genuinely-f16 token embeddings. This is a ~4.82 bpw 4-bit file; ~16.9 GB.

01 · WHAT THIS IS

A 4-bit ROCmFP4 quant of Mia-AiLab/Qwable-3.6-27b-MTP (a Fable-5 reasoning/instruction fine-tune of qwen/Qwen3.6-27B), built on the PlunderStruck STRIX daily-driver recipe for AMD Strix Halo (Ryzen AI Max+ 395, gfx1151):

Body — Q4_0_ROCMFP4_STRIX: FP4 (E2M1) weights + UE4M3 microscale; the Strix quality/speed recipe (dual-scale ROCmFP4 on attention K/V, fast single-scale path on the large tensors).
Token embeddings — kept at f16 (the coherence lever that's actually felt).
Output head — Q6_K.
imatrix — importance matrix over the froggeric general corpus (groups_merged + technical + code), applied to all quantizable tensors.
MTP — Mia's native multi-token-prediction head (blk.64.nextn.*) is preserved → llama.cpp self-speculative decode, no separate draft model.

Bundled in this repo: chat_template.jinja (froggeric's unified Qwen3.6 template — tool calls + inline <|think_off|>/<|think_on|>), mmproj-F32.gguf (shared Qwen3-VL projector, projection_dim 5120 — vision, optional), and qwable-3.6-27b.imatrix for exact reproduction. Runs entirely on a single AMD APU — Vulkan/RADV or ROCm/HIP (gfx1151).

02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwable-3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf \
  --alias qwable-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 262144 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision via the bundled mmproj-F32.gguf; omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set.

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	use the full 128 GB unified memory
`-dev Vulkan0 · -ngl 999 · -fa on`	Vulkan (KHR_coopmat) beats ROCm/HIP here · all layers · flash attention
`-c 262144`	context length (256K)
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch (256 is the optimum here) · CPU threads
`-ctk f16 · -ctv f16`	f16 KV cache; drop to `q8_0`/`q4_0` for less memory
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing (every 256 tok, keep 32, reuse ≥256-tok prefix) + 64 GB resident reuse cache
`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0`	Qwen3.6 "precise" sampling (temp 1.0 for general/creative)
`--spec-type draft-mtp · --spec-draft-n-max 5`	built-in MTP head, self-speculative; draft depth 5
`--spec-draft-device Vulkan0 · -ngl all · type-k/v f16`	draft head on Vulkan, fully offloaded, f16 draft KV
`--chat-template-file chat_template.jinja`	bundled froggeric template (tool calls + think-toggle)
`--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}`	clean `content` + `reasoning_content`; keep `<think>` across turns so cross-turn cache survives
`--mmproj mmproj-F32.gguf · --image-min-tokens 1024`	vision via the bundled Qwen3-VL projector (omit for text-only)

03 · CODING AGENT / CROSS-TURN CACHE

Multi-turn prompt-cache reuse is what makes a 27B usable on one APU. Qwen3.6's recurrent (SSM) state can't be partially rewound, so multi-turn reuse needs a context checkpoint at/before the divergence point. Two defaults otherwise force a full re-prefill every turn — both fixed by the flags above:

Checkpoint cadence. Default -cpent is 8192, so prompts under 8K never get a usable checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256.
Thinking text breaking the prefix match. Use --reasoning-format deepseek + --chat-template-kwargs '{"preserve_thinking": true}' so the template keeps <think> across all turns and reuse holds.

--jinja is required so the chat template (and preserve_thinking) apply. Point any OpenAI-compatible client (OpenCode, etc.) at the server; in single-model mode llama-server ignores the request's model field.

Credits

Model: Mia-AiLab/Qwable-3.6-27b-MTP by Mia-AiLab (@MiaAI_lab) — Fable-5 reasoning fine-tune of qwen/Qwen3.6-27B. All model credit is theirs.
Base: qwen/Qwen3.6-27B.
ROCmFP4 kernels/format: charlie12345/rocmfp4-llama.
Quant + packaging: PlunderStruck. License: MIT (inherited).

Downloads last month: 20

GGUF

Model size

0.5B params

Architecture

clip

Hardware compatibility

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwable-3.6-27B-MTP-ROCmFP4-GGUF

Base model

Mia-AiLab/Qwable-3.6-27b-MTP

Quantized

(1)

this model

FORMAT ROCmFP4 4-BIT · STRIX	PRECISION 4.82 BPW	SIZE 16.9 GB	CONTEXT 262 K
DRAFT MTP n-max 5	EMBEDDINGS f16 · HEAD Q6_K	BACKEND Vulkan0 · ROCm	CALIBRATION imatrix (froggeric)