PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWABLE-3.6-27B-MTP
4-BIT ROCmFP4 · imatrix + f16 EMBEDDINGS · Q6_K HEAD · MTP SELF-SPECULATIVE DECODE · SINGLE AMD APU
a ROCmFP4 quant of Mia-AiLab/Qwable-3.6-27b-MTP — a Fable-5 reasoning fine-tune of Qwen3.6-27B
FORMAT
ROCmFP4 4-BIT · STRIX
PRECISION
4.82 BPW
SIZE
16.9 GB
CONTEXT
262 K
DRAFT
MTP n-max 5
EMBEDDINGS
f16 · HEAD Q6_K
BACKEND
Vulkan0 · ROCm
CALIBRATION
imatrix (froggeric)
★ ORIGINAL MODEL — ALL CREDIT TO Mia-AiLab
This is a quantization. The model itself is Mia-AiLab/Qwable-3.6-27b-MTP — a full fine-tune of qwen/Qwen3.6-27B on a cleaned Fable-5 reasoning/instruction dataset, with native MTP. Please ⭐ and follow the author: @MiaAI_lab. License inherited: MIT.
⚠ REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix:

git clone https://github.com/charlie12345/rocmfp4-llama
cd rocmfp4-llama && git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh
NOTE // Ignore HuggingFace's auto-detected "F16" badge — its parser only knows standard GGUF quant types, can't read ROCmFP4, and "sees" only the genuinely-f16 token embeddings. This is a ~4.82 bpw 4-bit file; ~16.9 GB.
01 · WHAT THIS IS

A 4-bit ROCmFP4 quant of Mia-AiLab/Qwable-3.6-27b-MTP (a Fable-5 reasoning/instruction fine-tune of qwen/Qwen3.6-27B), built on the PlunderStruck STRIX daily-driver recipe for AMD Strix Halo (Ryzen AI Max+ 395, gfx1151):

  • BodyQ4_0_ROCMFP4_STRIX: FP4 (E2M1) weights + UE4M3 microscale; the Strix quality/speed recipe (dual-scale ROCmFP4 on attention K/V, fast single-scale path on the large tensors).
  • Token embeddings — kept at f16 (the coherence lever that's actually felt).
  • Output headQ6_K.
  • imatrix — importance matrix over the froggeric general corpus (groups_merged + technical + code), applied to all quantizable tensors.
  • MTP — Mia's native multi-token-prediction head (blk.64.nextn.*) is preserved → llama.cpp self-speculative decode, no separate draft model.

Bundled in this repo: chat_template.jinja (froggeric's unified Qwen3.6 template — tool calls + inline <|think_off|>/<|think_on|>), mmproj-F32.gguf (shared Qwen3-VL projector, projection_dim 5120 — vision, optional), and qwable-3.6-27b.imatrix for exact reproduction. Runs entirely on a single AMD APU — Vulkan/RADV or ROCm/HIP (gfx1151).

02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwable-3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf \
  --alias qwable-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 262144 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision via the bundled mmproj-F32.gguf; omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set.

Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1use the full 128 GB unified memory
-dev Vulkan0 · -ngl 999 · -fa onVulkan (KHR_coopmat) beats ROCm/HIP here · all layers · flash attention
-c 262144context length (256K)
-b 2048 · -ub 256 · -t/-tb 16prefill batch / micro-batch (256 is the optimum here) · CPU threads
-ctk f16 · -ctv f16f16 KV cache; drop to q8_0/q4_0 for less memory
-cpent · -ctxcp · --cache-reuse · --cache-ram 65536cross-turn KV checkpointing (every 256 tok, keep 32, reuse ≥256-tok prefix) + 64 GB resident reuse cache
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0Qwen3.6 "precise" sampling (temp 1.0 for general/creative)
--spec-type draft-mtp · --spec-draft-n-max 5built-in MTP head, self-speculative; draft depth 5
--spec-draft-device Vulkan0 · -ngl all · type-k/v f16draft head on Vulkan, fully offloaded, f16 draft KV
--chat-template-file chat_template.jinjabundled froggeric template (tool calls + think-toggle)
--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}clean content + reasoning_content; keep <think> across turns so cross-turn cache survives
--mmproj mmproj-F32.gguf · --image-min-tokens 1024vision via the bundled Qwen3-VL projector (omit for text-only)
03 · CODING AGENT / CROSS-TURN CACHE

Multi-turn prompt-cache reuse is what makes a 27B usable on one APU. Qwen3.6's recurrent (SSM) state can't be partially rewound, so multi-turn reuse needs a context checkpoint at/before the divergence point. Two defaults otherwise force a full re-prefill every turn — both fixed by the flags above:

  1. Checkpoint cadence. Default -cpent is 8192, so prompts under 8K never get a usable checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256.
  2. Thinking text breaking the prefix match. Use --reasoning-format deepseek + --chat-template-kwargs '{"preserve_thinking": true}' so the template keeps <think> across all turns and reuse holds.

--jinja is required so the chat template (and preserve_thinking) apply. Point any OpenAI-compatible client (OpenCode, etc.) at the server; in single-model mode llama-server ignores the request's model field.

Credits

Downloads last month
20
GGUF
Model size
0.5B params
Architecture
clip
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwable-3.6-27B-MTP-ROCmFP4-GGUF

Quantized
(1)
this model