▗▇▇▇▇▇▇▇▖
▗█▘▝██████▖
▗▛ ▝██████▆▆▆▆▆▆▆▆▆▆▅
▟▛ ▗█████████████████▙▖
▄▄▄▄▄▟▛ ▟████████████████████▖
▗██▌ ▚▖ ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘
▗████▖ ▜▖ ▗█▘
▜█████▙ ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙
▜█████▙ ▝████████████▛ ▜▙
▜█████▙ ▝██████████▛ ▃ ▜▙
▀█████▙▖ ▝████████▘ ▟█▙ ▀▙
▝██████▖ ▝▜█████▘ ▟███▙▂▂▂▂▐█
▟███████▖ ▜███▘ ▗███████████▛
▟█████████▄ ▜▛ ▗███████████▀
▝█████▀ ▗▛ ▗██████▀▀▀▀▀▘
▜██▘ ▗▛ ▟█████▛▘
▜█▇▇▇▇▇▇▇▇▇█▖ ▟█████▛
▝█▖ ▟█████▛
▝███████▀
FORMAT ROCmFP4 4-BIT · STRIX |
PRECISION 4.82 BPW |
SIZE 16.9 GB |
CONTEXT 262 K |
DRAFT MTP n-max 5 |
EMBEDDINGS f16 · HEAD Q6_K |
BACKEND Vulkan0 · ROCm |
CALIBRATION imatrix (froggeric) |
This is a quantization. The model itself is Mia-AiLab/Qwable-3.6-27b-MTP — a full fine-tune of
qwen/Qwen3.6-27B on a cleaned Fable-5 reasoning/instruction dataset, with native MTP. Please ⭐ and follow the author: @MiaAI_lab. License inherited: MIT.
The custom
q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix:
git clone https://github.com/charlie12345/rocmfp4-llamacd rocmfp4-llama && git checkout mtp-rocmfp4-strixenv JOBS=16 scripts/build-strix-rocmfp4-mtp.sh
A 4-bit ROCmFP4 quant of Mia-AiLab/Qwable-3.6-27b-MTP (a Fable-5 reasoning/instruction fine-tune of qwen/Qwen3.6-27B), built on the PlunderStruck STRIX daily-driver recipe for AMD Strix Halo (Ryzen AI Max+ 395, gfx1151):
- Body —
Q4_0_ROCMFP4_STRIX: FP4 (E2M1) weights + UE4M3 microscale; the Strix quality/speed recipe (dual-scale ROCmFP4 on attention K/V, fast single-scale path on the large tensors). - Token embeddings — kept at f16 (the coherence lever that's actually felt).
- Output head — Q6_K.
- imatrix — importance matrix over the froggeric general corpus (
groups_merged+technical+code), applied to all quantizable tensors. - MTP — Mia's native multi-token-prediction head (
blk.64.nextn.*) is preserved → llama.cpp self-speculative decode, no separate draft model.
Bundled in this repo: chat_template.jinja (froggeric's unified Qwen3.6 template — tool calls + inline <|think_off|>/<|think_on|>), mmproj-F32.gguf (shared Qwen3-VL projector, projection_dim 5120 — vision, optional), and qwable-3.6-27b.imatrix for exact reproduction. Runs entirely on a single AMD APU — Vulkan/RADV or ROCm/HIP (gfx1151).
Run from the folder holding the .gguf + chat_template.jinja:
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
-m Qwable-3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf \
--alias qwable-mtp \
--host 0.0.0.0 \
--port 8080 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-c 262144 \
-b 2048 \
-ub 256 \
-t 16 \
-tb 16 \
-ctk f16 \
-ctv f16 \
-cpent 256 \
-ctxcp 32 \
--cache-reuse 256 \
--cache-ram 65536 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--spec-type draft-mtp \
--spec-draft-device Vulkan0 \
--spec-draft-ngl all \
--spec-draft-type-k f16 \
--spec-draft-type-v f16 \
--spec-draft-n-max 5 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.0 \
--spec-draft-p-split 0.10 \
--chat-template-file chat_template.jinja \
--reasoning on \
--reasoning-format deepseek \
--chat-template-kwargs '{"preserve_thinking": true}' \
--jinja \
--parallel 1 \
--metrics \
--no-mmap \
--mmproj mmproj-F32.gguf \
--image-min-tokens 1024
The last two lines enable vision via the bundled mmproj-F32.gguf; omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set.
Multi-turn prompt-cache reuse is what makes a 27B usable on one APU. Qwen3.6's recurrent (SSM) state can't be partially rewound, so multi-turn reuse needs a context checkpoint at/before the divergence point. Two defaults otherwise force a full re-prefill every turn — both fixed by the flags above:
- Checkpoint cadence. Default
-cpentis 8192, so prompts under 8K never get a usable checkpoint. Fix:-cpent 256 -ctxcp 32 --cache-reuse 256. - Thinking text breaking the prefix match. Use
--reasoning-format deepseek+--chat-template-kwargs '{"preserve_thinking": true}'so the template keeps<think>across all turns and reuse holds.
--jinja is required so the chat template (and preserve_thinking) apply. Point any OpenAI-compatible client (OpenCode, etc.) at the server; in single-model mode llama-server ignores the request's model field.
Credits
- Model: Mia-AiLab/Qwable-3.6-27b-MTP by Mia-AiLab (@MiaAI_lab) — Fable-5 reasoning fine-tune of
qwen/Qwen3.6-27B. All model credit is theirs. - Base: qwen/Qwen3.6-27B.
- ROCmFP4 kernels/format: charlie12345/rocmfp4-llama.
- Quant + packaging: PlunderStruck. License: MIT (inherited).
- Downloads last month
- 20
16-bit
Model tree for plunderstruck/Qwable-3.6-27B-MTP-ROCmFP4-GGUF
Base model
Mia-AiLab/Qwable-3.6-27b-MTP