Instructions to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF", filename="Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF # Run inference directly in the terminal: llama cli -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF # Run inference directly in the terminal: llama cli -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF # Run inference directly in the terminal: ./llama-cli -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF # Run inference directly in the terminal: ./build/bin/llama-cli -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Use Docker
docker model run hf.co/maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
- LM Studio
- Jan
- Ollama
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Ollama:
ollama run hf.co/maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
- Unsloth Studio
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF to start chatting
- Pi
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Run Hermes
hermes
- Atomic Chat new
- OpenClaw new
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with OpenClaw:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Configure OpenClaw
# Install OpenClaw: npm install -g openclaw@latest # Register the local server and set it as the default model: openclaw onboard --non-interactive --mode local \ --auth-choice custom-api-key \ --custom-base-url http://127.0.0.1:8080/v1 \ --custom-model-id "maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF" \ --custom-provider-id llama-cpp \ --custom-compatibility openai \ --custom-text-input \ --accept-risk \ --skip-health
Run OpenClaw
openclaw agent --local --agent main --message "Hello from Hugging Face"
- Docker Model Runner
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Docker Model Runner:
docker model run hf.co/maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
- Lemonade
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Run and chat with the model
lemonade run user.Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF-{{QUANT_TAG}}List all available models
lemonade list
Qwen3.6-27B-MTP ROCmFP4_FAST — GGUF
ROCmFP4_FAST quant of Qwen/Qwen3.6-27B (Apache 2.0), produced via charlie12345/ROCmFPX. Benchmarked against the mesh's agent-production regression suite on an RDNA4 RX 9060 XT (16 GB). Context ceiling: 131k at q4_0 KV with flat ~17.2 t/s throughput. Cross-format parity with TQ3_4S (Blackwell) on IFEval: both within 2pp.
MTP speculative decoding is a massive win on this card class: +91% throughput (32.93 t/s) at 64k q4_0 KV. The full model + MTP draft head fits in VRAM at 64k; no PCIe spill.
File
| File | Size | Quant | BPW |
|---|---|---|---|
Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf |
14.5 GB | Q4_0_ROCMFP4_FAST | ~4.25 bpw |
NOT a stock llama.cpp quant
ROCmFP4_FAST is a custom weight format from charlie12345/ROCmFPX (Q4_0_ROCMFP4 preset). Stock llama.cpp will exit with unknown quantization at load time. The system_fingerprint of a correctly-served ROCmFPX GGUF is b1-11d76c2 — a different fingerprint means the wrong binary loaded the file.
Scope of these benchmarks — read this first
These numbers are a light baseline, not a thorough quant evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:
- Harness scope is bounded. The numbers below come from
llama-bench(KV context ladder),lm-eval-harnessIFEval n=50, and Hermesctx_scaling_bench(KV precision sweep). That's a regression suite, not a quality benchmark. - Sample sizes are small. Throughput numbers are single-rep on a single GPU. IFEval is n=50. None are powered for multi-seed significance.
- No perplexity / wikitext / MMLU. Those are upstream's territory. For a rigorous view, see charlie12345/ROCmFPX's own validation ladder.
- Single GPU class (RDNA4 16 GB). All measurements on an RX 9060 XT (gfx1201), ROCm 7.2.3. No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan. Cross-hardware generalization is NOT implied. The companion TQ3_4S quant for Blackwell is in a separate repo.
- No human eval. "IFEval parity and flat throughput" is not a quality verdict on this specific quant for every use case.
What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.
For a rigorous view, see Qwen/Qwen3.6-27B (parent model), charlie12345/ROCmFPX (quantizer), and the meshina-benches repo for the full raw bench reports.
What we measured
Context ceiling (q4_0 KV, AMD RDNA4 RX 9060 XT)
Throughput is flat across the entire tested range. KV precision affects VRAM, not token speed.
| KV type | Ctx | TG tok/s | PP tok/s | VRAM (MiB) | Status |
|---|---|---|---|---|---|
| f16 | 32768 | 17.7 | 102.7 | 16174 | OK |
| q4_0 | 65536 | 17.26 | 100.54 | 15282 | OK |
| q4_0 | 98304 | 17.22 | 100.11 | 15857 | OK |
| q4_0 | 131072 | 17.23 | 99.93 | 16282 | OK |
| q8_0 | 65536 | 17.36 | 101.18 | 16236 | OK |
Ceilings: f16 KV caps at 32k, q8_0 at 64k, q4_0 reaches 131k (OOM at 152k — <200 MiB headroom). Asymmetric q8-K + q4-V fails at all ctx sizes.
IFEval n=50 — ROCmFP4_FAST (RDNA4)
| Metric | ROCmFP4_FAST (AMD) | TQ3_4S (Blackwell) | Δ |
|---|---|---|---|
| prompt_level_loose | 0.32 ± 0.067 | 0.34 ± 0.068 | -0.02 |
| inst_level_loose | 0.487 | 0.474 | +0.013 |
| inst_level_strict | 0.487 | 0.461 | +0.026 |
Cross-format parity holds within 2pp on all metrics.
MTP speculative decoding — the RDNA4 win
MTP is a massive throughput win on AMD RDNA4. The full model + 3 GB MTP draft head fits in VRAM at 64k, enabling on-GPU draft verification with no PCIe spill.
| Config | Ctx | TG t/s | vs MTP-OFF |
|---|---|---|---|
| MTP-OFF | 32k-131k | 17.2 | baseline |
| MTP-ON n_max=3 (recommended) | 64k | 32.93 | +91% |
| MTP-ON n_max=6 | 32k | 28.92 | +68% (worse) |
Recommendation for AMD RDNA4: use MTP-ON with n_max=3 for contexts ≤64k. For >64k, drop to MTP-OFF (q4_0 KV reaches 131k).
Critical: Do NOT set GGML_HIP_ENABLE_UNIFIED_MEMORY=1 on discrete AMD cards. Charlie's ROCmFPX scripts default to this flag (correct for Strix Halo), but on discrete RDNA4 it moves the ENTIRE model to system RAM (30× regression — 0.97 t/s). The ROCmFP4_FAST quant + draft head fits in 16 GB VRAM at 64k without unified memory.
GSM8K note
AMD ROCmFP4_FAST scores 0.02 strict (1/50) on GSM8K. This is a genuine quality floor on the 27B model at this bit depth on 16 GB VRAM — the model is too tight for sustained eval without prompt cache operations crashing. Do not use this quant for math reasoning on 16 GB cards. The Blackwell TQ3_4S companion achieves 0.955 GSM8K at the same bit depth.
Quick start
# Build charlie12345/ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
make -j$(nproc)
# Serve (MTP-OFF, direct decode)
./bin/llama-server \
-m /path/to/Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf \
--port 8081 \
-ngl 99 \
-c 32768 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--cache-ram 0 \
--no-cache-prompt
# Serve (MTP-ON, n_max=3 — recommended for ≤64k ctx)
./bin/llama-server \
-m /path/to/Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf \
--port 8081 \
-ngl 99 \
-c 65536 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--speculative-model m \
--spec-draft-n-max 3 \
--cache-ram 0 \
--no-cache-prompt
Do NOT set GGML_HIP_ENABLE_UNIFIED_MEMORY=1 in the environment on discrete AMD cards.
Reproduce the quant
# From unsloth/Qwen3.6-27B-MTP-GGUF BF16 source (SHA256-validated, single-step)
/path/to/llama-quantize \
--allow-requantize \
/path/to/Qwen3.6-27B-MTP-BF16.gguf \
/path/to/Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf \
Q4_0_ROCMFP4_FAST
Files in this repo
| File | Purpose |
|---|---|
Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf |
The quantized model (LFS-tracked, 14.5 GB) |
README.md |
This file |
Full raw bench reports, summary markdowns, and reproduction scripts are at github.com/maczzgit/meshina-benches in raw/benchmarks/2026-07-04-context-push-and-parity-v2/.
What's NOT in this repo (caveats)
- Stock llama.cpp will not load this file. ROCmFP4_FAST is a custom weight format unique to charlie12345/ROCmFPX. Use that fork's
llama-server. - No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200). Vulkan path on RDNA4 has a known upstream regression in charlie12345/ROCmFPX — we did not test it.
- 131k ctx is the practical ceiling on this hardware due to VRAM. 256K requires larger VRAM or smaller model.
- GSM8K is near-zero (0.02). The 27B model at 4-bit is too tight on 16 GB RDNA4 for sustained math reasoning. Use the companion TQ3_4S Blackwell quant for math tasks.
- No vision/multimodal test. This variant is text-only.
- No quality benchmark (perplexity, MMLU). The quant passes IFEval parity; whether it's "the best ROCmFP4 quant" needs upstream validation.
- 16 GB minimum VRAM. Does not fit on smaller cards. The mesh's 16 GB card runs it with ~150 MiB headroom at 131k.
Provenance
- Source model: Qwen/Qwen3.6-27B (Apache 2.0)
- Intermediate GGUF: unsloth/Qwen3.6-27B-MTP-GGUF (SHA256-verified across nodes, single-step quant)
- Quantizer: charlie12345/ROCmFPX commit
5b39566, presetQ4_0_ROCMFP4_FAST - Build hardware: Node B — RDNA4 RX 9060 XT 16 GB (gfx1201), ROCm 7.2.3, NixOS
- Bench harnesses:
llama-bench(context ladder),lm-eval-harness(IFEval n=50),ctx_scaling_bench(KV precision sweep) - Bench report: meshina-benches/2026-07-04-context-push-and-parity-v2
License
The model weights are derived from Qwen/Qwen3.6-27B (Apache 2.0). The ROCmFP4 quant format is provided by charlie12345/ROCmFPX (MIT). This repo is a derivative quant — the Apache 2.0 license of the parent applies to the model weights; the quantizer tooling is separately licensed.
- Downloads last month
- -
We're not able to determine the quantization variants.
Model tree for maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
Base model
Qwen/Qwen3.6-27B