Instructions to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF",
	filename="Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
# Run inference directly in the terminal:
./llama-cli -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Use Docker

docker model run hf.co/maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

LM Studio
Jan
Ollama
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Ollama:
```
ollama run hf.co/maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
```

Unsloth Studio

How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF to start chatting

How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Run Hermes

hermes

Atomic Chat new

OpenClaw new

How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Docker Model Runner:
```
docker model run hf.co/maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF
```

Lemonade

How to use maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull maczzzzzz/Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF

Run and chat with the model

lemonade run user.Qwen3.6-27B-MTP-ROCmFP4_FAST-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

Qwen3.6-27B-MTP ROCmFP4_FAST — GGUF

ROCmFP4_FAST quant of Qwen/Qwen3.6-27B (Apache 2.0), produced via charlie12345/ROCmFPX. Benchmarked against the mesh's agent-production regression suite on an RDNA4 RX 9060 XT (16 GB). Context ceiling: 131k at q4_0 KV with flat ~17.2 t/s throughput. Cross-format parity with TQ3_4S (Blackwell) on IFEval: both within 2pp.

MTP speculative decoding is a massive win on this card class: +91% throughput (32.93 t/s) at 64k q4_0 KV. The full model + MTP draft head fits in VRAM at 64k; no PCIe spill.

File

File	Size	Quant	BPW
`Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf`	14.5 GB	Q4_0_ROCMFP4_FAST	~4.25 bpw

NOT a stock llama.cpp quant

ROCmFP4_FAST is a custom weight format from charlie12345/ROCmFPX (Q4_0_ROCMFP4 preset). Stock llama.cpp will exit with unknown quantization at load time. The system_fingerprint of a correctly-served ROCmFPX GGUF is b1-11d76c2 — a different fingerprint means the wrong binary loaded the file.

Scope of these benchmarks — read this first

These numbers are a light baseline, not a thorough quant evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

Harness scope is bounded. The numbers below come from llama-bench (KV context ladder), lm-eval-harness IFEval n=50, and Hermes ctx_scaling_bench (KV precision sweep). That's a regression suite, not a quality benchmark.
Sample sizes are small. Throughput numbers are single-rep on a single GPU. IFEval is n=50. None are powered for multi-seed significance.
No perplexity / wikitext / MMLU. Those are upstream's territory. For a rigorous view, see charlie12345/ROCmFPX's own validation ladder.
Single GPU class (RDNA4 16 GB). All measurements on an RX 9060 XT (gfx1201), ROCm 7.2.3. No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan. Cross-hardware generalization is NOT implied. The companion TQ3_4S quant for Blackwell is in a separate repo.
No human eval. "IFEval parity and flat throughput" is not a quality verdict on this specific quant for every use case.

What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

For a rigorous view, see Qwen/Qwen3.6-27B (parent model), charlie12345/ROCmFPX (quantizer), and the meshina-benches repo for the full raw bench reports.

What we measured

Context ceiling (q4_0 KV, AMD RDNA4 RX 9060 XT)

Throughput is flat across the entire tested range. KV precision affects VRAM, not token speed.

KV type	Ctx	TG tok/s	PP tok/s	VRAM (MiB)	Status
f16	32768	17.7	102.7	16174	OK
q4_0	65536	17.26	100.54	15282	OK
q4_0	98304	17.22	100.11	15857	OK
q4_0	131072	17.23	99.93	16282	OK
q8_0	65536	17.36	101.18	16236	OK

Ceilings: f16 KV caps at 32k, q8_0 at 64k, q4_0 reaches 131k (OOM at 152k — <200 MiB headroom). Asymmetric q8-K + q4-V fails at all ctx sizes.

IFEval n=50 — ROCmFP4_FAST (RDNA4)

Metric	ROCmFP4_FAST (AMD)	TQ3_4S (Blackwell)	Δ
prompt_level_loose	0.32 ± 0.067	0.34 ± 0.068	-0.02
inst_level_loose	0.487	0.474	+0.013
inst_level_strict	0.487	0.461	+0.026

Cross-format parity holds within 2pp on all metrics.

MTP speculative decoding — the RDNA4 win

MTP is a massive throughput win on AMD RDNA4. The full model + 3 GB MTP draft head fits in VRAM at 64k, enabling on-GPU draft verification with no PCIe spill.

Config	Ctx	TG t/s	vs MTP-OFF
MTP-OFF	32k-131k	17.2	baseline
MTP-ON n_max=3 (recommended)	64k	32.93	+91%
MTP-ON n_max=6	32k	28.92	+68% (worse)

Recommendation for AMD RDNA4: use MTP-ON with n_max=3 for contexts ≤64k. For >64k, drop to MTP-OFF (q4_0 KV reaches 131k).

Critical: Do NOT set GGML_HIP_ENABLE_UNIFIED_MEMORY=1 on discrete AMD cards. Charlie's ROCmFPX scripts default to this flag (correct for Strix Halo), but on discrete RDNA4 it moves the ENTIRE model to system RAM (30× regression — 0.97 t/s). The ROCmFP4_FAST quant + draft head fits in 16 GB VRAM at 64k without unified memory.

GSM8K note

AMD ROCmFP4_FAST scores 0.02 strict (1/50) on GSM8K. This is a genuine quality floor on the 27B model at this bit depth on 16 GB VRAM — the model is too tight for sustained eval without prompt cache operations crashing. Do not use this quant for math reasoning on 16 GB cards. The Blackwell TQ3_4S companion achieves 0.955 GSM8K at the same bit depth.

Quick start

# Build charlie12345/ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
make -j$(nproc)

# Serve (MTP-OFF, direct decode)
./bin/llama-server \
  -m /path/to/Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf \
  --port 8081 \
  -ngl 99 \
  -c 32768 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --cache-ram 0 \
  --no-cache-prompt

# Serve (MTP-ON, n_max=3 — recommended for ≤64k ctx)
./bin/llama-server \
  -m /path/to/Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf \
  --port 8081 \
  -ngl 99 \
  -c 65536 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --speculative-model m \
  --spec-draft-n-max 3 \
  --cache-ram 0 \
  --no-cache-prompt

Do NOT set GGML_HIP_ENABLE_UNIFIED_MEMORY=1 in the environment on discrete AMD cards.

Reproduce the quant

# From unsloth/Qwen3.6-27B-MTP-GGUF BF16 source (SHA256-validated, single-step)
/path/to/llama-quantize \
  --allow-requantize \
  /path/to/Qwen3.6-27B-MTP-BF16.gguf \
  /path/to/Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf \
  Q4_0_ROCMFP4_FAST

Files in this repo

File	Purpose
`Qwen3.6-27B-MTP-ROCmFP4_FAST.gguf`	The quantized model (LFS-tracked, 14.5 GB)
`README.md`	This file

Full raw bench reports, summary markdowns, and reproduction scripts are at github.com/maczzgit/meshina-benches in raw/benchmarks/2026-07-04-context-push-and-parity-v2/.

What's NOT in this repo (caveats)

Stock llama.cpp will not load this file. ROCmFP4_FAST is a custom weight format unique to charlie12345/ROCmFPX. Use that fork's llama-server.
No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200). Vulkan path on RDNA4 has a known upstream regression in charlie12345/ROCmFPX — we did not test it.
131k ctx is the practical ceiling on this hardware due to VRAM. 256K requires larger VRAM or smaller model.
GSM8K is near-zero (0.02). The 27B model at 4-bit is too tight on 16 GB RDNA4 for sustained math reasoning. Use the companion TQ3_4S Blackwell quant for math tasks.
No vision/multimodal test. This variant is text-only.
No quality benchmark (perplexity, MMLU). The quant passes IFEval parity; whether it's "the best ROCmFP4 quant" needs upstream validation.
16 GB minimum VRAM. Does not fit on smaller cards. The mesh's 16 GB card runs it with ~150 MiB headroom at 131k.

Provenance

Source model: Qwen/Qwen3.6-27B (Apache 2.0)
Intermediate GGUF: unsloth/Qwen3.6-27B-MTP-GGUF (SHA256-verified across nodes, single-step quant)
Quantizer: charlie12345/ROCmFPX commit 5b39566, preset Q4_0_ROCMFP4_FAST
Build hardware: Node B — RDNA4 RX 9060 XT 16 GB (gfx1201), ROCm 7.2.3, NixOS
Bench harnesses: llama-bench (context ladder), lm-eval-harness (IFEval n=50), ctx_scaling_bench (KV precision sweep)
Bench report: meshina-benches/2026-07-04-context-push-and-parity-v2

License

The model weights are derived from Qwen/Qwen3.6-27B (Apache 2.0). The ROCmFP4 quant format is provided by charlie12345/ROCmFPX (MIT). This repo is a derivative quant — the Apache 2.0 license of the parent applies to the model weights; the quantizer tooling is separately licensed.