Instructions to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF", filename="Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF # Run inference directly in the terminal: ./llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF # Run inference directly in the terminal: ./build/bin/llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Use Docker
docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
- LM Studio
- Jan
- Ollama
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Ollama:
ollama run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
- Unsloth Studio
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting
- Pi
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Docker Model Runner:
docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
- Lemonade
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Run and chat with the model
lemonade run user.Qwen3-Coder-Next-ROCmFP4-GGUF-{{QUANT_TAG}}List all available models
lemonade list
llm.create_chat_completion(
messages = "No input example has been defined for this model task."
) โโโโโโโโโ
โโโโโโโโโโโ
โโ โโโโโโโโโโโโโโโโโโ
โโ โโโโโโโโโโโโโโโโโโโโ
โโโโโโโ โโโโโโโโโโโโโโโโโโโโโโ
โโโโ โโ โโโโโโโโโโโโโโโโโโโโโโ
โโโโโโ โโ โโโ
โโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโ
โโโโโโโ โโโโโโโโโโโโโโ โโ
โโโโโโโ โโโโโโโโโโโโ โ โโ
โโโโโโโโ โโโโโโโโโโ โโโ โโ
โโโโโโโโ โโโโโโโโ โโโโโโโโโโโ
โโโโโโโโโ โโโโโ โโโโโโโโโโโโโ
โโโโโโโโโโโ โโ โโโโโโโโโโโโโ
โโโโโโโ โโ โโโโโโโโโโโโโ
โโโโ โโ โโโโโโโโ
โโโโโโโโโโโโโ โโโโโโโ
โโโ โโโโโโโ
โโโโโโโโโ
FORMAT ROCmFP4 4-BIT |
PRECISION ~4.5 BPW |
ARCH QWEN3NEXT |
CONTEXT 262 K |
PARAMS 80B ยท A3B MoE |
DRAFT NO MTP |
BACKEND VULKAN0 |
LICENSE APACHE-2.0 |
The custom
q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama ยท branch mtp-rocmfp4-strix.
Experimental AMD Strix Halo (gfx1151) quant of Qwen3-Coder-Next โ Qwen's agentic coding model (80B total / 3B active high-sparsity MoE, hybrid Gated-DeltaNet attention, arch qwen3next, 262K context) โ in the custom ROCmFP4 4-bit format, imatrix-quantized with a code-weighted importance matrix.
One file โ the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt โ Q8 token embeddings (matching the Q8 source exactly) and a Q6_K output head โ on the fast single-scale q4_0_rocmfp4_fast body + a code-weighted imatrix. Not the most faithful possible (see the fidelity link in ยง04) โ it's the point where speed and quality meet best. The DeltaNet-specific tensors (ssm_conv1d, ssm_a, norms, router) stay F32; MoE experts + attention/SSM projections are 4-bit ROCmFP4.
Run from the folder holding the .gguf (the Qwen ChatML template is baked in โ just pass --jinja):
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
-m Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
--alias coder-next \
--host 0.0.0.0 \
--port 8080 \
-c 262144 \
-ctk q8_0 \
-ctv q8_0 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-b 2048 \
-ub 256 \
-t 16 \
-tb 16 \
-cpent 256 \
-ctxcp 32 \
--cache-reuse 256 \
--cache-ram 65536 \
--jinja \
--parallel 1 \
--metrics \
--no-mmap
--spec-* / --spec-type draft-mtp flags โ this arch has no MTP head (see ยง04). It's already fast on its own.
Qwen3-Coder-Next is an agentic coder โ built to call tools, not narrate code. To wire it up:
- Chat template: Qwen (ChatML) is baked into the GGUF โ just pass
--jinjaand your client applies it automatically. - Tool calling: enable the
qwen3_codertool-call parser in your client (e.g. the matching parser flag in llama-server / your agent harness). Without it, native tool calls won't be parsed and the model tends to narrate code instead of calling tools. - Sampling: temp
0.7, top-p0.8, top-k20(Qwen-Coder recommended) โ already set in ยง02.
--cache-reuse / --cache-ram) keeps long agentic sessions cheap โ the leading prompt isn't re-prefilled every turn.
This is the best speed/quality balance in ROCmFP4 โ by design, not the absolute fastest. On top of the imatrix + Q8 emb + Q6 head, we swept the body kernel against the Q8 source by KL divergence (the right fidelity metric). An all-dual-scale body did edge the fast single-scale body on KL, but the gain sat inside the measurement noise while costing decode speed โ so the fast single-scale body + Q8 embeddings + Q6 head is the right point, and the one file we ship.
This mirrors the fuller sweep on our Qwen3.6-27B sibling, where every higher-precision body lever (all-dual-scale, selective Q5/Q6 bumps) bought a KL improvement inside the noise at a real speed cost โ and where copying an entire dynamic-quant high-precision allocation onto ROCmFP4 still couldn't match a true dynamic K-quant, because FP4 is intrinsically less faithful than Q4_K's 4-bit. The same format limit applies here: within ROCmFP4, fast body + Q8 emb + Q6 head is the optimal balance; for maximum fidelity reach for a dynamic K-quant of the base (box below). (Directional internal measurements โ KL vs Q8 on held-out code; reproduce before citing.)
Fast even without speculative decoding. 3B active params + linear Gated-DeltaNet attention โ ~54 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0), and cheap long context. No MTP needed.
qwen35/qwen35moe archs, not qwen3next. So these are no-MTP (non-speculative) builds โ in practice it doesn't matter, it's fast on its own.
The imatrix โ code-weighted, and measured (a clean win here). Quantized with an importance matrix built from a code-weighted calibration mix (~2.6:1 code:general): real multi-language source + code-analysis prompts from eaddario/imatrix-calibration, plus Kalomaze's groups_merged (via froggeric/imatrix) for general.
KL-divergence + perplexity vs the Q8 reference on a held-out code slice (disjoint from calibration), imatrix vs no-imatrix:
So the imatrix measurably improves quantization fidelity to the full model on code (median KL โ20%, the gold-standard metric), at zero cost (same size/speed). PPL is a statistical wash. Honest scope: this is a fidelity-vs-Q8 measurement on ~20 K tokens of held-out code, not an absolute coding benchmark.
# code-weighted imatrix on the Q8 (single pass; ratio = the real lever)
llama-imatrix -m Qwen3-Coder-Next-Q8_0.gguf -f code-weighted-calib.txt -o coder-next.imatrix -c 512 -ngl 999
# quant -> ROCmFP4 with the imatrix (Q8 embeddings) + Q6 output head โ the โ
file (ยง01)
# fast single-scale body; --output-tensor-type q6_K raises the output head to Q6_K
llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix coder-next.imatrix \
Qwen3-Coder-Next-Q8_0.gguf Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf Q4_0_ROCMFP4_STRIX
Experimental research build for AMD Strix Halo โ hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.
Derivative quantization โ verify the base model's license before redistribution / use.
- Downloads last month
- 1,435
We're not able to determine the quantization variants.
Model tree for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Base model
Qwen/Qwen3-Coder-Next
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF", filename="Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf", )