Instructions to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="plunderstruck/Nex-N2-mini-ROCmFP4-GGUF",
	filename="Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
llama-cli -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./llama-cli -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16

Use Docker

docker model run hf.co/plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16

LM Studio
Jan
Ollama
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Ollama:
```
ollama run hf.co/plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
```

Unsloth Studio

How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Nex-N2-mini-ROCmFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Nex-N2-mini-ROCmFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/Nex-N2-mini-ROCmFP4-GGUF to start chatting

How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
```

Lemonade

How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16

Run and chat with the model

lemonade run user.Nex-N2-mini-ROCmFP4-GGUF-BF16

List all available models

lemonade list

PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
███████▆▄▁     ██████▎▐██████████████▖▀██████▙      ▟██████▘
██████████▆▃▁  ██████▎▐███████████████▙▝▜██████▖  ▗██████▛  
█████████████▆ ██████▎▐████▀▀▀▀▀▀▀▀▀▀▀▀▀ ▀██████▙▟██████▘   
██████▏▀▜█████ ██████▎▐████▃▃▃▃▃▃▃▃▃▃▃▃▃▖ ▔▜██████▘ ▀█▛     
██████▏▗▂▔▀▜██ ██████▎▐█████████████████▍   ▝███▛▁▟▙        
██████▏▐█▇▅▂▔▀ ██████▎▐█████████████████▍    ▔▀▘▃████▖      
██████▏▐████▇▄▂██████▎▐████▀▀▀▀▀▀▀▀▀▀▀▀▀  ▗▟█▖▁▟██████▙▁    
██████▏▝█████████████▎▐████▅▅▅▅▅▅▅▅▅▅▅▅▅ ▄██████▛▜██████▖   
██████▏  ▝▜██████████▎▐███████████████▛▗▟██████▘  ▝██████▙▁ 
██████▏     ▀▜███████▎▐██████████████▘▄██████▛      ▜██████▖
NEX-N2-MINI
4-BIT ROCmFP4 · CODE-WEIGHTED IMATRIX · HIGH-SPARSITY MoE (3B ACTIVE) · AGENTIC CODER · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
~4.5 BPW

      SIZE
18.4 GB

      CONTEXT
131 K

    

      ARCH
qwen35moe

      PARAMS
35B / 3B ACTIVE

      BACKEND
VULKAN0

      LICENSE
APACHE-2.0

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama · branch mtp-rocmfp4-strix.

NOTE // Ignore HuggingFace's auto-detected "F16" badge — its parser can't read ROCmFP4 and mislabels by the f16 embeddings. These are ~4.5 bpw 4-bit files; pick by filename.

01 · FILES

File	Size	Output head	Pick if
`…-STRIX-embF16-imatrix-headQ6.gguf` ★	18.4 GB	Q6_K	the one build — best speed/quality balance: f16 embeddings + Q6 output head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt — genuine f16 token embeddings (from BF16) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + the code-weighted imatrix (see §04). Not the leanest-fastest possible (a 4-bit output head squeezes out a few more tok/s, at a fidelity cost), and not the most faithful possible (see the base-model fidelity link in §04) — it's the point where speed and quality meet best. The Qwen (ChatML) chat template is baked into the GGUF — just pass --jinja.

02 · QUICK START

Run from the folder holding the .gguf:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf \
  --alias nex-n2-mini \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 131072 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap

NOTE // No --spec-* / --spec-type draft-mtp flags — Nex-N2-mini ships without an MTP head (non-speculative). At ~72 t/s it doesn't need speculative decoding to be quick.

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 131072`	context length (128K)
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch · CPU threads
`-ctk f16 · -ctv f16`	f16 KV cache — how we run it; drop to `q8_0`/`q4_0` to use less memory
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing + 64 GB resident reuse cache
`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0`	base-model recommended sampling
`--jinja --parallel 1 --metrics --no-mmap`	apply baked ChatML template · single slot · metrics · weights in RAM

03 · AGENTIC CODING / TOOLS

Nex-N2-mini is an agentic / "thinking" coder — agentic tool-use trained. To get native tool calls, your client must use the qwen3_coder tool-call parser. Without it the model tends to narrate code instead of emitting structured tool calls.

CHAT TEMPLATE	Qwen (ChatML) — baked into the GGUF; pass `--jinja`
TOOL-CALL PARSER	`qwen3_coder` — set in your client/runtime
SAMPLING	temp `0.6` · top-p `0.95` · top-k `20` (base-model recommended)

04 · PERFORMANCE & QUALITY

DECODE · short-context	~72 t/s (Vulkan / Ryzen AI Max+ 395)
SWE-BENCH VERIFIED · base model	74.4
ACTIVE PARAMS	3B of 35B (high-sparsity MoE)
QUANTIZATION	fast single-scale body + f16 embeddings + Q6 head + code-weighted imatrix

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. It keeps the two quality levers that are actually felt — genuine f16 token embeddings and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body. A leaner 4-bit-output-head build is a few tok/s faster but degrades fidelity you'll notice; an all-dual-scale body buys a KL improvement that sits inside the measurement noise while costing decode speed. The fast body + f16 embeddings + Q6 head is the point where those meet best.

How we landed on this recipe. We ran the full body-kernel / head-precision / dual-scale sweep — KL divergence vs the BF16 reference plus llama-bench decode — on the dense Qwen3.6-27B sibling, where the same q4_0_rocmfp4 levers apply. The frontier there was unambiguous: the all-dual-scale body and selective higher-precision tensors both traded decode speed for a KL gain inside the noise, so the fast body + f16 embeddings + Q6 head won the balance. We carry that conclusion to this MoE rather than re-running the whole sweep per model — see the 27B sweep for the numbers and the format-limit reasoning. (Directional internal measurements — reproduce before citing.)

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Grab a Q6_K / Q8_0 GGUF of the base from nex-agi/Nex-N2-mini — those higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, a higher-bit quant of the base is the one to grab.

The imatrix — code-weighted, and measured (it helps here). Quantized with an importance matrix from a code-weighted calibration mix (~2.6:1 code:general — eaddario code + Kalomaze groups_merged via froggeric/imatrix). Measured by KL-divergence + perplexity vs the true BF16 on a held-out code slice (disjoint from calibration):

Metric (vs BF16, held-out code)	No-imatrix	Imatrix	Change
Perplexity	4.076	4.013	−1.5% (recovers >½ the 4-bit loss; ~3.3σ)
Median KLD	0.0184	0.0159	−13%
RMS Δp	8.57%	8.00%	−7%
Same top token as BF16	88.97%	89.44%	+0.5 pp

For this model the imatrix is a clean win — better on every metric, including perplexity. (It's model-dependent — on the dense Qwopus-Coder the same recipe worsened code-PPL, so we shipped that one without imatrix. Always measure.)

05 · BUILD (REPRODUCIBLE)

# code-weighted imatrix on the BF16 (single pass)
llama-imatrix -m Nex-N2-mini-bf16.gguf -f code-weighted-calib.txt -o nexn2.imatrix -c 512 -ngl 999

# quant -> ROCmFP4 with the imatrix + genuine f16 embeddings
llama-quantize --token-embedding-type f16 --imatrix nexn2.imatrix \
  Nex-N2-mini-bf16.gguf \
  Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix.gguf  Q4_0_ROCMFP4_STRIX

# THE ONE BUILD (★): add the Q6_K output head on the fast single-scale body — best speed/quality balance (§04)
llama-quantize --token-embedding-type f16 --output-tensor-type q6_K --imatrix nexn2.imatrix \
  Nex-N2-mini-bf16.gguf \
  Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 · LINEAGE & CREDITS

BASE MODEL	nex-agi/Nex-N2-mini (Apache-2.0) · Qwen3.5-35B-A3B lineage (35B total / 3B active MoE)
CALIBRATION	eaddario/imatrix-calibration (code) + Kalomaze `groups_merged` via froggeric/imatrix (general)
FORMAT + RUNTIME	charlie12345/rocmfp4-llama (based on llama.cpp, MIT)

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month: 2,998

GGUF

Model size

35B params

Architecture

qwen35moe

Hardware compatibility

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Nex-N2-mini-ROCmFP4-GGUF

Base model

nex-agi/Nex-N2-mini

Quantized

(51)

this model

FORMAT ROCmFP4 4-BIT	PRECISION ~4.5 BPW	SIZE 18.4 GB	CONTEXT 131 K
ARCH qwen35moe	PARAMS 35B / 3B ACTIVE	BACKEND VULKAN0	LICENSE APACHE-2.0