PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO ยท gfx1151
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–†โ–„โ–     โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ––โ–€โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™      โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–˜
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–†โ–ƒโ–  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™โ–โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ––  โ–—โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›  
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–† โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€ โ–€โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–˜   
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–โ–€โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–– โ–”โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–˜ โ–€โ–ˆโ–›     
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–โ–—โ–‚โ–”โ–€โ–œโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–   โ–โ–ˆโ–ˆโ–ˆโ–›โ–โ–Ÿโ–™        
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–โ–โ–ˆโ–‡โ–…โ–‚โ–”โ–€ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–    โ–”โ–€โ–˜โ–ƒโ–ˆโ–ˆโ–ˆโ–ˆโ––      
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–„โ–‚โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€โ–€  โ–—โ–Ÿโ–ˆโ––โ–โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™โ–    
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–…โ–…โ–…โ–…โ–…โ–…โ–…โ–…โ–…โ–…โ–…โ–…โ–… โ–„โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ––   
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–  โ–โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›โ–—โ–Ÿโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–˜  โ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–™โ– 
โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–     โ–€โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–Žโ–โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–˜โ–„โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–›      โ–œโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ––
NEX-N2-MINI
4-BIT ROCmFP4 ยท CODE-WEIGHTED IMATRIX ยท HIGH-SPARSITY MoE (3B ACTIVE) ยท AGENTIC CODER ยท SINGLE AMD APU
FORMAT
ROCmFP4 4-BIT
PRECISION
~4.5 BPW
SIZE
18.4 GB
CONTEXT
131 K
ARCH
qwen35moe
PARAMS
35B / 3B ACTIVE
BACKEND
VULKAN0
LICENSE
APACHE-2.0
โš  REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama ยท branch mtp-rocmfp4-strix.
NOTE // Ignore HuggingFace's auto-detected "F16" badge โ€” its parser can't read ROCmFP4 and mislabels by the f16 embeddings. These are ~4.5 bpw 4-bit files; pick by filename.
01 ยท FILES
File Size Output head Pick if
โ€ฆ-STRIX-embF16-imatrix-headQ6.gguf โ˜…18.4 GBQ6_Kthe one build โ€” best speed/quality balance: f16 embeddings + Q6 output head on the fast single-scale body

One file โ€” the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt โ€” genuine f16 token embeddings (from BF16) and a Q6_K output head โ€” on the fast single-scale q4_0_rocmfp4_fast body + the code-weighted imatrix (see ยง04). Not the leanest-fastest possible (a 4-bit output head squeezes out a few more tok/s, at a fidelity cost), and not the most faithful possible (see the base-model fidelity link in ยง04) โ€” it's the point where speed and quality meet best. The Qwen (ChatML) chat template is baked into the GGUF โ€” just pass --jinja.

02 ยท QUICK START

Run from the folder holding the .gguf:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf \
  --alias nex-n2-mini \
  --host 0.0.0.0 \
  --port 8080 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -c 131072 \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -ctk f16 \
  -ctv f16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap
NOTE // No --spec-* / --spec-type draft-mtp flags โ€” Nex-N2-mini ships without an MTP head (non-speculative). At ~72 t/s it doesn't need speculative decoding to be quick.
Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1allow use of the full 128 GB unified memory
-dev Vulkan0run on Vulkan โ€” fastest backend for ROCmFP4 on Strix Halo
-ngl 999 ยท -fa onoffload all layers ยท flash attention
-c 131072context length (128K)
-b 2048 ยท -ub 256 ยท -t/-tb 16prefill batch / micro-batch ยท CPU threads
-ctk f16 ยท -ctv f16f16 KV cache โ€” how we run it; drop to q8_0/q4_0 to use less memory
-cpent ยท -ctxcp ยท --cache-reuse ยท --cache-ram 65536cross-turn KV checkpointing + 64 GB resident reuse cache
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0base-model recommended sampling
--jinja --parallel 1 --metrics --no-mmapapply baked ChatML template ยท single slot ยท metrics ยท weights in RAM
03 ยท AGENTIC CODING / TOOLS

Nex-N2-mini is an agentic / "thinking" coder โ€” agentic tool-use trained. To get native tool calls, your client must use the qwen3_coder tool-call parser. Without it the model tends to narrate code instead of emitting structured tool calls.

CHAT TEMPLATEQwen (ChatML) โ€” baked into the GGUF; pass --jinja
TOOL-CALL PARSERqwen3_coder โ€” set in your client/runtime
SAMPLINGtemp 0.6 ยท top-p 0.95 ยท top-k 20 (base-model recommended)
04 ยท PERFORMANCE & QUALITY
DECODE ยท short-context~72 t/s (Vulkan / Ryzen AI Max+ 395)
SWE-BENCH VERIFIED ยท base model74.4
ACTIVE PARAMS3B of 35B (high-sparsity MoE)
QUANTIZATIONfast single-scale body + f16 embeddings + Q6 head + code-weighted imatrix

This is the best speed/quality balance in ROCmFP4 โ€” by design, not the absolute fastest. It keeps the two quality levers that are actually felt โ€” genuine f16 token embeddings and a Q6_K output head โ€” on the fast single-scale q4_0_rocmfp4_fast body. A leaner 4-bit-output-head build is a few tok/s faster but degrades fidelity you'll notice; an all-dual-scale body buys a KL improvement that sits inside the measurement noise while costing decode speed. The fast body + f16 embeddings + Q6 head is the point where those meet best.

How we landed on this recipe. We ran the full body-kernel / head-precision / dual-scale sweep โ€” KL divergence vs the BF16 reference plus llama-bench decode โ€” on the dense Qwen3.6-27B sibling, where the same q4_0_rocmfp4 levers apply. The frontier there was unambiguous: the all-dual-scale body and selective higher-precision tensors both traded decode speed for a KL gain inside the noise, so the fast body + f16 embeddings + Q6 head won the balance. We carry that conclusion to this MoE rather than re-running the whole sweep per model โ€” see the 27B sweep for the numbers and the format-limit reasoning. (Directional internal measurements โ€” reproduce before citing.)

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Grab a Q6_K / Q8_0 GGUF of the base from nex-agi/Nex-N2-mini โ€” those higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, a higher-bit quant of the base is the one to grab.

The imatrix โ€” code-weighted, and measured (it helps here). Quantized with an importance matrix from a code-weighted calibration mix (~2.6:1 code:general โ€” eaddario code + Kalomaze groups_merged via froggeric/imatrix). Measured by KL-divergence + perplexity vs the true BF16 on a held-out code slice (disjoint from calibration):

Metric (vs BF16, held-out code) No-imatrix Imatrix Change
Perplexity4.0764.013โˆ’1.5% (recovers >ยฝ the 4-bit loss; ~3.3ฯƒ)
Median KLD0.01840.0159โˆ’13%
RMS ฮ”p8.57%8.00%โˆ’7%
Same top token as BF1688.97%89.44%+0.5 pp

For this model the imatrix is a clean win โ€” better on every metric, including perplexity. (It's model-dependent โ€” on the dense Qwopus-Coder the same recipe worsened code-PPL, so we shipped that one without imatrix. Always measure.)

05 ยท BUILD (REPRODUCIBLE)
# code-weighted imatrix on the BF16 (single pass)
llama-imatrix -m Nex-N2-mini-bf16.gguf -f code-weighted-calib.txt -o nexn2.imatrix -c 512 -ngl 999

# quant -> ROCmFP4 with the imatrix + genuine f16 embeddings
llama-quantize --token-embedding-type f16 --imatrix nexn2.imatrix \
  Nex-N2-mini-bf16.gguf \
  Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix.gguf  Q4_0_ROCMFP4_STRIX

# THE ONE BUILD (โ˜…): add the Q6_K output head on the fast single-scale body โ€” best speed/quality balance (ยง04)
llama-quantize --token-embedding-type f16 --output-tensor-type q6_K --imatrix nexn2.imatrix \
  Nex-N2-mini-bf16.gguf \
  Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo โ€” hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 ยท LINEAGE & CREDITS
BASE MODELnex-agi/Nex-N2-mini (Apache-2.0) ยท Qwen3.5-35B-A3B lineage (35B total / 3B active MoE)
CALIBRATIONeaddario/imatrix-calibration (code) + Kalomaze groups_merged via froggeric/imatrix (general)
FORMAT + RUNTIMEcharlie12345/rocmfp4-llama (based on llama.cpp, MIT)

Derivative quantization โ€” verify the base model's license before redistribution / use.

Downloads last month
2,998
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for plunderstruck/Nex-N2-mini-ROCmFP4-GGUF

Quantized
(51)
this model