Instructions to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Nex-N2-mini-ROCmFP4-GGUF", filename="Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./llama-cli -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
Use Docker
docker model run hf.co/plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
- LM Studio
- Jan
- Ollama
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Ollama:
ollama run hf.co/plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
- Unsloth Studio
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Nex-N2-mini-ROCmFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Nex-N2-mini-ROCmFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for plunderstruck/Nex-N2-mini-ROCmFP4-GGUF to start chatting
- Pi
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Docker Model Runner:
docker model run hf.co/plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
- Lemonade
How to use plunderstruck/Nex-N2-mini-ROCmFP4-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull plunderstruck/Nex-N2-mini-ROCmFP4-GGUF:BF16
Run and chat with the model
lemonade run user.Nex-N2-mini-ROCmFP4-GGUF-BF16
List all available models
lemonade list
โโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโ โโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโ โโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโ โโโ โโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโ โโโโโโโโโโโ โโโ โโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ โ โ โ โ โ โ โ โ โ โ โโโโโโโโโโโโโโโโ โโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโ โโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโ
FORMAT ROCmFP4 4-BIT |
PRECISION ~4.5 BPW |
SIZE 18.4 GB |
CONTEXT 131 K |
ARCH qwen35moe |
PARAMS 35B / 3B ACTIVE |
BACKEND VULKAN0 |
LICENSE APACHE-2.0 |
The custom
q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/rocmfp4-llama ยท branch mtp-rocmfp4-strix.
One file โ the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt โ genuine f16 token embeddings (from BF16) and a Q6_K output head โ on the fast single-scale q4_0_rocmfp4_fast body + the code-weighted imatrix (see ยง04). Not the leanest-fastest possible (a 4-bit output head squeezes out a few more tok/s, at a fidelity cost), and not the most faithful possible (see the base-model fidelity link in ยง04) โ it's the point where speed and quality meet best. The Qwen (ChatML) chat template is baked into the GGUF โ just pass --jinja.
Run from the folder holding the .gguf:
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
-m Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf \
--alias nex-n2-mini \
--host 0.0.0.0 \
--port 8080 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-c 131072 \
-b 2048 \
-ub 256 \
-t 16 \
-tb 16 \
-ctk f16 \
-ctv f16 \
-cpent 256 \
-ctxcp 32 \
--cache-reuse 256 \
--cache-ram 65536 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--jinja \
--parallel 1 \
--metrics \
--no-mmap
--spec-* / --spec-type draft-mtp flags โ Nex-N2-mini ships without an MTP head (non-speculative). At ~72 t/s it doesn't need speculative decoding to be quick.
Nex-N2-mini is an agentic / "thinking" coder โ agentic tool-use trained. To get native tool calls, your client must use the qwen3_coder tool-call parser. Without it the model tends to narrate code instead of emitting structured tool calls.
This is the best speed/quality balance in ROCmFP4 โ by design, not the absolute fastest. It keeps the two quality levers that are actually felt โ genuine f16 token embeddings and a Q6_K output head โ on the fast single-scale q4_0_rocmfp4_fast body. A leaner 4-bit-output-head build is a few tok/s faster but degrades fidelity you'll notice; an all-dual-scale body buys a KL improvement that sits inside the measurement noise while costing decode speed. The fast body + f16 embeddings + Q6 head is the point where those meet best.
How we landed on this recipe. We ran the full body-kernel / head-precision / dual-scale sweep โ KL divergence vs the BF16 reference plus llama-bench decode โ on the dense Qwen3.6-27B sibling, where the same q4_0_rocmfp4 levers apply. The frontier there was unambiguous: the all-dual-scale body and selective higher-precision tensors both traded decode speed for a KL gain inside the noise, so the fast body + f16 embeddings + Q6 head won the balance. We carry that conclusion to this MoE rather than re-running the whole sweep per model โ see the 27B sweep for the numbers and the format-limit reasoning. (Directional internal measurements โ reproduce before citing.)
The imatrix โ code-weighted, and measured (it helps here). Quantized with an importance matrix from a code-weighted calibration mix (~2.6:1 code:general โ eaddario code + Kalomaze groups_merged via froggeric/imatrix). Measured by KL-divergence + perplexity vs the true BF16 on a held-out code slice (disjoint from calibration):
For this model the imatrix is a clean win โ better on every metric, including perplexity. (It's model-dependent โ on the dense Qwopus-Coder the same recipe worsened code-PPL, so we shipped that one without imatrix. Always measure.)
# code-weighted imatrix on the BF16 (single pass)
llama-imatrix -m Nex-N2-mini-bf16.gguf -f code-weighted-calib.txt -o nexn2.imatrix -c 512 -ngl 999
# quant -> ROCmFP4 with the imatrix + genuine f16 embeddings
llama-quantize --token-embedding-type f16 --imatrix nexn2.imatrix \
Nex-N2-mini-bf16.gguf \
Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix.gguf Q4_0_ROCMFP4_STRIX
# THE ONE BUILD (โ
): add the Q6_K output head on the fast single-scale body โ best speed/quality balance (ยง04)
llama-quantize --token-embedding-type f16 --output-tensor-type q6_K --imatrix nexn2.imatrix \
Nex-N2-mini-bf16.gguf \
Nex-N2-mini-ROCmFP4-STRIX-embF16-imatrix-headQ6.gguf Q4_0_ROCMFP4_STRIX
Experimental research build for AMD Strix Halo โ hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.
Derivative quantization โ verify the base model's license before redistribution / use.
- Downloads last month
- 2,998
16-bit
Model tree for plunderstruck/Nex-N2-mini-ROCmFP4-GGUF
Base model
nex-agi/Nex-N2-mini