Instructions for using Chanito91/Nex-N2-mini-GGUF with libraries and local apps.

Libraries

How to use Chanito91/Nex-N2-mini-GGUF with llama-cpp-python:

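# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Downloads the GGUF from the Hub on first use (requires huggingface-hub)
llm = Llama.from_pretrained(
    repo_id="Chanito91/Nex-N2-mini-GGUF",
    filename="Nex-N2-mini-IQ3_XXS.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

# The reply text sits in the first choice's message
print(response["choices"][0]["message"]["content"])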

llama.cpp

How to use Chanito91/Nex-N2-mini-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS
# Run inference directly in the terminal:
llama-cli -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS
# Run inference directly in the terminal:
llama-cli -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS
# Run inference directly in the terminal:
./llama-cli -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS

Use Docker

docker model run hf.co/Chanito91/Nex-N2-mini-GGUF:IQ3_XXS
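
Once `llama-server` is running from any of the install routes above, it can be called from Python as well as from the web UI. The following is a minimal stdlib-only sketch, assuming the server listens on its default port 8080 and exposes the OpenAI-compatible `/v1/chat/completions` route. The helper names `build_chat_request` and `chat` are made up for this example.

```python
import json
import urllib.request

# Assumption: llama-server was started locally on its default port.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for a single-turn chat."""
    payload = {
        "model": "Chanito91/Nex-N2-mini-GGUF:IQ3_XXS",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url=BASE_URL, timeout=300):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running server:
# print(chat("What is the capital of France?"))
```

Building the request separately from sending it makes it easy to inspect the payload without a server running.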

vLLM

How to use Chanito91/Nex-N2-mini-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Chanito91/Nex-N2-mini-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Chanito91/Nex-N2-mini-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use Chanito91/Nex-N2-mini-GGUF with Ollama:
```
ollama run hf.co/Chanito91/Nex-N2-mini-GGUF:IQ3_XXS
```

Unsloth Studio

How to use Chanito91/Nex-N2-mini-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Chanito91/Nex-N2-mini-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Chanito91/Nex-N2-mini-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Chanito91/Nex-N2-mini-GGUF to start chatting

Pi

How to use Chanito91/Nex-N2-mini-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Chanito91/Nex-N2-mini-GGUF:IQ3_XXS"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Chanito91/Nex-N2-mini-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Chanito91/Nex-N2-mini-GGUF:IQ3_XXS

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Chanito91/Nex-N2-mini-GGUF:IQ3_XXS

Run Hermes

hermes

Docker Model Runner
How to use Chanito91/Nex-N2-mini-GGUF with Docker Model Runner:
```
docker model run hf.co/Chanito91/Nex-N2-mini-GGUF:IQ3_XXS
```

Lemonade

How to use Chanito91/Nex-N2-mini-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Chanito91/Nex-N2-mini-GGUF:IQ3_XXS

Run and chat with the model

lemonade run user.Nex-N2-mini-GGUF-IQ3_XXS

List all available models

lemonade list

Nex-N2-mini IQ3_XXS

imatrix-calibrated IQ3_XXS quant of nex-agi/Nex-N2-mini. 13.6 GB on disk, fits in 15 GB of GPU memory with room for context. smallest quant of this model on the hub as of 2026-06-09.

made it because i wanted to run nex-n2-mini on my laptop's AMD iGPU (15 GB GTT cap) and every existing quant was 14 GB+.

gets ~14 tok/s on CPU only (Ryzen 7 PRO 7735U, no GPU offload). vulkan offload pushes it higher.

architecture


| | |
|---|---|
| Base | Qwen3.5-35B-A3B-Base (post-trained by Nex AGI) |
| Architecture | `qwen35moe` |
| Total params | ~35B |
| Active params | ~3B per token |
| Experts | 256 total, 8 routed + 1 shared per token |
| Hidden size | 2048 |
| Trunk layers | 40 (MTP head not included, see below) |
| Train context | 262144 |
| Vocab | 248320 |
| Vision | not in this GGUF (text-only, see "stuff to know") |

file

| file | quant | size | bpw | notes |
|---|---|---|---|---|
| `Nex-N2-mini-IQ3_XXS.gguf` | IQ3_XXS | 13.6 GB | 3.14 | attention kept at Q4_K, FFN experts pushed to IQ3 |
| `imatrix.dat` | - | 183 MB | - | importance matrix, re-quantize from this if you want a different size |
| `patch_gguf.py` | - | 3.5 KB | - | fixes the MTP load error (see below) |
| `Modelfile` | - | 1 KB | - | for `ollama create` |
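
the bpw column is just file size in bits divided by weight count. a quick sanity check, assuming "GB" here means 10^9 bytes and ignoring the small amount of non-weight data in the file (both are my assumptions, and the function names are made up for this sketch):

```python
def bits_per_weight(file_bytes, n_params):
    """Average bits stored per weight for a file of the given size."""
    return file_bytes * 8 / n_params


def implied_params(file_bytes, bpw):
    """Weight count implied by a file size and an average bits-per-weight."""
    return file_bytes * 8 / bpw


# 13.6 GB spread over a flat 35B weights
print(round(bits_per_weight(13.6e9, 35e9), 2))

# weight count implied by 13.6 GB at the listed 3.14 bpw, in billions
print(round(implied_params(13.6e9, 3.14) / 1e9, 1))
```

a flat 35B gives about 3.11 bpw, and the listed 3.14 corresponds to roughly 34.6B weights, which fits the "~35B" rounding in the architecture table.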

using it

needs a llama.cpp build from after 2026-02-10, when the qwen35moe arch landed (PR #19468).

LM Studio: drop the gguf in ~/.lmstudio/models/<you>/Nex-N2-mini-GGUF/, load it. update the bundled llama.cpp runtime to 2.13+ if it refuses to load.

Ollama 0.19+:

ollama create nex-n2-mini -f Modelfile
ollama run nex-n2-mini "hi"

llama-cli (ChatML):

llama-cli -m Nex-N2-mini-IQ3_XXS.gguf -ngl 999 \
  -p $'<|im_start|>system\nyou are helpful<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n' \
  -n 100
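
if you're building the prompt from code instead of typing it into the shell, the same ChatML layout can be assembled like this. a minimal sketch: `chatml_prompt` is a made-up helper, and it only covers plain system/user/assistant turns as in the command above (no tool calls).

```python
def chatml_prompt(messages, add_generation_prompt=True):
    """Render role/content dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # leave the assistant turn open so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = chatml_prompt([
    {"role": "system", "content": "you are helpful"},
    {"role": "user", "content": "hi"},
])
print(prompt)
```

with these two messages the result is the same string as the `-p` argument in the llama-cli command.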

ollama before 0.19 won't work: it's too old to know the qwen35moe arch.

if you quantize it yourself

you'll hit:

missing tensor 'blk.39.nextn.eh_proj.weight'

took me forever to figure out. nex agi didn't release the MTP draft head weights with the public release, but config.json claims they exist, so the convert script writes "has MTP" into the GGUF header and llama.cpp's loader then refuses because it can't find the tensor.

two metadata values to flip:

qwen35moe.nextn_predict_layers: 1 → 0
qwen35moe.block_count:          41 → 40

patch_gguf.py in this repo does it. 4-byte edits, idempotent, takes 30 seconds. way faster than re-converting from safetensors (8h on a laptop).
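
for the curious, here's roughly what an in-place metadata patch looks like. this is a sketch, not the actual patch_gguf.py. it assumes the GGUF v2/v3 header layout (magic, uint32 version, uint64 tensor count, uint64 key/value count) and that both keys are stored as uint32, which is what "4-byte edits" suggests. all function names are made up.

```python
import struct

# little-endian struct formats for GGUF scalar value types (ids from the GGUF spec)
_SCALAR_FMT = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
               6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_UINT32, _STRING, _ARRAY = 4, 8, 9


def _skip_value(buf, pos, vtype):
    """Return the offset just past the value of type vtype that starts at pos."""
    if vtype in _SCALAR_FMT:
        return pos + struct.calcsize(_SCALAR_FMT[vtype])
    if vtype == _STRING:  # uint64 length, then raw bytes
        (length,) = struct.unpack_from("<Q", buf, pos)
        return pos + 8 + length
    if vtype == _ARRAY:  # uint32 element type, uint64 count, then the elements
        elem_type, count = struct.unpack_from("<IQ", buf, pos)
        pos += 12
        if elem_type in _SCALAR_FMT:
            return pos + count * struct.calcsize(_SCALAR_FMT[elem_type])
        for _ in range(count):
            pos = _skip_value(buf, pos, elem_type)
        return pos
    raise ValueError(f"unknown GGUF value type {vtype}")


def find_uint32_offsets(buf, keys):
    """Map each key in keys to the byte offset of its uint32 value."""
    if bytes(buf[:4]) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version, _n_tensors, n_kv = struct.unpack_from("<IQQ", buf, 4)
    pos, offsets = 24, {}
    for _ in range(n_kv):
        (key_len,) = struct.unpack_from("<Q", buf, pos)
        key = bytes(buf[pos + 8 : pos + 8 + key_len]).decode("utf-8")
        pos += 8 + key_len
        (vtype,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if key in keys and vtype == _UINT32:
            offsets[key] = pos
        pos = _skip_value(buf, pos, vtype)
    return offsets


def patch_uint32(buf, edits):
    """Overwrite uint32 metadata values in place; rerunning changes nothing."""
    offsets = find_uint32_offsets(buf, edits)
    missing = set(edits) - set(offsets)
    if missing:
        raise KeyError(f"uint32 keys not found: {sorted(missing)}")
    for key, value in edits.items():
        struct.pack_into("<I", buf, offsets[key], value)


# usage on a real file (hypothetical file name), via mmap so nothing is copied:
# import mmap
# with open("model-F16.gguf", "r+b") as f, mmap.mmap(f.fileno(), 0) as buf:
#     patch_uint32(buf, {"qwen35moe.nextn_predict_layers": 0,
#                        "qwen35moe.block_count": 40})
```

only the metadata section gets walked, so the tensor data is never read or rewritten. that's why a patch like this is so much cheaper than re-converting.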

stuff to know

- reasoning model: outputs contain <think>...</think> blocks. handle them in your wrapper or strip them
- no MTP speedup: weights aren't in the public release. inference works fine, you just don't get the speculative-decoding bonus
- text only: the base model's config.json has vision_config + image/video token slots, but llama.cpp's qwen35moe converter is text-only (PR #19468 literally titled "no vision"). if you want vision, look at quants that ship an mmproj-*.gguf alongside
- Q3 means ~1-3% benchmark drop vs Q4: for chat and tool-calling i can't tell the difference. for code/math it'll be more noticeable
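
for the <think> blocks, something like this works in a wrapper. a minimal sketch, assuming both the opening and closing tags show up in the raw output. `split_reasoning` is a made-up name.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a raw completion string."""
    reasoning = "\n".join(block.strip() for block in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text)
    # an unterminated <think> (output cut off by the token limit) is all reasoning
    open_idx = answer.find("<think>")
    if open_idx != -1:
        tail = answer[open_idx + len("<think>"):]
        reasoning = (reasoning + "\n" + tail).strip()
        answer = answer[:open_idx]
    return reasoning, answer.strip()


raw = "<think>\nParis is the capital.\n</think>\n\nParis."
reasoning, answer = split_reasoning(raw)
print(answer)
```

keeping the reasoning instead of throwing it away makes it easy to log it or show it behind a toggle.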

how i made it

huggingface-cli download nex-agi/Nex-N2-mini --local-dir source
python convert_hf_to_gguf.py source --outtype f16
python patch_gguf.py source/*-F16.gguf
llama-imatrix -m <f16.gguf> -f calibration.txt -o imatrix.dat --chunks 50
llama-quantize --imatrix imatrix.dat <f16.gguf> Nex-N2-mini-IQ3_XXS.gguf IQ3_XXS

calibration text = Pride and Prejudice from gutenberg + the nex-n2 README (~750 KB total, 50 chunks of 512 tokens). imatrix-aware quantizer kept attention tensors at Q4_K and pushed expert FFN weights down to IQ3 — ended up at 3.14 bpw avg.

imatrix.dat is in the repo if you want to re-quantize to IQ2_S, Q4_K_S, or anything else without redoing the calibration pass.
