Nex-N2-mini IQ3_XXS

imatrix-calibrated IQ3_XXS quant of nex-agi/Nex-N2-mini. 13.6 GB on disk, so it fits in 15 GB of GPU memory with room for context. smallest quant of this model on the hub as of 2026-06-09.

made it because i wanted to run nex-n2-mini on my laptop's AMD iGPU (15 GB GTT cap) and every existing quant was 14 GB+.

gets ~14 tok/s on CPU only (Ryzen 7 PRO 7735U, no GPU offload). vulkan offload pushes it higher.

architecture

| field | value |
|---|---|
| Base | Qwen3.5-35B-A3B-Base (post-trained by Nex AGI) |
| Architecture | qwen35moe |
| Total params | ~35B |
| Active params | ~3B per token |
| Experts | 256 total, 8 routed + 1 shared per token |
| Hidden size | 2048 |
| Trunk layers | 40 (MTP head not included, see below) |
| Train context | 262144 |
| Vocab | 248320 |
| Vision | not in this GGUF (text-only, see "stuff to know") |

file

| file | quant | size | bpw | notes |
|---|---|---|---|---|
| Nex-N2-mini-IQ3_XXS.gguf | IQ3_XXS | 13.6 GB | 3.14 | attention kept at Q4_K, FFN experts pushed to IQ3 |
| imatrix.dat | | 183 MB | | importance matrix, re-quantize from this if you want a different size |
| patch_gguf.py | | 3.5 KB | | fixes the MTP load error (see below) |
| Modelfile | | 1 KB | | for ollama create |

using it

needs a llama.cpp from after 2026-02-10 (when qwen35moe arch landed in PR #19468).

LM Studio: drop the gguf in ~/.lmstudio/models/<you>/Nex-N2-mini-GGUF/, load it. update the bundled llama.cpp runtime to 2.13+ if it refuses to load.

Ollama 0.19+:

ollama create nex-n2-mini -f Modelfile
ollama run nex-n2-mini "hi"

llama-cli (ChatML):

llama-cli -m Nex-N2-mini-IQ3_XXS.gguf -ngl 999 \
  -p $'<|im_start|>system\nyou are helpful<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n' \
  -n 100
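
if you're building that prompt from a wrapper instead of typing it by hand, here's a minimal sketch of the same ChatML layout in python. the function name `chatml_prompt` and the message-dict shape are my own choices for illustration, not part of llama.cpp:

```python
def chatml_prompt(messages, add_generation_prompt=True):
    """Render [{"role": ..., "content": ...}, ...] as a ChatML prompt string."""
    parts = []
    for msg in messages:
        # each turn is wrapped as <|im_start|>ROLE\nCONTENT<|im_end|>\n
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # leave the assistant turn open so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = chatml_prompt([
    {"role": "system", "content": "you are helpful"},
    {"role": "user", "content": "hi"},
])
```

`prompt` ends up as the same string passed to `-p` in the llama-cli command.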

ollama before 0.19 won't work: it's too old to know the qwen35moe arch.

if you quantize it yourself

you'll hit:

missing tensor 'blk.39.nextn.eh_proj.weight'

took me forever to figure out. nex agi didn't release the MTP draft head weights with the public release, but config.json claims they exist, so the convert script writes "has MTP" into the GGUF header and llama.cpp's loader then refuses because it can't find the tensor.

two metadata values to flip:

qwen35moe.nextn_predict_layers: 1 → 0
qwen35moe.block_count:          41 → 40

patch_gguf.py in this repo does it. 4-byte edits, idempotent, takes 30 seconds. way faster than re-converting from safetensors (8h on a laptop).
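
for the curious, here's roughly what an in-place patch like that looks like. this is a stdlib-only sketch, not the actual patch_gguf.py: it walks the GGUF metadata header and overwrites the two values where they sit. it assumes a little-endian GGUF v2/v3 file and that both keys are stored as uint32 (which matches the "4-byte edits" above). `patch_uint32` and `patch_file` are hypothetical names.

```python
import mmap
import struct

# GGUF scalar value types -> little-endian struct formats (8 = string, 9 = array)
SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
T_UINT32, T_STRING, T_ARRAY = 4, 8, 9


def _read(buf, fmt, pos):
    (val,) = struct.unpack_from(fmt, buf, pos)
    return val, pos + struct.calcsize(fmt)


def _skip_value(buf, vtype, pos):
    """Return the offset just past a metadata value of type `vtype`."""
    if vtype in SCALARS:
        return pos + struct.calcsize(SCALARS[vtype])
    if vtype == T_STRING:
        n, pos = _read(buf, "<Q", pos)  # uint64 length, then the bytes
        return pos + n
    if vtype == T_ARRAY:
        etype, pos = _read(buf, "<I", pos)
        count, pos = _read(buf, "<Q", pos)
        for _ in range(count):
            pos = _skip_value(buf, etype, pos)
        return pos
    raise ValueError(f"unknown GGUF value type {vtype}")


def patch_uint32(buf, wanted):
    """Overwrite uint32 metadata values in a writable buffer.

    `wanted` maps key -> new value. Returns the set of keys that were found.
    """
    if buf[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    pos = 4 + 4 + 8  # skip magic, version (uint32), tensor count (uint64)
    n_kv, pos = _read(buf, "<Q", pos)
    found = set()
    for _ in range(n_kv):
        klen, pos = _read(buf, "<Q", pos)
        key = bytes(buf[pos:pos + klen]).decode("utf-8")
        pos += klen
        vtype, pos = _read(buf, "<I", pos)
        if key in wanted:
            if vtype != T_UINT32:
                raise TypeError(f"{key} is not stored as uint32")
            struct.pack_into("<I", buf, pos, wanted[key])
            found.add(key)
        pos = _skip_value(buf, vtype, pos)
    return found


def patch_file(path):
    # mmap the file read-write so only the two 4-byte values are touched
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as buf:
        return patch_uint32(buf, {
            "qwen35moe.nextn_predict_layers": 0,
            "qwen35moe.block_count": 40,
        })
```

since it writes fixed target values rather than toggling, running it twice leaves the file unchanged, which is what makes this kind of patch idempotent. keep a backup of the F16 file anyway.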

stuff to know

  • reasoning model — outputs contain <think>...</think> blocks. handle them in your wrapper or strip them
  • no MTP speedup — weights aren't in the public release. inference works fine, you just don't get the speculative-decoding bonus
  • text only — the base model's config.json has vision_config + image/video token slots, but llama.cpp's qwen35moe converter is text-only (PR #19468 literally titled "no vision"). if you want vision, look at quants that ship an mmproj-*.gguf alongside
  • Q3 means ~1-3% benchmark drop vs Q4 — for chat and tool-calling i can't tell the difference. for code/math it'll be more noticeable
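
for the `<think>` blocks, a wrapper can split them off before showing the answer. a minimal sketch, assuming the tags come through as literal text in the completion. `split_reasoning` is a made-up helper name, and the unclosed-tag handling is my guess at what you'd want when generation gets cut off mid-thought:

```python
import re

_THINK_BLOCK = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(text):
    """Return (answer, reasoning_blocks) for a raw model completion."""
    reasoning = [m.strip() for m in _THINK_BLOCK.findall(text)]
    answer = _THINK_BLOCK.sub("", text).strip()
    # an unclosed <think> means the output was truncated while still reasoning,
    # so everything after the tag is reasoning, not answer
    open_idx = answer.find("<think>")
    if open_idx != -1:
        reasoning.append(answer[open_idx + len("<think>"):].strip())
        answer = answer[:open_idx].strip()
    return answer, reasoning


answer, reasoning = split_reasoning("<think>\nuser says hi\n</think>\n\nHello!")
```

keeping the reasoning around (instead of just deleting it) is handy for debugging tool-calling loops.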

how i made it

huggingface-cli download nex-agi/Nex-N2-mini --local-dir source
python convert_hf_to_gguf.py source --outtype f16
python patch_gguf.py source/*-F16.gguf
llama-imatrix -m <f16.gguf> -f calibration.txt -o imatrix.dat --chunks 50
llama-quantize --imatrix imatrix.dat <f16.gguf> Nex-N2-mini-IQ3_XXS.gguf IQ3_XXS

calibration text = Pride and Prejudice from gutenberg + the nex-n2 README (~750 KB total, 50 chunks of 512 tokens). imatrix-aware quantizer kept attention tensors at Q4_K and pushed expert FFN weights down to IQ3 — ended up at 3.14 bpw avg.

imatrix.dat is in the repo if you want to re-quantize to IQ2_S, Q4_K_S, or anything else without redoing the calibration pass.
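
before re-quantizing, it helps to guess whether the result will fit your memory cap. a back-of-the-envelope sketch: file size is roughly params × bits-per-weight / 8. it assumes the ~35B total param figure from the table above and decimal GB, and it ignores metadata overhead, so treat the numbers as estimates. both function names are made up for this example.

```python
def gguf_size_gb(n_params, bpw):
    """Approximate GGUF size in decimal GB for an average bits-per-weight."""
    return n_params * bpw / 8 / 1e9


def max_bpw(n_params, budget_gb):
    """Highest average bpw whose weights alone fit in `budget_gb`."""
    return budget_gb * 1e9 * 8 / n_params


N_PARAMS = 35e9  # "~35B" from the architecture table, so only approximate

this_quant = gguf_size_gb(N_PARAMS, 3.14)  # lands near the 13.6 GB on disk
ceiling = max_bpw(N_PARAMS, 15)            # upper bound for a 15 GB cap
```

the ceiling is for weights only. context and KV cache come out of the same budget, so aim below it.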


base model © Nex AGI · base architecture © Qwen team · llama.cpp © ggml-org. apache 2.0, same as the base.
