Nex-N2-mini IQ3_XXS
imatrix-calibrated IQ3_XXS of nex-agi/Nex-N2-mini. 13.6 GB on disk, fits in 15 GB GPU memory with room for context. smallest quant of this model on the hub as of 2026-06-09.
made it because i wanted to run nex-n2-mini on my laptop's AMD iGPU (15 GB GTT cap) and every existing quant was 14 GB+.
gets ~14 tok/s on CPU only (Ryzen 7 PRO 7735U, no GPU offload). vulkan offload pushes it higher.
architecture
| field | value |
|---|---|
| Base | Qwen3.5-35B-A3B-Base (post-trained by Nex AGI) |
| Architecture | qwen35moe |
| Total params | ~35B |
| Active params | ~3B per token |
| Experts | 256 total, 8 routed + 1 shared per token |
| Hidden size | 2048 |
| Trunk layers | 40 (MTP head not included, see below) |
| Train context | 262144 |
| Vocab | 248320 |
| Vision | not in this GGUF (text-only, see "stuff to know") |
file
| file | quant | size | bpw | notes |
|---|---|---|---|---|
| Nex-N2-mini-IQ3_XXS.gguf | IQ3_XXS | 13.6 GB | 3.14 | attention kept at Q4_K, FFN experts pushed to IQ3 |
| imatrix.dat | - | 183 MB | - | importance matrix, re-quantize from this if you want a different size |
| patch_gguf.py | - | 3.5 KB | - | fixes the MTP load error (see below) |
| Modelfile | - | 1 KB | - | for ollama create |
using it
needs a llama.cpp build from after 2026-02-10 (when the qwen35moe arch landed in PR #19468).
LM Studio: drop the gguf in ~/.lmstudio/models/<you>/Nex-N2-mini-GGUF/, load it. update the bundled llama.cpp runtime to 2.13+ if it refuses to load.
Ollama 0.19+:
ollama create nex-n2-mini -f Modelfile
ollama run nex-n2-mini "hi"
llama-cli (ChatML):
llama-cli -m Nex-N2-mini-IQ3_XXS.gguf -ngl 999 \
-p $'<|im_start|>system\nyou are helpful<|im_end|>\n<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n' \
-n 100
ollama before 0.19 won't work: it's too old to know the qwen35moe arch.
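if you're building the prompt from code instead of typing it by hand, here's a minimal sketch of the same ChatML layout used in the llama-cli command above. `build_chatml` is a made-up helper name, not part of llama.cpp or any library, and it only covers plain system/user/assistant text turns (no tool calls).

```python
def build_chatml(messages):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string.

    The prompt ends with an open assistant turn so the model continues from there.
    """
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml([
    {"role": "system", "content": "you are helpful"},
    {"role": "user", "content": "hi"},
])
print(repr(prompt))
```

the resulting string is what goes into `-p` (or any raw completion call). chat-style endpoints normally apply a template for you, so this is only needed for raw prompts.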
if you quantize it yourself
you'll hit:
missing tensor 'blk.39.nextn.eh_proj.weight'
took me forever to figure out. nex agi didn't release the MTP draft head weights with the public release, but config.json claims they exist, so the convert script writes "has MTP" into the GGUF header and llama.cpp's loader then refuses because it can't find the tensor.
two metadata values to flip:
qwen35moe.nextn_predict_layers: 1 → 0
qwen35moe.block_count: 41 → 40
patch_gguf.py in this repo does it. 4-byte edits, idempotent, takes 30 seconds. way faster than re-converting from safetensors (8h on a laptop).
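for the curious, here's roughly what an in-place metadata patch like this looks like. this is a sketch, not the actual patch_gguf.py from the repo. it assumes a little-endian GGUF v2/v3 header (uint64 string lengths) and that both keys are stored as uint32, which is what the "4-byte edits" above suggests. function names are made up for the example. back up the file before trying anything like this.

```python
import io
import struct
import sys

# GGUF metadata value type ids and the byte width of each fixed-size scalar.
SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
T_UINT32, T_STRING, T_ARRAY = 4, 8, 9

# The two metadata values to flip, taken from the list above.
PATCHES = {
    "qwen35moe.nextn_predict_layers": 0,
    "qwen35moe.block_count": 40,
}


def read_string(f):
    """GGUF string: uint64 length followed by that many UTF-8 bytes."""
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def skip_value(f, vtype):
    """Advance past one metadata value without decoding it."""
    if vtype in SCALAR_SIZES:
        f.seek(SCALAR_SIZES[vtype], io.SEEK_CUR)
    elif vtype == T_STRING:
        (length,) = struct.unpack("<Q", f.read(8))
        f.seek(length, io.SEEK_CUR)
    elif vtype == T_ARRAY:
        elem_type, count = struct.unpack("<IQ", f.read(12))
        if elem_type in SCALAR_SIZES:
            f.seek(SCALAR_SIZES[elem_type] * count, io.SEEK_CUR)
        else:
            for _ in range(count):
                skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def patch_uint32_fields(f, patches):
    """Overwrite uint32 metadata values in a GGUF stream opened for read+write.

    Returns {key: (old_value, new_value)} for every key that was found.
    Values that already match are left untouched, so re-running is harmless.
    """
    f.seek(0)
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version, _n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    if version < 2:
        raise ValueError("GGUF v1 uses a different header layout")
    found = {}
    for _ in range(n_kv):
        key = read_string(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        if key in patches and vtype == T_UINT32:
            pos = f.tell()
            (old,) = struct.unpack("<I", f.read(4))
            if old != patches[key]:
                f.seek(pos)
                f.write(struct.pack("<I", patches[key]))
                f.seek(pos + 4)  # continue with the next key/value pair
            found[key] = (old, patches[key])
        else:
            skip_value(f, vtype)
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "r+b") as fh:
        for key, (old, new) in patch_uint32_fields(fh, PATCHES).items():
            print(f"{key}: {old} -> {new}")
```

only the 4 value bytes change, so nothing after them in the file shifts and the tensor data stays where it was. that is why it's so much faster than re-converting.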
stuff to know
- reasoning model: outputs contain <think>...</think> blocks. handle them in your wrapper or strip them
- no MTP speedup: weights aren't in the public release. inference works fine, you just don't get the speculative-decoding bonus
- text only: the base model's config.json has vision_config + image/video token slots, but llama.cpp's qwen35moe converter is text-only (PR #19468 literally titled "no vision"). if you want vision, look at quants that ship an mmproj-*.gguf alongside
- Q3 means ~1-3% benchmark drop vs Q4: for chat and tool-calling i can't tell the difference. for code/math it'll be more noticeable
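for the reasoning blocks, a wrapper can be as small as this. it's a sketch that assumes each block is a well-formed `<think>...</think>` pair. `split_reasoning` is a made-up name, and the sample output is hand-written, not real model output.

```python
import re

# Non-greedy match so several separate <think> blocks are handled one by one.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(text):
    """Return (visible_answer, reasoning_blocks) for one model response."""
    reasoning = [block.strip() for block in THINK_BLOCK.findall(text)]
    answer = THINK_BLOCK.sub("", text).strip()
    return answer, reasoning


raw = "<think>\nuser wants a capital city.\n</think>\n\nParis."
answer, reasoning = split_reasoning(raw)
print(answer)
print(reasoning)
```

if generation gets cut off mid-thought (for example by a low `-n`), the closing tag is missing and this regex leaves the text alone, so a real wrapper probably wants a fallback for that case.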
how i made it
huggingface-cli download nex-agi/Nex-N2-mini --local-dir source
python convert_hf_to_gguf.py source --outtype f16
python patch_gguf.py source/*-F16.gguf
llama-imatrix -m <f16.gguf> -f calibration.txt -o imatrix.dat --chunks 50
llama-quantize --imatrix imatrix.dat <f16.gguf> Nex-N2-mini-IQ3_XXS.gguf IQ3_XXS
calibration text = Pride and Prejudice from gutenberg + the nex-n2 README (~750 KB total, 50 chunks of 512 tokens). imatrix-aware quantizer kept attention tensors at Q4_K and pushed expert FFN weights down to IQ3 — ended up at 3.14 bpw avg.
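quick sanity check on the bpw number: bits per weight is just file bits divided by parameter count. the sketch below assumes "13.6 GB" means decimal gigabytes and uses the rounded ~35B total from the architecture table, so expect it to land near 3.14 rather than exactly on it.

```python
def bits_per_weight(file_bytes, n_params):
    """Average number of bits stored per model parameter."""
    return file_bytes * 8 / n_params


file_bytes = 13.6e9  # assumption: decimal GB, as listed in the file table
n_params = 35e9      # assumption: "~35B" rounded, real count may differ a bit

bpw = bits_per_weight(file_bytes, n_params)
headroom_gb = 15 - file_bytes / 1e9  # what is left of a 15 GB cap for context

print(f"{bpw:.2f} bits per weight")
print(f"{headroom_gb:.1f} GB left for KV cache and buffers")
```

the small gap to the reported 3.14 comes from rounding the parameter count and from the file also holding metadata and the tokenizer, not just weights.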
imatrix.dat is in the repo if you want to re-quantize to IQ2_S, Q4_K_S, or anything else without redoing the calibration pass.
base model © Nex AGI · base architecture © Qwen team · llama.cpp © ggml-org. apache 2.0, same as the base.