Nex-N2-mini GGUF (imatrix, fixed chat template)

Imatrix-calibrated GGUF quantizations of nex-agi/Nex-N2-mini for llama.cpp — with a fixed chat template so reasoning extraction and tool calling work out of the box (see below).

Nex-N2-mini is a 35B-total / ~3B-active MoE (256 experts, 8 active) with hybrid linear attention, vision input, and "Agentic Thinking" adaptive reasoning. Apache 2.0.

Looking for Blackwell-optimized files? See LibertAIDAI/Nex-N2-mini-NVFP4-GGUF — NVFP4 expert tensors with native tensor-core kernels on RTX 50-series / B100/B200, faster batched serving than Q4_K_M on those GPUs.

Why these quants? Fixed chat template

The upstream chat template prefills the assistant turn with '<think>' (no trailing newline) while rendering past assistant reasoning as '<think>\n…'. This inconsistency breaks llama.cpp's reasoning parser: the forced-open think block is never recognized, so the full chain-of-thought (plus a stray </think>) leaks into content instead of reasoning_content — on every llama.cpp build, regardless of --reasoning-format. Community GGUFs that embed the upstream template inherit this bug.

These files embed a corrected template (one added newline). With stock llama-server --jinja:

  • reasoning_content / content are separated correctly,
  • tool calls parse into structured tool_calls,
  • no extra flags needed.

Template fix: broken vs fixed API responses

All quants below (except Q8_0, which doesn't use it) were quantized with an importance matrix computed from the BF16 weights over a diverse ~64k-token calibration set (the imatrix file is included in this repo).

About LibertAI

LibertAI is a decentralized AI platform — private inference, an OpenAI-compatible API, and a chat UI, all running on community GPUs over Aleph Cloud instead of a single company's servers. No accounts required to chat, no logs sent home, and the same models you'd self-host are available behind a sovereign endpoint.

If you want to put this model (or any other) to work as an autonomous agent without running your own infrastructure, check out LiberClaw — Hermes-style agents hosted on Aleph Cloud with LibertAI inference. Free tier: 2 agents, no credit card, 5 minutes to deploy. Open source.

Files

File Size When to pick
Nex-N2-mini-IQ4_XS.gguf 18.7 GB Smallest — fits a 24 GB GPU with long context
Nex-N2-mini-Q4_K_M.gguf 21.2 GB Recommended — best size/quality balance
Nex-N2-mini-Q5_K_M.gguf 24.7 GB Higher quality, still fits 32 GB GPUs
Nex-N2-mini-Q6_K.gguf 28.5 GB Near-lossless
Nex-N2-mini-Q8_0.gguf 36.9 GB Highest quality (needs >32 GB VRAM or partial offload)
mmproj-Nex-N2-mini-F16.gguf 903 MB Required for image input — works with all of the above
Nex-N2-mini.imatrix 192 MB The importance matrix used (for making your own quants)

Usage

Text-only (CLI)

llama-cli -m Nex-N2-mini-Q4_K_M.gguf -ngl 999 -c 8192 -p "Your prompt here"

Multimodal (server, vision + text)

llama-server \
  -m Nex-N2-mini-Q4_K_M.gguf \
  --mmproj mmproj-Nex-N2-mini-F16.gguf \
  -ngl 999 -c 32768 --jinja \
  --host 0.0.0.0 --port 8080

Then POST to /v1/chat/completions — reasoning arrives in reasoning_content, answers in content, tool calls in tool_calls. To disable thinking, set chat_template_kwargs: {"enable_thinking": false} in the request.

About the architecture

Nex-N2-mini is built on the Qwen3.5-MoE architecture (qwen35moe in GGUF): 40 layers, 3 of every 4 using linear attention with every 4th full attention, 256 routed experts (8 active) plus a shared expert. The upstream config declares a 1-layer MTP head, but the published checkpoints do not include MTP weights, so no MTP/speculative variant can be produced from public weights.

Sources & credits

  • Base model: nex-agi/Nex-N2-mini by Nex AGI — Apache 2.0
  • Calibration data for the imatrix: bartowski's calibration_datav3
  • Tooling: llama.cpp convert_hf_to_gguf.py, llama-imatrix, llama-quantize

License

Apache 2.0, inherited from the upstream model.

Downloads last month
442
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for LibertAIDAI/Nex-N2-mini-GGUF

Quantized
(36)
this model