APEX Vision Agentic ⚠ Template Fix

Nex-N2-mini

📖 中文文档

Agentic Vision MoE — APEX Quantized GGUF (Stock llama.cpp)

⚠️ Temporary Workaround — Not Official

👉 Want the unmodified, recommended version? Go to SC117/Nex-N2-mini-APEX-GGUF (use with Nex's patched llama.cpp)

This is a temporary, unofficial workaround for stock llama.cpp users. The original chat_template.jinja has been replaced with a fixed version so that --reasoning-format works without --chat-template-file.

⚠️ The Nex team explicitly recommends against modifying the chat template. The model was trained strictly on the original template — deviating from the training-time format may degrade output quality. See discussion #3 for details.

The recommended approach is to use Nex's patched llama.cpp with the unmodified GGUF. Once Nex's upstream patch is merged into stock llama.cpp, these template-fixed GGUFs will be superseded.

Use this only if you cannot use Nex's patched llama.cpp and need thinking mode to work on stock builds. Be aware that output quality may differ from the original model.

💡 What is APEX?

These GGUF files are quantized using APEX, a MoE-aware mixed-precision quantization technique that outperforms standard quantization methods while being significantly smaller.

APEX beats Q8_0 perplexity at half the size — and even beats F16.

APEX classifies every tensor by its role — routed expert, shared expert, or attention — and applies a layer-wise precision gradient, giving the most sensitive edge layers higher precision and compressing the redundant middle layers more aggressively.

📦 Available Files
FileSizeBPWNote
Nex-N2-mini.BF16.gguf64.6 GB16.0Full precision reference
Nex-N2-mini-APEX-I-Quality.gguf21.3 GB5.23Highest quality, best accuracy
Nex-N2-mini-APEX-I-Balanced.gguf23.6 GB5.85Best all-rounder, recommended
Nex-N2-mini-APEX-I-Compact.gguf15.4 GB3.81Best quality/size ratio, 16GB VRAM
mmproj-Nex-N2-mini.F16.gguf858 MB-Vision projector (required for image/video)
original-chat-template.jinja7.9 KB-Original unmodified template — for reference / use with Nex's patched llama.cpp

⚠ All GGUF files above (except BF16 and mmproj) contain a modified chat_template.jinja. See warning above.

🧠 Model Details
ArchitectureQwen3.5 MoE (GatedDeltaNet + Full Attention) + Vision Encoder
Parameters35B total, 3B active per token
Experts256 routed experts, 8 active per token
Layers40 layers (30 linear_attn + 10 full_attn)
Context262,144 tokens
VisionImage + Video support (mmproj 858MB)
ThinkingQwen3-style think tags — works on stock llama.cpp via modified template
🚀 Usage (Stock llama.cpp)

Text only

./llama-server \ -m Nex-N2-mini-APEX-I-Quality.gguf \ -ngl 99 -ncmoe 19 -c 32768 \ --host 0.0.0.0 --port 8081

With vision

./llama-server \ -m Nex-N2-mini-APEX-I-Quality.gguf \ --mmproj mmproj-Nex-N2-mini.F16.gguf \ -ngl 99 -ncmoe 19 -c 32768 \ --host 0.0.0.0 --port 8081

No --chat-template-file needed — the fixed template is embedded in the GGUF. Thinking mode works out of the box. Add --mmproj mmproj-Nex-N2-mini.F16.gguf for vision. Replace Nex-N2-mini-APEX-I-Quality.gguf with your preferred quantization tier (I-Quality / I-Balanced / I-Compact). Recommended sampling: temperature 0.7, top_p 0.95, top_k 40, min_p 0.

📋 Original Model Benchmarks
BenchmarkScoreCategory
BrowseComp74.1Agent
SWE-Bench Verified74.4Coding
Terminal-Bench 2.160.7Coding
GPQA Diamond82.6Reasoning
IFEval89.1Instruction

From the original Nex-N2-mini model card (BF16, full precision).

Links

Downloads last month
586
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SC117/Nex-N2-mini-template-fix-APEX-GGUF

Quantized
(49)
this model