Nex-N2-mini — Turbo Phase Twin
One file. Two precision phases. Flip --sm.
A single GGUF of Nex-N2-mini — the Qwen3.5-35B-A3B MoE (~3B active, multimodal reasoning) by the Nex-AGI team — that carries two expert precisions at once and hot-swaps between them at load with one flag:
| --sm | phase | GPUs | experts | resident | decode (B70) | accuracy |
|---|---|---|---|---|---|---|
| off | IQ3 (default) | 1× 16 GB | IQ3_A770 3.19 bpw (codebook-free) | ~15.0 GiB | ~82 t/s | KLD 0.0547 / top-1 89.9% |
| on | Q4 | 2× 16 GB | Q4_K 4.5 bpw, split -sm layer | ~18.8 GiB | bandwidth-positive¹ | KLD 0.0245 / top-1 93.2% (near-lossless) |
The two phases are twins in one ~34 GiB file: the loader keeps only the phase you select
and reads only its bytes, so single-card users pay for ~15 GiB of VRAM and two-card users pay
for the Q4 split, never both. --sm off runs the whole model on one 16 GB Arc A770-class card
at ~82 tok/s (measured on an Arc Pro B70 used as an A770 proxy; see Speed below).
⚠️ This model requires a custom llama.cpp build (codebook-free IQ3_A770 type + the multi-precision loader). Stock llama.cpp cannot load it; it will report an unknown tensor type. The build is one patch on a pinned base; see Install below. Everything you need is in this repo.
What it is
Nex-N2-mini is a 256-expert MoE (8 active/token, ~34.7 B total / ~3 B active, hybrid
linear-attention, multimodal). Experts are 92.9% of the weights, so the whole speed/size/
accuracy story is about how the experts are quantized.
This "Phase Twin" packages two answers in one file:
- IQ3 phase: a bespoke codebook-free 3-bit expert quant (IQ3_A770, 3.19 bpw) built to fit the entire model on a single Arc A770 16 GB and decode fast through a fused, reordered dp4a MoE path. This is the headline: a 35B-class MoE, one consumer card, ~82 t/s.
- Q4 phase: the same experts at Q4_K 4.5 bpw, meant to be split across two cards (-sm layer). It is near-lossless (within 0.24% perplexity of the Q6_K reference) and faster per card, because splitting halves the bytes each GPU reads per token.
You pick the phase with --sm; nothing is re-downloaded or re-quantized.
¹ Why two cards is also faster, not just more accurate. Decode reads the active experts
(8 of 256) each token; split across two cards the per-card bytes ≈ (active/2) × bpw. The
break-even vs the one-card 3.19 bpw base is ~6.4 bpw — so Q4_K (4.5) reads fewer bytes per
card than the 1-card 3-bit base: the split buys accuracy and relieves decode bandwidth.
(Single-card B70 decode is measured; two-card throughput is bandwidth-projected — validate
on real 2-GPU hardware.)
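The footnote's break-even figure can be reproduced with a few lines of arithmetic. This is a minimal sketch under the footnote's own assumptions: decode is bound by expert-weight reads, and an even two-card split halves the active-expert bytes each card reads. The function names are illustrative and not part of this repo. It counts bytes per card only; whether that turns into tokens per second depends on how the two cards overlap their work, which is why the two-card figure is marked as projected.

```python
# Per-card expert read cost, expressed as "effective bits per weight".
# Assumption (from the footnote): each of n cards reads 1/n of the
# active-expert bytes per token.

IQ3_BPW = 3.19   # one-card IQ3_A770 phase
Q4K_BPW = 4.5    # two-card Q4_K phase


def per_card_bpw(bpw: float, n_cards: int) -> float:
    """Effective bits read per expert weight on each card."""
    return bpw / n_cards


def break_even_bpw(base_bpw: float, n_cards: int) -> float:
    """Highest bpw that, split over n_cards, still reads no more per card
    than the one-card base does."""
    return base_bpw * n_cards


print(f"break-even for 2 cards: {break_even_bpw(IQ3_BPW, 2):.2f} bpw")
print(f"IQ3 on 1 card : {per_card_bpw(IQ3_BPW, 1):.2f} bpw per card")
print(f"Q4_K on 2 cards: {per_card_bpw(Q4K_BPW, 2):.2f} bpw per card")
```

Since 4.5 bpw is below the 2 × 3.19 = 6.38 bpw break-even, each card in the split reads fewer expert bytes per token than the single card does in the IQ3 phase.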
Quality (KLD vs a Q6_K reference, wikitext-2, 100×512 tok)
| phase | bpw | mean KLD ↓ | top-1 ↑ | PPL(Q)/PPL(ref) |
|---|---|---|---|---|
| IQ3 (1-card) | 3.19 | 0.0547 | 89.91% | 1.0163 |
| Q4 (2-card) | 4.5 | 0.0245 | 93.24% | 1.0024 |
The Q4 phase more than halves KLD (−55%) and sits within 0.24% PPL of the full Q6_K base — effectively lossless — while top-1 agreement rises 89.9% → 93.2%.
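For readers unfamiliar with the two metrics in the table, the sketch below shows their definitions on toy data: mean KL divergence of the quantized model's next-token distribution from the reference's, and the fraction of positions where both pick the same top token. It is pure Python for illustration only, not the tool that produced the numbers above, and the logit values are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) for one token position, in nats."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def compare(ref_batch, test_batch):
    """Return (mean KLD, top-1 agreement) over a batch of positions."""
    klds = [kl_divergence(r, t) for r, t in zip(ref_batch, test_batch)]
    same = sum(
        r.index(max(r)) == t.index(max(t)) for r, t in zip(ref_batch, test_batch)
    )
    return sum(klds) / len(klds), same / len(klds)


# Toy example: two positions, three-token vocabulary (values are invented).
ref = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
quant = [[2.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
mean_kld, top1 = compare(ref, quant)
print(f"mean KLD = {mean_kld:.4f}, top-1 agreement = {top1:.0%}")
```

Lower mean KLD and higher top-1 agreement both mean the quantized model's predictions track the Q6_K reference more closely.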
Speed (Intel Arc Pro B70, A770-proxy — measured, single-card)
The 1-card IQ3 decode rests on a fused, reordered dp4a MoE path. Measured back-to-back, same binary, prefill-matched (so it's the kernel, not scheduler contention):
| reorder path | pp512, t/s (control) | tg128, t/s |
|---|---|---|
| dormant (fallback) | 599.7 | 43.71 ± 0.04 |
| live (this build) | 600.2 | 82.3 ± 0.05 |
+88% decode from the fused reorder stack, reproducible to 0.2%.
Install
You need llama.cpp built with the SYCL backend (Intel oneAPI) and this repo's single patch. One base commit, one patch — the result is the exact validated code.
# 0) Intel oneAPI (icpx + MKL/DPCPP) installed and on PATH:
source /opt/intel/oneapi/setvars.sh
# 1) llama.cpp at the pinned base + this repo's patch
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout f0156d1401500512ad85042ccf38970568b12253
git apply /path/to/llama.cpp-turbo-phase-twin.patch # from this repo's build/
# 2) build (SYCL)
cmake -B build -DGGML_SYCL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx
cmake --build build -j --target llama-server llama-cli llama-bench
build/build.sh does all of the above in one shot. Full methodology, the per-feature patch
stack (0001–0010), and reproducible benchmarks live at
https://github.com/newjordan/NexN2-B70-Turbo.
Operation
Use the included launcher (auto-detects GPU count) — it picks the phase and adds the right flags for you:
./run-nx2.sh --sm auto # 2+ GPUs → Q4 split; else 1-card IQ3
./run-nx2.sh --sm off -- -p "Hello" -n 64 # force 1-card IQ3
NX2_TOOL=llama-server ./run-nx2.sh --sm on --port 8090 # force 2-card Q4
Or drive llama.cpp directly:
# IQ3 phase (default, 1 card)
llama-server -m Nex-N2-mini-Turbo-Phase-Twin.gguf -ngl 99 --port 8090 --jinja
# Q4 phase (2 cards) — select variant 1 + split layer-wise
llama-server -m Nex-N2-mini-Turbo-Phase-Twin.gguf -ngl 99 -sm layer -ts 1,1 \
--override-kv general.tensor_variant.default=int:1 --port 8090 --jinja
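If you launch the server from your own tooling rather than run-nx2.sh, the two command lines above differ only in the split flags and the variant override. The sketch below assembles them in Python. The flags and values are copied from the commands above; the helper names and the "iq3"/"q4" labels are hypothetical, and this is not the launcher's actual implementation.

```python
import shlex

MODEL = "Nex-N2-mini-Turbo-Phase-Twin.gguf"


def pick_phase(gpu_count: int) -> str:
    """Mirror of the documented '--sm auto' rule: 2+ GPUs -> Q4, else IQ3."""
    return "q4" if gpu_count >= 2 else "iq3"


def server_args(phase: str, model: str = MODEL, port: int = 8090) -> list:
    """Build the llama-server argument list for one phase."""
    args = ["llama-server", "-m", model, "-ngl", "99"]
    if phase == "q4":
        # Two cards: split layer-wise and select tensor variant 1 (Q4_K).
        args += [
            "-sm", "layer", "-ts", "1,1",
            "--override-kv", "general.tensor_variant.default=int:1",
        ]
    elif phase != "iq3":
        raise ValueError(f"unknown phase: {phase!r}")
    args += ["--port", str(port), "--jinja"]
    return args


# Print the command that would be run on a two-GPU machine.
print(shlex.join(server_args(pick_phase(2))))
```

The returned list can be passed to subprocess.run as-is, which avoids shell-quoting problems with the override value.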
OpenAI-compatible endpoint at http://127.0.0.1:8090/v1. Nex-N2-mini is a reasoning
model — it emits a <think> trace, so allow generous max_tokens. Native recommended
sampling (per Nex-AGI): temperature 0.7, top_p 0.95, top_k 40. For vision, add
--mmproj mmproj-f16.gguf.
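A minimal client for the endpoint above, using only the Python standard library, might look like the following. It applies the recommended sampling values and removes an inline <think>…</think> block from the reply. Several details are assumptions: the max_tokens value of 4096 is only an example of "generous", the model field is a placeholder, top_k is a llama.cpp extension rather than part of the OpenAI schema, and depending on the server's reasoning settings the trace may arrive in a separate response field instead of inline.

```python
import json
import re
import urllib.request

BASE_URL = "http://127.0.0.1:8090/v1"  # endpoint from the commands above


def build_payload(prompt: str, max_tokens: int = 4096) -> dict:
    """Chat request with the sampling values recommended by Nex-AGI."""
    return {
        "model": "Nex-N2-mini-Turbo-Phase-Twin",  # placeholder name (assumption)
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,               # llama.cpp extension field (assumption)
        "max_tokens": max_tokens,  # leave room for the reasoning trace
    }


def strip_think(text: str) -> str:
    """Drop any inline <think>...</think> trace and surrounding whitespace."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def chat(prompt: str) -> str:
    """Send one user message to the local server and return the final answer."""
    request = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return strip_think(body["choices"][0]["message"]["content"] or "")
```

With a server running, chat("What is the capital of France?") returns the answer text without the reasoning trace.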
How the twin works (one file, one selected phase)
The GGUF stores each expert tensor twice: the canonical IQ3 tensor and a <name>.v1 Q4_K
sibling. The patched loader reads general.tensor_variant.default (default 0 = IQ3;
override to 1 for Q4_K), keeps that variant, drops the other before allocation, and
renames the survivor to the canonical name — so the rest of llama.cpp sees an ordinary
single-precision model and only the selected phase's bytes are loaded. No-op on normal
single-precision GGUFs.
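The keep, drop, and rename steps described above can be sketched on a plain dictionary of tensor names. This is an illustration of the selection logic only, not the patched loader's code. It assumes the variant sibling is named by appending ".v1" to the full canonical tensor name, and the example names and payloads are made up.

```python
SUFFIX = ".v1"  # assumed naming of the Q4_K sibling: "<name>.v1"


def select_variant(tensors: dict, variant: int) -> dict:
    """Keep one variant of every twinned tensor under its canonical name.

    variant 0 keeps the canonical (IQ3) tensor, variant 1 keeps the '.v1'
    (Q4_K) sibling. Tensors without a sibling pass through untouched.
    """
    if variant not in (0, 1):
        raise ValueError("variant must be 0 or 1")
    twinned = {name[: -len(SUFFIX)] for name in tensors if name.endswith(SUFFIX)}
    selected = {}
    for name, payload in tensors.items():
        if name.endswith(SUFFIX):
            if variant == 1:
                # Survivor is renamed to the canonical name.
                selected[name[: -len(SUFFIX)]] = payload
        elif name in twinned:
            if variant == 0:
                selected[name] = payload
        else:
            selected[name] = payload  # non-expert tensor, single copy
    return selected


# Illustrative tensor table: one twinned expert tensor, one shared tensor.
table = {
    "blk.0.ffn_gate_exps.weight": "IQ3_A770 data",
    "blk.0.ffn_gate_exps.weight.v1": "Q4_K data",
    "blk.0.attn_norm.weight": "Q6_K data",
}
print(select_variant(table, 0))
print(select_variant(table, 1))
```

Either call yields a table with only canonical names, which is why the rest of the engine needs no knowledge of the twin layout. On a file with no ".v1" tensors the function returns its input unchanged.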
Elastic: auto-fit any VRAM (one command)
--sm picks one of two global phases. The same file is also elastic: it carries a
per-tensor importance ranking (general.tensor_variant.promote_order), and the loader can
solve a VRAM-budget knapsack — start every expert at IQ3 and promote the most important
ones to Q4_K, by imatrix-importance-per-byte, until the footprint fits your card. The mix it
picks sits on a convex quality curve: spending an A770's spare ~0.9 GiB of single-card
headroom closes ~34 % of the IQ3→Q4_K KLD gap, on one card, same kernels, no re-download.
# elastic at load time — precision dialed to a budget (MiB of weights)
llama-cli -m Nex-N2-mini-Turbo-Phase-Twin.gguf \
--override-kv general.tensor_variant.budget_mb=int:15900 -p "Hello" -n 64
# OR one command: detect VRAM → pick budget → AUTOPRUNE to a single right-sized file
./install-nx2.sh # autodetect
./install-nx2.sh --vram-gib 16 # force A770 16 GB → ~15.4 GiB fitted model, excess dropped
./install-nx2.sh --keep-dual # keep the elastic dual; tune per-run instead
install-nx2.sh calls prune-dual.py, which runs the same selection as the loader and
writes a plain single-precision GGUF — so you can shrink the 34 GiB file to exactly what your
card needs and drop the unused bytes. budget_mb absent/0 → ordinary --sm behavior.
The in-engine --override-kv budget_mb path needs no Python. The autoprune tool reads the file with this project's IQ3_A770-aware gguf-py (the one your build.sh patched); if it isn't at ~/llama.cpp, point at it: LLAMA_CPP=/path/to/llama.cpp ./install-nx2.sh.
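The budget selection described in this section (start everything at IQ3, then promote by importance per byte until the budget is spent) can be sketched as a greedy pass. This is a simplified illustration, not the loader's or prune-dual.py's code: the Expert fields, the skip-and-continue rule when a promotion does not fit, and the sizes in the example are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    name: str
    iq3_bytes: int     # size at IQ3
    q4_bytes: int      # size at Q4_K (assumed larger than iq3_bytes)
    importance: float  # e.g. an imatrix-derived score


def plan_promotions(experts, fixed_bytes, budget_bytes):
    """Greedy budget fit: return (names promoted to Q4_K, final footprint)."""
    total = fixed_bytes + sum(e.iq3_bytes for e in experts)
    if total > budget_bytes:
        raise ValueError("budget is smaller than the all-IQ3 footprint")
    # Rank by importance gained per extra byte spent on the promotion.
    ranked = sorted(
        experts,
        key=lambda e: e.importance / (e.q4_bytes - e.iq3_bytes),
        reverse=True,
    )
    promoted = []
    for e in ranked:
        extra = e.q4_bytes - e.iq3_bytes
        if total + extra <= budget_bytes:  # skip promotions that do not fit
            total += extra
            promoted.append(e.name)
    return promoted, total


# Made-up sizes: three expert tensors, room for two promotions.
experts = [
    Expert("a", 100, 140, 8.0),
    Expert("b", 100, 140, 2.0),
    Expert("c", 100, 140, 4.0),
]
print(plan_promotions(experts, fixed_bytes=50, budget_bytes=430))
```

Because the ranking is fixed ahead of time, the same order can serve any budget: a larger card simply walks further down the list.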
Files
| file | what |
|---|---|
| Nex-N2-mini-Turbo-Phase-Twin.gguf | the model, both phases, ~34 GiB (loads only the selected phase) |
| mmproj-f16.gguf | vision projector (optional, for multimodal) |
| run-nx2.sh | --sm launcher (off=1-card IQ3, on=2-card Q4) |
| install-nx2.sh | one-command: detect VRAM → pick budget → autoprune → fitted launcher |
| prune-dual.py | autoprune the dual to a single right-sized GGUF for a VRAM budget |
| build/llama.cpp-turbo-phase-twin.patch | the single build patch (applies on base f0156d140) |
| build/build.sh | clone + checkout + apply + build, one command |
| build/BUILD.md | build notes & troubleshooting |
Provenance & changes (Apache-2.0 §4(b))
- Base model: nex-agi/Nex-N2-mini (Apache-2.0), post-trained on Qwen/Qwen3.5-35B-A3B-Base (Apache-2.0).
- Changes: converted to GGUF; experts quantized two ways, as a codebook-free 3-bit type (IQ3_A770) and as Q4_K, and packed as a dual-variant ("twin") GGUF; non-expert tensors at Q6_K. Both quant sets use a llama.cpp importance matrix (imatrix) computed from Bartowski's calibration_datav3. GGUF metadata qwen35moe.block_count=40 and qwen35moe.nextn_predict_layers=0 set so the model loads in llama.cpp (the MTP/NextN head is speculative-only and absent from the checkpoint, so this is lossless for standard inference).
- The weights are quantizations of the original; no other modifications.
Credits & dependencies
This model is Nex-N2-mini, built by the
Nex-AGI team. The architecture, training, reasoning, and multimodal
capability are entirely theirs — this project only quantizes their model to run fast on Intel
Arc. All the intelligence here is Nex-AGI's, and sincere thanks to them for releasing a
genuinely capable open MoE. Full attribution in NOTICE; the chain in brief:
- ⭐ Nex-N2-mini © Nex-AGI (Apache-2.0) — the model these quants are built from, including the multimodal vision projector shipped here. → nex-agi.com · GitHub · HuggingFace
- Qwen3.5-35B-A3B-Base © Qwen Team / Alibaba (Apache-2.0) — the foundation model Nex-N2-mini was post-trained on.
- llama.cpp / ggml © the ggml authors (MIT) — the inference engine, quant framework, and SYCL backend. The included build patch is a derivative work of it (kept under MIT).
- k-quants by Iwan Kawrakow (ikawrakow) & the ggml authors (in llama.cpp, MIT): IQ3_A770 reuses Q3_K's superblock bit-packing; Q6_K covers non-expert tensors.
- NormalFloat / NF4 (Dettmers et al., QLoRA, 2023): design influence for the non-uniform, zero-centered code points; the table & codebook-free packing are this project's.
- calibration_datav3 by Bartowski: the imatrix calibration text.
- WikiText-2 (Merity et al., 2016; CC BY-SA 3.0): used only to measure PPL/KLD.
License & attribution
Released under the Apache License 2.0, inherited from the base model; this project's own
code (patch, launcher, build scripts) is MIT. Retain the attribution above. See
NOTICE for the complete chain.
@misc{qwen3.5,
title = {{Qwen3.5}: Towards Native Multimodal Agents},
author = {{Qwen Team}},
month = {February},
year = {2026},
url = {https://qwen.ai/blog?id=qwen3.5}
}
Nex-AGI's Nex-N2-mini (https://github.com/nex-agi/Nex-N2) is the model; this project is just a kernel optimization for efficiency.