Nex-N2-mini NVFP4

NVFP4-quantized Nex-N2-mini (Qwen3.5-MoE-35B fine-tune) optimized for NVIDIA Blackwell (SM 12.1) serving via vLLM with FlashInfer CUTLASS kernels.

3.2× compression (70 GB BF16 → 22.1 GiB NVFP4) with zero quality loss on capability tests.

Quick Start (Docker)

The easiest way to serve this model — auto-downloads on first run:

docker run -d --name nex-n2-mini-nvfp4 \
  --gpus all \
  --shm-size=8g \
  -e HF_TOKEN=hf_xxxxx \
  -v nex-n2-model:/mnt/model \
  -p 8000:8000 \
  ghcr.io/r0b0tlab/nex-n2-mini-nvfp4:latest

Then query:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"r0b0tlab/nex-n2-mini-nvfp4","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'

Docker Compose:

echo "HF_TOKEN=hf_xxxxx" > .env
docker compose up -d

See the GitHub repo for full documentation, environment variables, and AGENTS.md.

Model Details

Property	Value
Base model	Nex-AGI/Nex-N2-mini
Architecture	Qwen3_5MoeForConditionalGeneration
Parameters	35B total / 3B active (MoE, 256 experts, top-8 routing)
Layers	40 (30 linear attention + 10 full attention)
Vocabulary	248,320 tokens
Vision encoder	ViT (27 blocks, 1152 hidden) — kept BF16
Original size	~70 GB (BF16)
NVFP4 size	~22.1 GiB

Quantization

Method: NVFP4 via NVIDIA ModelOpt 0.44.0
Scope: MLP-only (expert projections + shared expert)
Group size: 16
Calibration: 128 samples from CNN/DailyMail
KV cache: FP8 e4m3 with per-layer calibrated scales
Expert calibration: 10,240/10,240 (100%)

Included in checkpoint:

NVFP4 quantized MoE weights (256 experts × 40 layers)
BF16 attention weights (self_attn, linear_attn)
BF16 vision encoder weights
BF16 lm_head.weight
Calibrated FP8 KV cache scales (k: 0.016–0.038, v: 0.010–0.040, q: 0.043)

Benchmarks

Tested on NVIDIA GB10 (Blackwell SM 12.1), vLLM v0.22.0, FlashInfer CUTLASS NVFP4.

Throughput (llama-benchy, 3 runs per test)

Test	Throughput	Peak t/s
pp2048	1,974 t/s	—
tg128	33.35 t/s	38.33
pp2048 @ d4096	4,007 t/s	—
pp2048 @ d8192	4,793 t/s	—
pp2048 @ d16384	5,017 t/s	—

Decode is rock-stable at 32–33 t/s across all context depths (0–16K). Only 2.8% degradation.

Concurrency Scaling

Concurrency	Aggregate t/s	Per-request t/s	Power	Temp
C1	28.5	28.6	20.4 W	45°C
C2	51.6	25.8	18.4 W	46°C
C4	105.3	26.3	20.2 W	47°C
C8	185.5	23.2	22.1 W	48°C

6.5× scaling at C8. 8.42 t/s/W efficiency. Peak 23.3W at 48°C.

Capability Tests: 13/13 Passed

Math (3/3), Reasoning (3/3), Coding (3/3), Knowledge (2/2), Instruction (2/2).

Serving Stack

vLLM v0.22.0 (Docker, aarch64)
FlashInfer CUTLASS NVFP4 GEMM kernel + MoE backend
FP8 KV cache with per-layer calibrated scales (~10–15% throughput improvement)

Known Issues

max_num_batched_tokens must be >= 2096 (Mamba block alignment)
NVFP4 KV cache unavailable (requires torch.nvfp4, NVIDIA-internal only)
Vision encoder profiling takes ~4–5 min on first startup
Requires NVIDIA Blackwell GPU (SM 12.1) for NVFP4 kernels

License

Apache 2.0.

Downloads last month: 2

Safetensors

Model size

18B params

Tensor type

BF16

F8_E4M3

Model tree for r0b0tlab/nex-n2-mini-nvfp4

Base model

nex-agi/Nex-N2-mini

Finetuned

(2)

this model

r0b0tlab
/

nex-n2-mini-nvfp4