Nex-N2-Pro — OneComp AutoBit sub-4-bit (avg 3.2 bits/weight)

A sub-4-bit, mixed-precision (2/3/4/8-bit) quantization of nex-agi/Nex-N2-Pro (a ~397B-parameter MoE model, ~17B active, derived from Qwen3.5-397B-A17B).

The weights average 3.2 bits/weight (~3.35 bpw effective incl. group overhead), shrinking the model from 740 GB (BF16) to **161 GB** so the full model fits and serves on a single workstation with 2×96 GB GPUs.
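As a rough check on these numbers, the sketch below recomputes the effective bits per weight and the resulting footprint. It assumes one 16-bit scale and one zero-point (stored at the weights' own bit width) per group of 128 weights. That is a common GPTQ layout, but this card does not confirm it. The function names and the 397e9 parameter count are illustrative.

```python
GIB = 2**30


def effective_bpw(avg_bits: float, group_size: int = 128, scale_bits: int = 16) -> float:
    """Average bits per weight including per-group metadata.

    Assumption: each group carries one scale of `scale_bits` bits and one
    zero-point packed at the same width as the weights themselves.
    """
    overhead_per_weight = (scale_bits + avg_bits) / group_size
    return avg_bits + overhead_per_weight


def weight_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Storage needed for n_params weights at the given average width."""
    return n_params * bits_per_weight / 8


N_PARAMS = 397e9  # approximate total parameter count, taken from the model name

bpw = effective_bpw(3.2)
print(f"effective bpw: {bpw:.2f}")
print(f"BF16 weights: {weight_size_bytes(N_PARAMS, 16) / GIB:.0f} GiB")
print(f"all weights at {bpw:.2f} bpw: {weight_size_bytes(N_PARAMS, bpw) / GIB:.0f} GiB")
```

Under these assumptions the group overhead comes to 0.15 bits per weight, which matches the 3.35 bpw figure. The size estimate lands in the same range as the reported 161 GB. The exact value depends on which tensors stay in BF16 and on whether GB is read as decimal or binary, neither of which is spelled out here.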

⚠️ This checkpoint uses a custom mixed-bit MoE format and does NOT load with stock vLLM / Transformers / llama.cpp. Serving requires the mixed-bit MoE plugin (mmzz164/OneCompression, MIT) — see How to serve. If you want a plug-and-play quant, use one of the standard GGUF/AWQ builds instead.

TL;DR


| Property | Value |
| --- | --- |
| Base model | nex-agi/Nex-N2-Pro (≈397B MoE, A17B; built on Qwen3.5-397B-A17B) |
| Method | OneComp AutoBit (ILP mixed 2/3/4/8-bit) + QEP + MSE clipping + act_order |
| Avg bits | 3.2 bpw (≈3.35 effective) |
| Size | ~161 GB (41 safetensors shards) |
| Quality (WikiText, EN) | BF16 3.773 → 4.177 PPL (+10.7%) |
| Throughput | ~18.8 tok/s on 2× RTX PRO 6000 (Blackwell) |
| Min hardware | ~192 GB total VRAM (e.g. 2×96 GB) |

Why mixed-precision sub-4-bit?

Uniform 4-bit formats (e.g. NVFP4 ≈ 4.5 bpw effective) put this 397B model at ~220–240 GB, which does not fit in 192 GB of VRAM. Fitting the full model on this hardware therefore requires an average below 4 bits per weight. OneComp AutoBit spends bits where they matter (per-module ILP allocation: 8-bit for sensitive projections, 2-bit for robust or rarely routed experts) and uses GPTQ-family error propagation (QEP) plus MSE clipping and act_order to hold quality at this bit budget.

Quantization details

Bit allocation — OneComp AutoBit solves a per-module mixed-integer allocation over {2,3,4,8}-bit GPTQ quantizers (group size 128) to hit the 3.2-bpw target, weighting by activation-aware error. Guards: an RTN fallback for numerically degenerate layers and a 4-bit floor for non-expert (attention) projections.
Rounding — GPTQ with Quantization Error Propagation (QEP) (Fujitsu Research), MSE range-clipping, and act_order (static-groups variant, so the packed layout stays kernel-compatible).
Post-quant surgery — members of a vLLM fused group are made bit-uniform (low member promoted to 8-bit RTN), and the shared-expert gate/up projections are kept in BF16 (the vLLM loader fuses and skips them otherwise).
Kept in BF16 — vision tower, embeddings, LM head, router gates, norms.
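To make the bit-allocation idea concrete, here is a toy sketch that picks one width from {2,3,4,8} per module so that total error is minimized under an average-bits budget. It is not OneComp's allocator: it uses exhaustive search instead of an ILP solver, the error proxy (sensitivity × parameter count × 4^-bits) is an assumed stand-in for activation-aware error, and the module names, sizes, and sensitivities are made up.

```python
from itertools import product

BIT_CHOICES = (2, 3, 4, 8)


def allocate_bits(modules, target_bpw, floors=None):
    """Pick a bit width per module by brute force.

    modules: list of (name, n_params, sensitivity) tuples (hypothetical values).
    floors:  optional {name: minimum bits}, e.g. a 4-bit floor for attention.
    Returns {name: bits}, or None if no assignment fits the budget.
    """
    floors = floors or {}
    names = [name for name, _, _ in modules]
    budget = target_bpw * sum(n for _, n, _ in modules)
    best_cost, best = float("inf"), None

    for bits in product(BIT_CHOICES, repeat=len(modules)):
        # Enforce per-module minimum widths.
        if any(b < floors.get(name, min(BIT_CHOICES)) for name, b in zip(names, bits)):
            continue
        # Enforce the global bit budget.
        if sum(b * n for (_, n, _), b in zip(modules, bits)) > budget:
            continue
        # Assumed error proxy: quantization noise shrinks about 4x per extra bit.
        cost = sum(s * n * 4.0 ** (-b) for (_, n, s), b in zip(modules, bits))
        if cost < best_cost:
            best_cost, best = cost, dict(zip(names, bits))
    return best


toy_modules = [
    ("attn.q_proj", 1.0, 500.0),  # small but very sensitive
    ("expert.hot", 4.0, 5.0),     # frequently routed expert
    ("expert.cold", 4.0, 0.05),   # rarely routed expert
]
plan = allocate_bits(toy_modules, target_bpw=3.2, floors={"attn.q_proj": 4})
print(plan)
```

With these made-up inputs the search gives the sensitive projection 8 bits and the rarely routed expert 2 bits, which is the qualitative pattern described above. A real allocator would need a solver, since exhaustive search grows as 4^n in the number of modules.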

Evaluation

WikiText perplexity, measured against the BF16 model under the same harness:

| Model | WikiText PPL (EN) | Δ vs BF16 |
| --- | --- | --- |
| BF16 (base) | 3.773 | n/a |
| This model (3.2 bpw) | 4.177 | +10.7% |
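For reference, the relative change in the table can be reproduced from the two perplexity values. The sketch below also shows the usual definition of perplexity as the exponential of the mean per-token negative log-likelihood. The helper names are illustrative, and the evaluation harness used for this card is not shown here.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_change(baseline, quantized):
    """Fractional increase of the quantized value over the baseline."""
    return (quantized - baseline) / baseline


delta = relative_change(3.773, 4.177)
print(f"{delta:+.1%}")  # +10.7%

# The same gap expressed as extra loss per token, in nats.
print(f"{math.log(4.177 / 3.773):.3f} nats/token")
```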

Note: perplexity is a proxy. Downstream task behavior was spot-checked (arithmetic, code, EN/JA translation, "thinking" mode) and looked consistent with the base model, but no formal benchmark suite was run on this quantized checkpoint.

Performance

On 2× RTX PRO 6000 (Blackwell, 96 GB each) with a custom grouped fused-MoE Triton kernel + CUDA graphs: ~18.8 tok/s decode (greedy outputs are bit-identical to the non-graph reference path). Throughput was measured with a tight VRAM config (short max_model_len); longer context trades against the VRAM headroom.

Hardware requirements

~192 GB total VRAM (validated on 2×96 GB Blackwell). The ~161 GB of weights leave only a small KV-cache/activation budget on 192 GB, so context length and concurrency are limited. Cards with more VRAM relax this.
The kernel was built and tested on CUDA 13.0 (Blackwell); other GPUs may need kernel retuning.
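A back-of-the-envelope sketch of the memory budget described above follows. It treats the two GPUs as one pool, which is a simplification: under pipeline parallelism each GPU holds its own share of the weights and its own cache. The function name and the runtime-overhead parameter are illustrative, and the utilization values are examples, not the settings used for the reported throughput.

```python
def kv_headroom_gb(total_vram_gb, weights_gb, gpu_memory_utilization=0.90,
                   runtime_overhead_gb=0.0):
    """Memory left for KV cache and activations after loading the weights.

    gpu_memory_utilization is the fraction of VRAM the server may claim;
    runtime_overhead_gb is a placeholder for CUDA graphs, workspace, etc.
    """
    usable = total_vram_gb * gpu_memory_utilization
    return usable - weights_gb - runtime_overhead_gb


for util in (0.90, 0.95):
    headroom = kv_headroom_gb(192, 161, gpu_memory_utilization=util)
    print(f"utilization {util:.2f}: {headroom:.1f} GB left for KV cache/activations")
```

Even at a high utilization setting only a few tens of GB remain, which is why context length and concurrency are constrained on 2×96 GB.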

How to serve

This is a custom mixed-bit MoE checkpoint. Stock loaders cannot read it. To serve:

1. Install OneComp (MIT), the quantization/serving base.
2. Add the mixed-bit grouped fused-MoE serving plugin from mmzz164/OneCompression. This is the part that actually loads and runs per-expert 2/3/4/8-bit MoE, and it is not in upstream. It consists of mixed_moe.py (the mixed_gptq FusedMoEMethodBase) plus grouped_moe.py and fused_dq_gemm.py (the Triton kernels). See MIXED_BIT_MOE_SERVING.md.
3. Serve with vLLM using pipeline parallelism across your two GPUs and the mixed_gptq quantization, e.g. (illustrative):
```
vllm serve <path-to-this-model> \
  --quantization mixed_gptq \
  --pipeline-parallel-size 2 \
  --trust-remote-code
```
Exact flags (PP layer partition, gpu_memory_utilization, max_model_len, CUDA-graph capture sizes) depend on your VRAM; see MIXED_BIT_MOE_SERVING.md.
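If you script the launch, a small helper can keep the VRAM-dependent flags in one place. The sketch below only assembles the command line. The model path and the numeric values are placeholders rather than tested settings, and `--quantization mixed_gptq` works only once the plugin above is installed.

```python
import shlex


def build_serve_command(model_path, max_model_len, gpu_memory_utilization,
                        pipeline_parallel_size=2):
    """Assemble a `vllm serve` argument list for this checkpoint."""
    return [
        "vllm", "serve", model_path,
        "--quantization", "mixed_gptq",  # provided by the mixed-bit MoE plugin
        "--pipeline-parallel-size", str(pipeline_parallel_size),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--trust-remote-code",
    ]


# Placeholder path and values; tune them to your VRAM.
cmd = build_serve_command("/models/nex-n2-pro-autobit",
                          max_model_len=4096,
                          gpu_memory_utilization=0.95)
print(shlex.join(cmd))
# To launch: subprocess.run(cmd, check=True)
```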

If you only need the model's capabilities and not this specific compression, prefer a standard quantization of the base model.

Limitations

Not a standard format — requires the OneComp mixed_gptq plugin; will not load in vanilla vLLM / Transformers / llama.cpp / Ollama.
High VRAM floor (~192 GB) — out of reach for most single machines.
Vision path not validated — the base is a vision-language model; the vision tower is kept in BF16 but the image→text path was not evaluated after quantization.
Lossy — +10.7% WikiText PPL vs BF16. Within a "quality-preserving" band for general use, but not lossless.

License & attribution

Licensed under Apache 2.0, inherited through the lineage:

Qwen3.5-397B-A17B (Apache 2.0) → nex-agi/Nex-N2-Pro (Apache 2.0) → this quantization (Apache 2.0).

Per Apache 2.0, this is a derivative work with the following significant change: post-training weight quantization to ~3.2 bits/weight using OneComp AutoBit + QEP + MSE clipping + act_order. Please retain attribution to Qwen and Nex-AGI, and propagate any upstream NOTICE files.

This model's capabilities, training, and behavior are entirely those of the base model nex-agi/Nex-N2-Pro; only the weight precision was changed.

Acknowledgements

nex-agi for Nex-N2-Pro.
Alibaba Qwen for the Qwen3.5 foundation.
Fujitsu Research for OneComp / OneCompression (MIT), including the AutoBit allocator and QEP.
Mixed-bit MoE serving kernels: mmzz164/OneCompression (MIT).

Citation

If you use this checkpoint, please cite the base model and OneComp:

@misc{nex-n2-pro-onecomp-3p2bit,
  title  = {Nex-N2-Pro — OneComp AutoBit sub-4-bit (3.2 bpw) quantization},
  note   = {Derivative of nex-agi/Nex-N2-Pro; quantized with Fujitsu OneComp (AutoBit + QEP)},
  year   = {2026}
}
