Configuration Parsing Warning:Config file config.json cannot be fetched (too big)
Nex-N2-Pro โ OneComp AutoBit sub-4-bit (avg 3.2 bits/weight)
A sub-4-bit, mixed-precision (2/3/4/8-bit) quantization of nex-agi/Nex-N2-Pro (a ~397B-parameter MoE model, ~17B active, derived from Qwen3.5-397B-A17B).
The weights average 3.2 bits/weight (~3.35 bpw effective incl. group overhead),
shrinking the model from 740 GB (BF16) to **161 GB** so the full model fits and
serves on a single workstation with 2ร96 GB GPUs.
โ ๏ธ This checkpoint uses a custom mixed-bit MoE format and does NOT load with stock vLLM / Transformers / llama.cpp. Serving requires the mixed-bit MoE plugin (mmzz164/OneCompression, MIT) โ see How to serve. If you want a plug-and-play quant, use one of the standard GGUF/AWQ builds instead.
TL;DR
| Base model | nex-agi/Nex-N2-Pro (โ397B MoE, A17B; built on Qwen3.5-397B-A17B) |
| Method | OneComp AutoBit (ILP mixed 2/3/4/8-bit) + QEP + MSE clipping + act_order |
| Avg bits | 3.2 bpw (โ3.35 effective) |
| Size | ~161 GB (41 safetensors shards) |
| Quality (WikiText, EN) | BF16 3.773 โ 4.177 PPL (+10.7%) |
| Throughput | ~18.8 tok/s on 2ร RTX PRO 6000 (Blackwell) |
| Min hardware | ~192 GB total VRAM (e.g. 2ร96 GB) |
Why mixed-precision sub-4-bit?
Uniform 4-bit formats (e.g. NVFP4 โ 4.5 bpw effective) put this 397B model at ~220โ240 GB โ too large for 192 GB of VRAM. Going sub-4-bit is the only way to fit it. OneComp AutoBit spends bits where they matter (per-module ILP allocation: 8-bit for sensitive projections, 2-bit for robust/rarely-routed experts) and uses GPTQ-family error propagation (QEP) plus MSE clipping and act_order to hold quality at this bit budget.
Quantization details
- Bit allocation โ OneComp AutoBit solves a per-module mixed-integer allocation over {2,3,4,8}-bit GPTQ quantizers (group size 128) to hit the 3.2-bpw target, weighting by activation-aware error. Guards: an RTN fallback for numerically degenerate layers and a 4-bit floor for non-expert (attention) projections.
- Rounding โ GPTQ with Quantization Error Propagation (QEP) (Fujitsu Research), MSE range-clipping, and act_order (static-groups variant, so the packed layout stays kernel-compatible).
- Post-quant surgery โ members of a vLLM fused group are made bit-uniform (low member promoted to 8-bit RTN), and the shared-expert gate/up projections are kept in BF16 (the vLLM loader fuses and skips them otherwise).
- Kept in BF16 โ vision tower, embeddings, LM head, router gates, norms.
Evaluation
WikiText perplexity, measured against the BF16 model under the same harness:
| WikiText PPL (EN) | ฮ vs BF16 | |
|---|---|---|
| BF16 (base) | 3.773 | โ |
| This model (3.2 bpw) | 4.177 | +10.7% |
Note: perplexity is a proxy. Downstream task behavior was spot-checked (arithmetic, code, EN/JA translation, "thinking" mode) and looked consistent with the base model, but no formal benchmark suite was run on this quantized checkpoint.
Performance
On 2ร RTX PRO 6000 (Blackwell, 96 GB each) with a custom grouped fused-MoE
Triton kernel + CUDA graphs: ~18.8 tok/s decode (greedy outputs are
bit-identical to the non-graph reference path). Throughput was measured with a tight
VRAM config (short max_model_len); longer context trades against the VRAM headroom.
Hardware requirements
- ~192 GB total VRAM (validated on 2ร96 GB Blackwell). The ~161 GB of weights leave only a small KV-cache/activation budget on 192 GB, so context length and concurrency are limited. Cards with more VRAM relax this.
- Kernel was built/tested on CUDA 13.0 (Blackwell); other GPUs may need kernel retuning.
How to serve
This is a custom mixed-bit MoE checkpoint. Stock loaders cannot read it. To serve:
- Install OneComp (MIT) โ the quantization/serving base.
- Add the mixed-bit grouped fused-MoE serving plugin (the part that actually loads
and runs per-expert 2/3/4/8-bit MoE โ not in upstream) from
mmzz164/OneCompression:
mixed_moe.py(themixed_gptqFusedMoEMethodBase), plusgrouped_moe.pyandfused_dq_gemm.py(the Triton kernels). SeeMIXED_BIT_MOE_SERVING.md. - Serve with vLLM using pipeline-parallel across your two GPUs and the
mixed_gptqquantization, e.g. (illustrative):
Exact flags (PP layer partition,vllm serve <path-to-this-model> \ --quantization mixed_gptq \ --pipeline-parallel-size 2 \ --trust-remote-codegpu_memory_utilization,max_model_len, CUDA-graph capture sizes) depend on your VRAM; seeMIXED_BIT_MOE_SERVING.md.
If you only need the model's capabilities and not this specific compression, prefer a standard quantization of the base model.
Limitations
- Not a standard format โ requires the OneComp
mixed_gptqplugin; will not load in vanilla vLLM / Transformers / llama.cpp / Ollama. - High VRAM floor (~192 GB) โ out of reach for most single machines.
- Vision path not validated โ the base is a vision-language model; the vision tower is kept in BF16 but the imageโtext path was not evaluated after quantization.
- Lossy โ +10.7% WikiText PPL vs BF16. Within a "quality-preserving" band for general use, but not lossless.
License & attribution
Licensed under Apache 2.0, inherited through the lineage:
Qwen3.5-397B-A17B (Apache 2.0) โ nex-agi/Nex-N2-Pro (Apache 2.0) โ this quantization (Apache 2.0).
Per Apache 2.0, this is a derivative work with the following significant change:
post-training weight quantization to ~3.2 bits/weight using OneComp AutoBit + QEP +
MSE clipping + act_order. Please retain attribution to Qwen and Nex-AGI, and
propagate any upstream NOTICE files.
This model's capabilities, training, and behavior are entirely those of the base model nex-agi/Nex-N2-Pro; only the weight precision was changed.
Acknowledgements
- nex-agi for Nex-N2-Pro.
- Alibaba Qwen for the Qwen3.5 foundation.
- Fujitsu Research for OneComp / OneCompression (MIT), including the AutoBit allocator and QEP.
- Mixed-bit MoE serving kernels: mmzz164/OneCompression (MIT).
Citation
If you use this checkpoint, please cite the base model and OneComp:
@misc{nex-n2-pro-onecomp-3p2bit,
title = {Nex-N2-Pro โ OneComp AutoBit sub-4-bit (3.2 bpw) quantization},
note = {Derivative of nex-agi/Nex-N2-Pro; quantized with Fujitsu OneComp (AutoBit + QEP)},
year = {2026}
}
- Downloads last month
- 406
Model tree for aquaman164/Nex-N2-Pro-AutoBit-3.2bpw
Base model
nex-agi/Nex-N2-Pro