Configuration Parsing Warning:Config file config.json cannot be fetched (too big)

Nex-N2-Pro โ€” OneComp AutoBit sub-4-bit (avg 3.2 bits/weight)

A sub-4-bit, mixed-precision (2/3/4/8-bit) quantization of nex-agi/Nex-N2-Pro (a ~397B-parameter MoE model, ~17B active, derived from Qwen3.5-397B-A17B).

The weights average 3.2 bits/weight (~3.35 bpw effective incl. group overhead), shrinking the model from 740 GB (BF16) to **161 GB** so the full model fits and serves on a single workstation with 2ร—96 GB GPUs.

โš ๏ธ This checkpoint uses a custom mixed-bit MoE format and does NOT load with stock vLLM / Transformers / llama.cpp. Serving requires the mixed-bit MoE plugin (mmzz164/OneCompression, MIT) โ€” see How to serve. If you want a plug-and-play quant, use one of the standard GGUF/AWQ builds instead.


TL;DR

Base model nex-agi/Nex-N2-Pro (โ‰ˆ397B MoE, A17B; built on Qwen3.5-397B-A17B)
Method OneComp AutoBit (ILP mixed 2/3/4/8-bit) + QEP + MSE clipping + act_order
Avg bits 3.2 bpw (โ‰ˆ3.35 effective)
Size ~161 GB (41 safetensors shards)
Quality (WikiText, EN) BF16 3.773 โ†’ 4.177 PPL (+10.7%)
Throughput ~18.8 tok/s on 2ร— RTX PRO 6000 (Blackwell)
Min hardware ~192 GB total VRAM (e.g. 2ร—96 GB)

Why mixed-precision sub-4-bit?

Uniform 4-bit formats (e.g. NVFP4 โ‰ˆ 4.5 bpw effective) put this 397B model at ~220โ€“240 GB โ€” too large for 192 GB of VRAM. Going sub-4-bit is the only way to fit it. OneComp AutoBit spends bits where they matter (per-module ILP allocation: 8-bit for sensitive projections, 2-bit for robust/rarely-routed experts) and uses GPTQ-family error propagation (QEP) plus MSE clipping and act_order to hold quality at this bit budget.

Quantization details

  • Bit allocation โ€” OneComp AutoBit solves a per-module mixed-integer allocation over {2,3,4,8}-bit GPTQ quantizers (group size 128) to hit the 3.2-bpw target, weighting by activation-aware error. Guards: an RTN fallback for numerically degenerate layers and a 4-bit floor for non-expert (attention) projections.
  • Rounding โ€” GPTQ with Quantization Error Propagation (QEP) (Fujitsu Research), MSE range-clipping, and act_order (static-groups variant, so the packed layout stays kernel-compatible).
  • Post-quant surgery โ€” members of a vLLM fused group are made bit-uniform (low member promoted to 8-bit RTN), and the shared-expert gate/up projections are kept in BF16 (the vLLM loader fuses and skips them otherwise).
  • Kept in BF16 โ€” vision tower, embeddings, LM head, router gates, norms.

Evaluation

WikiText perplexity, measured against the BF16 model under the same harness:

WikiText PPL (EN) ฮ” vs BF16
BF16 (base) 3.773 โ€”
This model (3.2 bpw) 4.177 +10.7%

Note: perplexity is a proxy. Downstream task behavior was spot-checked (arithmetic, code, EN/JA translation, "thinking" mode) and looked consistent with the base model, but no formal benchmark suite was run on this quantized checkpoint.

Performance

On 2ร— RTX PRO 6000 (Blackwell, 96 GB each) with a custom grouped fused-MoE Triton kernel + CUDA graphs: ~18.8 tok/s decode (greedy outputs are bit-identical to the non-graph reference path). Throughput was measured with a tight VRAM config (short max_model_len); longer context trades against the VRAM headroom.

Hardware requirements

  • ~192 GB total VRAM (validated on 2ร—96 GB Blackwell). The ~161 GB of weights leave only a small KV-cache/activation budget on 192 GB, so context length and concurrency are limited. Cards with more VRAM relax this.
  • Kernel was built/tested on CUDA 13.0 (Blackwell); other GPUs may need kernel retuning.

How to serve

This is a custom mixed-bit MoE checkpoint. Stock loaders cannot read it. To serve:

  1. Install OneComp (MIT) โ€” the quantization/serving base.
  2. Add the mixed-bit grouped fused-MoE serving plugin (the part that actually loads and runs per-expert 2/3/4/8-bit MoE โ€” not in upstream) from mmzz164/OneCompression: mixed_moe.py (the mixed_gptq FusedMoEMethodBase), plus grouped_moe.py and fused_dq_gemm.py (the Triton kernels). See MIXED_BIT_MOE_SERVING.md.
  3. Serve with vLLM using pipeline-parallel across your two GPUs and the mixed_gptq quantization, e.g. (illustrative):
    vllm serve <path-to-this-model> \
      --quantization mixed_gptq \
      --pipeline-parallel-size 2 \
      --trust-remote-code
    
    Exact flags (PP layer partition, gpu_memory_utilization, max_model_len, CUDA-graph capture sizes) depend on your VRAM; see MIXED_BIT_MOE_SERVING.md.

If you only need the model's capabilities and not this specific compression, prefer a standard quantization of the base model.

Limitations

  • Not a standard format โ€” requires the OneComp mixed_gptq plugin; will not load in vanilla vLLM / Transformers / llama.cpp / Ollama.
  • High VRAM floor (~192 GB) โ€” out of reach for most single machines.
  • Vision path not validated โ€” the base is a vision-language model; the vision tower is kept in BF16 but the imageโ†’text path was not evaluated after quantization.
  • Lossy โ€” +10.7% WikiText PPL vs BF16. Within a "quality-preserving" band for general use, but not lossless.

License & attribution

Licensed under Apache 2.0, inherited through the lineage:

Qwen3.5-397B-A17B (Apache 2.0) โ†’ nex-agi/Nex-N2-Pro (Apache 2.0) โ†’ this quantization (Apache 2.0).

Per Apache 2.0, this is a derivative work with the following significant change: post-training weight quantization to ~3.2 bits/weight using OneComp AutoBit + QEP + MSE clipping + act_order. Please retain attribution to Qwen and Nex-AGI, and propagate any upstream NOTICE files.

This model's capabilities, training, and behavior are entirely those of the base model nex-agi/Nex-N2-Pro; only the weight precision was changed.

Acknowledgements

Citation

If you use this checkpoint, please cite the base model and OneComp:

@misc{nex-n2-pro-onecomp-3p2bit,
  title  = {Nex-N2-Pro โ€” OneComp AutoBit sub-4-bit (3.2 bpw) quantization},
  note   = {Derivative of nex-agi/Nex-N2-Pro; quantized with Fujitsu OneComp (AutoBit + QEP)},
  year   = {2026}
}
Downloads last month
406
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for aquaman164/Nex-N2-Pro-AutoBit-3.2bpw

Quantized
(29)
this model