Nex-N2-Pro-397B-A17B-EXL3 Pareto Frontier

Pareto-frontier EXL3 quants for Nex N2 Pro 397B.

Pick the quant with the lowest KL divergence and perplexity (PPL) that fits your hardware. Each quant lives in its own model repository.
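For readers unfamiliar with the metrics, here is a minimal pure-Python sketch of how perplexity, KL divergence, and top-K agreement are commonly computed from per-token outputs. It is not the evaluation code used for this card. The function names are hypothetical, and the reading of Top-K as "the K most likely tokens match in the same order" is an assumption. The card does not state which KL direction each column uses, so `kl_divergence` simply takes its two distributions in argument order.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def top_k_agreement(logits_a, logits_b, k):
    """True if both models rank the same k tokens highest, in the same order (assumed definition)."""
    def rank(logits):
        return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    return rank(logits_a) == rank(logits_b)

def perplexity(token_logprobs):
    """exp of the mean negative log-probability assigned to the reference tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy logits for a single position over a 4-token vocabulary (made-up numbers).
orig_logits = [2.0, 1.0, 0.5, -1.0]
quant_logits = [1.8, 1.2, -0.8, 0.4]

p_orig = softmax(orig_logits)
p_quant = softmax(quant_logits)

kl_a = kl_divergence(p_quant, p_orig)  # one direction
kl_b = kl_divergence(p_orig, p_quant)  # the other direction; generally a different value
agree = [top_k_agreement(orig_logits, quant_logits, k) for k in (1, 2, 3)]
```

In a real evaluation these quantities would be averaged over many token positions. KL divergence is asymmetric, which is why the table reports two values.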

| Quant | Size (GiB) | Size (GB) | bpw | PPL | KL(q→o) | KL(o→q) | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cpral 2bpw | 97 | 106 | 2.04 | 3.623 | 0.1682 | 0.2205 | 87.1% | 59.0% | 34.2% | 18.4% | 9.5% |
| cpral 2.65bpw | 124 | 134 | 2.65 | 3.342 | 0.0685 | 0.0749 | 92.3% | 71.7% | 49.7% | 32.0% | 19.7% |
| cpral 3bpw | 144 | 155 | 3.04 | 3.334 | 0.0491 | 0.0522 | 93.4% | 75.1% | 54.2% | 36.3% | 23.1% |
| cpral 3.2bpw | 151 | 162 | 3.20 | 3.313 | 0.0396 | 0.0418 | 94.3% | 77.5% | 57.7% | 40.0% | 26.6% |
| cpral 3.4bpw | 163 | 174 | 3.40 | 3.299 | 0.0282 | 0.0293 | 95.1% | 80.3% | 61.8% | 44.6% | 30.8% |
| cpral 3.55bpw | 168 | 180 | 3.55 | 3.288 | 0.0235 | 0.0239 | 95.5% | 81.9% | 64.4% | 47.6% | 33.6% |
| cpral 3.7bpw | 177 | 189 | 3.70 | 3.292 | 0.0216 | 0.0221 | 95.7% | 82.5% | 65.4% | 48.7% | 34.7% |
| cpral 4bpw | 192 | 204 | 4.04 | 3.293 | 0.0188 | 0.0193 | 95.9% | 83.4% | 66.9% | 50.5% | 36.5% |
| cpral 4.5bpw | 213 | 226 | 4.50 | 3.281 | 0.0126 | 0.0129 | 96.7% | 86.0% | 71.3% | 55.9% | 42.0% |
| cpral 5bpw | 240 | 254 | 5.04 | 3.282 | 0.0096 | 0.0097 | 97.1% | 87.6% | 74.3% | 59.9% | 46.3% |
| BF16 original | 752 | 807 | 16.00 | 3.271 | - | - | - | - | - | - | - |
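As a rough aid for choosing a quant, the sketch below picks the entry with the lowest KL(q→o) whose weights fit in a given memory budget. The sizes and KL values are copied from the table. The `pick_quant` helper and the 16 GiB default headroom for KV cache and activations are illustrative assumptions, not measured requirements, so adjust the headroom to your context length and backend.

```python
from typing import NamedTuple, Optional

class Quant(NamedTuple):
    name: str
    size_gib: float
    ppl: float
    kl_q_to_o: float

# Values copied from the table above.
QUANTS = [
    Quant("2bpw", 97, 3.623, 0.1682),
    Quant("2.65bpw", 124, 3.342, 0.0685),
    Quant("3bpw", 144, 3.334, 0.0491),
    Quant("3.2bpw", 151, 3.313, 0.0396),
    Quant("3.4bpw", 163, 3.299, 0.0282),
    Quant("3.55bpw", 168, 3.288, 0.0235),
    Quant("3.7bpw", 177, 3.292, 0.0216),
    Quant("4bpw", 192, 3.293, 0.0188),
    Quant("4.5bpw", 213, 3.281, 0.0126),
    Quant("5bpw", 240, 3.282, 0.0096),
]

def pick_quant(total_vram_gib: float, headroom_gib: float = 16.0) -> Optional[Quant]:
    """Return the lowest-KL quant whose weights fit in total VRAM minus headroom.

    headroom_gib is a hypothetical allowance for KV cache and activations.
    Returns None when even the smallest quant does not fit.
    """
    budget = total_vram_gib - headroom_gib
    fitting = [q for q in QUANTS if q.size_gib <= budget]
    if not fitting:
        return None
    return min(fitting, key=lambda q: q.kl_q_to_o)

# Example: 192 GiB of total VRAM leaves a 176 GiB budget for weights.
choice = pick_quant(192)
print(choice.name if choice else "no quant fits")  # prints "3.55bpw"
```

Ranking by KL rather than PPL is a deliberate choice here: KL decreases at every step up in size in the table, while PPL differs by only a few thousandths among the larger quants.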

Methodology

The methodology I used to create these custom quants is documented at https://github.com/adamo1139/qwen397b-exl3 and is mostly reproducible (I may have manually overridden some auto-generated configs in minor ways). The custom override configs are included in the model repository of every quant produced with this method. I reused Goldkoron's per-module sensitivity data from Qwen 3.5 397B in the hope that this model would have similar sensitivity. That does not quite seem to be the case: the custom optimized quants do not outperform the bigger baselines and are just intermediate steps between them.
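The observation that the quants form a steady progression can be checked against the published numbers. The sketch below computes which entries are Pareto-optimal, meaning no other entry is both no larger and no worse, once for size vs KL(q→o) and once for size vs PPL. The data is copied from the table. The `pareto_frontier` helper is an illustrative function, not part of the linked methodology.

```python
# (name, size_gib, kl_q_to_o, ppl), copied from the results table.
ROWS = [
    ("2bpw", 97, 0.1682, 3.623),
    ("2.65bpw", 124, 0.0685, 3.342),
    ("3bpw", 144, 0.0491, 3.334),
    ("3.2bpw", 151, 0.0396, 3.313),
    ("3.4bpw", 163, 0.0282, 3.299),
    ("3.55bpw", 168, 0.0235, 3.288),
    ("3.7bpw", 177, 0.0216, 3.292),
    ("4bpw", 192, 0.0188, 3.293),
    ("4.5bpw", 213, 0.0126, 3.281),
    ("5bpw", 240, 0.0096, 3.282),
]

def pareto_frontier(points):
    """Return names of points not dominated by any other point.

    points: list of (name, size, loss), where lower is better for both.
    A point is dominated if another point is no worse on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, size, loss in points:
        dominated = any(
            s <= size and l <= loss and (s < size or l < loss)
            for _, s, l in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

by_kl = pareto_frontier([(n, size, kl) for n, size, kl, _ in ROWS])
by_ppl = pareto_frontier([(n, size, ppl) for n, size, _, ppl in ROWS])

print(len(by_kl))   # 10: every quant is on the size/KL frontier
print(by_ppl)       # 3.7bpw, 4bpw and 5bpw drop out on the size/PPL frontier
```

On KL every quant sits on the frontier, so each step up in size buys a lower divergence. On PPL the three excluded entries differ from a smaller neighbour by 0.005 or less, which may well be within measurement noise (an assumption, since the card reports no error bars).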

Credits

Thanks to @mratsim for sharing his methodology. Thanks to @Goldkoron for sharing the per-module KLD sensitivity chart for Qwen 3.5 397B. Thanks to @turboderp for creating exllamav3.

Potential for future work

Future work could produce better quants by redoing the sensitivity analysis for Nex N2 Pro, tweaking the superlinear penalty, incorporating 6bpw and 7bpw baselines, and quantizing individual experts to varying degrees as informed by REAP/REAM data. The methodology used here should also be applicable to other MoE models such as the GLM 4.5 family and Qwen 3.5 122B, and it might carry over to the GGUF ecosystem as well.

TODO

Quantization_bits metadata is misleading: it is a constant 3.00 in every quant because it was copied over from one of the baseline quants. head_bits might be incorrect in the same way.
