Nex-N2-Pro-mlx-8bit

8-bit (affine, group-size 64) MLX quantization of nex-agi/Nex-N2-Pro (qwen3_5_moe, 397B-A17B). Produced with mlx_lm and verified serving distributed across a 4-node Apple-Silicon JACCL/RDMA cluster (coherent output, ~29 tok/s decode).

Highlights

  • Router precision preserved. Per the qwen3_5 quantization predicate, every layer's mlp.gate and mlp.shared_expert_gate are kept at 8-bit (the quantization map in config.json lists all 60 layers) — MoE routing stability is not degraded. A_log (GatedDeltaNet decay) stays fp32.
  • Tensor-parallel / pipeline ready. Runs under mlx_lm distributed serving (mlx.launch --backend jaccl); validated 4-node and 2-node.
  • Text serving. mlx_lm strips the vision tower at load and serves the text language_model (responses carry a separate reasoning field — it's a thinking model).

Use with MLX

pip install mlx-lm
mlx_lm.generate --model mlx-community/Nex-N2-Pro-mlx-8bit \
  --prompt "Write a Python function to merge two sorted lists." --max-tokens 512

Distributed (hostfile with the JACCL RDMA device matrix, per the MLX docs):

mlx.launch --backend jaccl --hostfile hostfile.json -- \
  python -m mlx_lm server --model mlx-community/Nex-N2-Pro-mlx-8bit --port 8080

Quantization details

Method MLX affine (mlx_lm)
Bits 8
Group size 64
Router / shared-expert gates kept 8-bit (predicate)
Size on disk ~392 GB (91 shards)
Architecture qwen3_5_moe, 60 layers (45 GatedDeltaNet linear + 15 full-attention), 512 experts, 262K ctx

Quantized from the full-precision nex-agi/Nex-N2-Pro weights. A distillation-aware 4-bit (DWQ) variant is in progress. Apache-2.0, inherited from the base model.

Downloads last month
238
Safetensors
Model size
396B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mlx-community/Nex-N2-Pro-mlx-8bit

Quantized
(29)
this model