Nex-N2-Pro optimized for MLX. This is one of the best coding models that runs on a Mac Studio!

  • A mixed-precision quant that balances speed, memory, and accuracy.
  • 4-bit baseline with important layers at higher precision.
  • Supports image input and requires a vision-capable MLX server.

Usage

# Start server at http://localhost:8080/v1/chat/completions
uvx --from mlx-vlm mlx_vlm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Nex-N2-Pro-MLX-5.3bit-vision

Benchmarks

Tested on a Mac Studio M3 Ultra.

metric this model
bpw 5.349
base memory 246.796
peak memory (1024/512) 267.043
prompt tok/s (1024) 475.490 ± 0.195
gen tok/s (512) 30.802 ± 0.154
kl mean* 0.012 ± 0.001
kl p95* 0.029 ± 0.001
perplexity 3.677 ± 0.023
ifbench_strict 0.470 ± 0.050
ifbench_loose 0.520 ± 0.050
arc_challenge 0.696 ± 0.021
hellaswag 0.922 ± 0.012

*KL was measured against the largest quant I could run (~495GB), so real value is higher.

Methodology

Quantized with a mlx-vlm fork. MLX quantization options differ than llama.cpp, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision

Related tooling:

Downloads last month
-
Safetensors
Model size
74B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for spicyneuron/Nex-N2-Pro-MLX-5.3bit-vision

Quantized
(2)
this model