Nex-N2-mini-MLX-VLM-4bit-MTP

Native MLX-VLM 4-bit quantized version of nex-agi/Nex-N2-mini, with a grafted MTP (multi-token prediction) head for speculative decoding under oMLX Native MTP.

Summary

  • Base model: nex-agi/Nex-N2-mini
  • Format: native MLX / MLX-VLM
  • Quantization: 4-bit MLX-VLM quantization
  • Vision: supported
  • MTP: included
  • Target runtime: oMLX with Native MTP enabled
  • Direct mlx-vlm.generate: not the supported runtime for this MTP variant

What changed in this version

This repository combines the native MLX-VLM trunk and vision weights from Nex-N2-mini-MLX-VLM-4bit with an MTP head grafted from jedisct1/Nex-N2-mini-mlx-OptiQ-4bit-MTP.

This is not the same as jedisct1/Nex-N2-mini-mlx-OptiQ-4bit-MTP:

  • This repo keeps the native MLX-VLM layout.
  • This repo includes the vision tower / VLM weights.
  • The jedisct1 OptiQ MTP repository is text-only.
  • The MTP head is used only for speculative decoding acceleration.

Important compatibility note

This MTP variant includes language_model.mtp.* weights intended for oMLX Native MTP.

The validated runtime is:

  • oMLX
  • Native MTP enabled in model settings

Plain mlx-vlm.generate and generic MLX loading paths are not validated for this variant. They may reject the extra language_model.mtp.* tensors under strict loading, or load the model without using the MTP head.
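To check whether a local copy of a model actually carries the MTP tensors, the tensor names can be read from the safetensors headers without loading any weights. The following is a minimal sketch using only the Python standard library. It relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header). The helper names tensor_names and find_mtp_tensors are illustrative and are not part of mlx-vlm or oMLX.

```python
import json
import struct
from pathlib import Path

# Prefix named in this model card for the grafted MTP head.
MTP_PREFIX = "language_model.mtp."


def tensor_names(path):
    """Return tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        # A safetensors file starts with an 8-byte little-endian header length,
        # followed by that many bytes of JSON describing each tensor.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def find_mtp_tensors(model_dir):
    """Collect tensor names starting with MTP_PREFIX across all shards."""
    found = []
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        found.extend(n for n in tensor_names(shard) if n.startswith(MTP_PREFIX))
    return found
```

An empty result from find_mtp_tensors suggests a non-MTP variant. A non-empty result means a loader that does not expect these tensors may need to skip them or may fail strict loading.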

For direct mlx-vlm.generate usage, use the non-MTP variant instead:

joowon-jang/Nex-N2-mini-MLX-VLM-4bit

Quality and behavior

The MTP head is used for speculative decoding. Draft tokens are verified by the main Nex-N2-mini trunk before being accepted, so the MTP head is intended to affect speed rather than final output quality.
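The verification step can be illustrated with a toy greedy decoder. This is a conceptual sketch only and not the oMLX implementation: the "models" are plain Python functions over integer tokens, verification is done one token at a time rather than in a single batched forward pass, and all names (greedy, speculative_greedy, toy_target, toy_draft) are hypothetical. It shows that, under greedy decoding, every emitted token is the main model's own choice, so a poor draft head costs speed but does not change the output.

```python
def greedy(target_next, prompt, max_new_tokens):
    """Plain greedy decoding with the main model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_greedy(target_next, draft_next, prompt, max_new_tokens, max_draft=2):
    """Greedy speculative decoding; returns (tokens, accepted_draft_count)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    accepted = 0
    while len(tokens) < limit:
        # 1. The draft head proposes a short run of tokens.
        draft = []
        for _ in range(max_draft):
            draft.append(draft_next(tokens + draft))
        # 2. The main model checks each proposal in order.
        for guess in draft:
            if len(tokens) >= limit:
                break
            expected = target_next(tokens)
            tokens.append(expected)  # the emitted token is always the main model's
            if guess != expected:
                break  # discard the remaining draft tokens
            accepted += 1
    return tokens, accepted


def toy_target(tokens):
    """Toy main model: deterministic next token in the range 0..6."""
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens):
    """Toy draft head: agrees with the main model only after odd tokens."""
    return toy_target(tokens) if tokens[-1] % 2 else -1
```

Calling speculative_greedy(toy_target, toy_draft, [1], 10) yields the same token list as greedy(toy_target, [1], 10). Only the count of accepted draft tokens depends on the draft function.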

In practice, MTP speedups depend on draft acceptance rate. The grafted head tends to help more on normal prose and reasoning, and less on unusual token sequences.
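As a rough way to reason about this dependence, the expected number of tokens emitted per verification pass can be estimated from the acceptance rate. The sketch below makes two simplifying assumptions that are not claims about oMLX: each draft token is accepted independently with the same probability, and every pass emits one token from the main model in addition to the accepted draft tokens. Under those assumptions the expectation is 1 + a + a^2 + ... + a^k for acceptance rate a and k draft tokens. The function name is illustrative.

```python
def expected_tokens_per_pass(acceptance_rate, max_draft):
    """Expected tokens emitted per main-model pass under an i.i.d. acceptance model.

    acceptance_rate: assumed probability that a single draft token is accepted.
    max_draft: number of draft tokens proposed per pass (e.g. Max Draft Tokens).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if max_draft < 0:
        raise ValueError("max_draft must be non-negative")
    # Draft token i is only reached if all earlier draft tokens were accepted,
    # so it contributes acceptance_rate ** i; the i = 0 term is the token the
    # main model always emits.
    return sum(acceptance_rate ** i for i in range(max_draft + 1))
```

With max_draft=2, an acceptance rate of 0.5 gives 1.75 tokens per pass, and a rate of 1.0 gives the ceiling of 3. This is an upper bound on the throughput gain rather than a wall-clock speedup, since drafting and verification add their own cost.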

Recommended runtime

Use oMLX and enable Native MTP.

Suggested initial settings:

  • Native MTP: enabled
  • Max Draft Tokens: 2
  • Min Draft Tokens: 1
  • Temperature: 0 for benchmarking
  • Use the same prompt, context length, and max tokens when comparing against non-MTP variants

Notes

This is not an OptiQ oQ4 sidecar model. The model uses a native MLX-VLM layout with vision_tower.* weights included in the model files.

MTP head attribution:

  • MTP head source: jedisct1/Nex-N2-mini-mlx-OptiQ-4bit-MTP
  • Original base model: nex-agi/Nex-N2-mini
  • Donor MTP lineage described by the source model: Qwen3.5-35B-A3B MTP head grafted onto Nex-N2-mini-compatible dimensions

License

Apache-2.0, following the base model and referenced MTP-head source license.

Model files: Safetensors, 6B params, tensor types BF16 and U32.