Qwen3.6-35B-A3B-MXFP4-MTP

Qwen3.6-35B-A3B quantized to native MXFP4 for Apple Silicon, with the vision tower and the native Multi-Token-Prediction head preserved and enabled.


Source	Qwen/Qwen3.6-35B-A3B
License	Apache-2.0, inherited from upstream
Format	MXFP4 (`mx.quantize`, `mode="mxfp4"`, `group_size=32`)
Architecture	`qwen3_5_moe` — 40 layers, 256 routed experts, top-8, ~3B active
Modality	image + video + text
Context	262,144
Bundle size	21.53 GB
MTP	native head preserved, enabled (`num_nextn_predict_layers=1`)

Quantization

Linear layers are quantized to 4 bits with MLX-native `mx.quantize` (`mode="mxfp4"`, `group_size=32`). Norms, router gates, expert biases and the full vision tower are kept as fp16 passthrough (643 passthrough tensors). MTP linears are quantized to MXFP4; MTP norm/control tensors stay fp16. This is the smallest bundle in the MoE line: the same model as the MXFP8 variant at roughly 60% of the size (21.5 GB vs. 35.0 GB).
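As a rough illustration of what a 32-element MXFP4 group stores, here is a minimal pure-Python sketch of block quantization in the style of the OCP Microscaling format: one shared power-of-two scale per group plus one FP4 (E2M1) value per element. It is not MLX's kernel. The function names are hypothetical, elements are kept as decoded floats rather than packed 4-bit codes, and ties round toward the smaller magnitude for simplicity.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2     # largest E2M1 exponent: 6.0 == 1.5 * 2**2
GROUP_SIZE = 32   # elements sharing one scale, matching group_size=32


def quantize_group(values):
    """Return (shared_exponent, elements) for one group of floats.

    Hypothetical helper: `elements` holds decoded E2M1 values, not packed bits.
    """
    assert len(values) == GROUP_SIZE
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0.0] * GROUP_SIZE
    # Power-of-two scale that places the largest magnitude in the top E2M1 binade.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        # Clamp to the largest representable magnitude, then snap to the grid.
        scaled = min(abs(v) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - scaled))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_group(shared_exp, elements):
    """Reconstruct approximate floats from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


# Usage: quantize one group and report the worst-case reconstruction error.
weights = [i / 8 - 2 for i in range(GROUP_SIZE)]
exp, elems = quantize_group(weights)
restored = dequantize_group(exp, elems)
print(exp, max(abs(a - b) for a, b in zip(weights, restored)))
```

Because the scale is a power of two and there is no per-group offset, the only per-group metadata is the shared exponent.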

Multi-Token Prediction

This bundle keeps Qwen3.6's native MTP module and runs it as a self-speculative draft head: the MTP head proposes tokens that the main model verifies in a single pass. Only tokens the main model agrees with are kept, so decoded output stays bit-identical to plain autoregressive decoding while throughput improves.
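The draft-and-verify loop can be sketched with toy stand-ins for the two models. Everything below is hypothetical and for illustration only: `target_next` plays the main model's greedy next-token choice, `draft_next` plays an imperfect MTP head, and verification is simulated position by position where a real runtime would score all drafted positions in one batched forward pass.

```python
def target_next(context):
    """Toy stand-in for the main model's greedy next token."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Toy stand-in for the MTP head: wrong whenever len(context) % 4 == 0."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def decode_baseline(prompt, n_tokens):
    """Plain autoregressive decoding: one main-model pass per token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def decode_speculative(prompt, n_tokens, depth=3):
    """Draft `depth` tokens, verify them, and keep the agreed prefix."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft `depth` tokens with the cheap head.
        draft = []
        for _ in range(depth):
            draft.append(draft_next(out + draft))
        # 2. Verify against the main model (counted as one pass).
        verify_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: take the main model's token
                break
            accepted.append(tok)
        else:
            # Every draft was accepted; the verify pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens], verify_passes


prompt = [7, 3, 9]
baseline = decode_baseline(prompt, 20)
speculative, passes = decode_speculative(prompt, 20, depth=3)
print(speculative == baseline, passes)
```

Every token that reaches the output is one the main model chose for that exact context, which is why the result matches the baseline regardless of how often the draft is wrong; draft quality only changes how many verify passes are needed.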

Recorded on an M5 Max (vMLX runtime, 96-token deterministic prompt, output verified equal to baseline at every depth):

Draft depth	tok/s	Speedup
Baseline (MTP off)	83.9	1.00×
D1	108.8	1.30×
D2	126.0	1.50×
D3 (default)	131.2	1.56×

Absolute tok/s depends on free memory and system load. The speedup ratio — baseline vs. MTP measured back-to-back under identical conditions — is the stable figure.
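For reference, the speedup column is simply each throughput divided by the baseline. The short sketch below recomputes it from the tok/s figures in the table and also shows the throughput gained by each additional draft step; the variable names are illustrative.

```python
# Throughput figures copied from the table above (tokens per second).
tok_per_s = {"baseline": 83.9, "D1": 108.8, "D2": 126.0, "D3": 131.2}

# Speedup relative to decoding with MTP off.
speedups = {
    depth: round(rate / tok_per_s["baseline"], 2)
    for depth, rate in tok_per_s.items()
    if depth != "baseline"
}

# Throughput gained by each additional draft step.
order = ["baseline", "D1", "D2", "D3"]
marginal_gain = {
    later: round(tok_per_s[later] - tok_per_s[earlier], 1)
    for earlier, later in zip(order, order[1:])
}

print(speedups)       # {'D1': 1.3, 'D2': 1.5, 'D3': 1.56}
print(marginal_gain)  # {'D1': 24.9, 'D2': 17.2, 'D3': 5.2}
```

The shrinking marginal gain (24.9, 17.2, then 5.2 tok/s) shows that each extra draft step buys less on this prompt.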

Vision, MTP and caching together

This bundle preserves the full Qwen3.6 VL tower alongside the native MTP head, so image/video input, MTP speculative decode and prefix/KV caching all run in the same session — a combination not every MTP-enabled Qwen build exposes. The VL stack is the same one verified on the MXFP8 sibling.

Loading

Loads with stock MLX tooling on Apple Silicon: the weights are native `mx.quantize` MXFP4, so no JANG runtime is required for the core model.

```python
from mlx_vlm import load, generate

# Returns the quantized model and its processor.
model, processor = load("OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP")
```

The MTP draft path is exercised by an MTP-aware runtime (vMLX); other runtimes load and decode the main model normally and ignore the MTP head.

Variants

Variant	Arch	Format	Size	Best MTP speedup
Qwen3.6-27B-MXFP4-MTP	dense	mxfp4	14.4 GB	1.85× (D2)
Qwen3.6-27B-MXFP8-MTP	dense	mxfp8	27.1 GB	1.83× (D3)
Qwen3.6-35B-A3B-MXFP4-MTP (this)	MoE	mxfp4	21.5 GB	1.56× (D3)
Qwen3.6-35B-A3B-MXFP8-MTP	MoE	mxfp8	35.0 GB	1.71× (D3)
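The size column can be sanity-checked with a little arithmetic. The sketch below assumes each 32-element group carries one 8-bit shared scale, as in the OCP Microscaling formats; how MLX actually lays out the stored tensors, and which tensors the MXFP8 siblings keep in fp16, are assumptions here rather than facts from this card.

```python
def bits_per_weight(element_bits, group_size=32, scale_bits=8):
    """Effective storage per quantized weight, including the shared scale."""
    return element_bits + scale_bits / group_size


mxfp4_bits = bits_per_weight(4)   # 4.25
mxfp8_bits = bits_per_weight(8)   # 8.25

# Ratio expected if every tensor in the bundle were quantized.
ideal_ratio = round(mxfp4_bits / mxfp8_bits, 3)

# (MXFP4 GB, MXFP8 GB) copied from the variants table above.
reported_sizes = {"27B dense": (14.4, 27.1), "35B-A3B MoE": (21.5, 35.0)}
reported_ratio = {
    name: round(small / large, 2) for name, (small, large) in reported_sizes.items()
}

print(ideal_ratio)     # 0.515
print(reported_ratio)  # {'27B dense': 0.53, '35B-A3B MoE': 0.61}
```

Both reported ratios sit above the ideal 0.515. That is the direction one would expect if some tensors, such as the fp16 passthrough set, occupy the same space in both formats, though the card does not break the sizes down.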

Credits

Quantization toolchain: JANG by Jinho Jang <eric@osaurus.ai>
Base model: Qwen3.6-35B-A3B by Qwen
