Qwen3.6-35B-A3B — combined trunk + MTP head — MLX 8-bit

A single-file MLX conversion of Qwen3.6-35B-A3B (a Qwen3-Next / MoE model, 256 experts, top-8, ~3B active params) that keeps the model's Multi-Token Prediction (MTP) head inline and fully intact, converted directly from the official Qwen bf16 weights. Built for lemon-mlx-engine on AMD ROCm.

Why this conversion is different

  • The MTP head is complete. Other MLX MTP conversions drop the head's 256 routed experts (keeping only the router + shared expert), which cripples draft quality. This conversion preserves all 256 head experts, so the speculative head drafts well: ~77% draft acceptance at n_draft=2 with the 4-bit trunk and ~70% with this 8-bit trunk (greedy), versus near-useless drafts when the experts are missing.
  • Direct-from-Qwen, correct Qwen3-Next handling. Two adjustments are applied at convert time: the zero-centered RMSNorm (effective weight = stored + 1.0, applied to every norm, including the three MTP head norms) and the conv1d weight layout. Skipping these produces incoherent output.
  • One file: trunk + MTP together. Loadable by lemon-mlx-engine's one-file path — no separate draft model to manage.
  • Draft-fidelity precision. The tiny (~0.5 GB) MTP head is kept in bf16 regardless of trunk precision, since it is quant-sensitive and over-quantizing it lowers acceptance.
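
The zero-centered RMSNorm point can be illustrated with a minimal sketch. It assumes the usual RMSNorm formula and uses hypothetical helper names (`fold_norm_weight`, `plain_rms_norm`); it is not the converter's actual code. It shows why adding 1.0 to the stored weight at convert time lets a conventional RMSNorm reproduce the checkpoint's behavior, and what goes wrong if the step is skipped.

```python
import math

def rms_normalize(x, eps=1e-6):
    """Scale x to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def zero_centered_rms_norm(x, stored, eps=1e-6):
    """Checkpoint convention: the stored weight is an offset from 1.0."""
    return [(1.0 + w) * v for v, w in zip(rms_normalize(x, eps), stored)]

def plain_rms_norm(x, weight, eps=1e-6):
    """Conventional RMSNorm: the weight is used as-is."""
    return [w * v for v, w in zip(rms_normalize(x, eps), weight)]

def fold_norm_weight(stored):
    """Hypothetical convert-time step: bake the +1.0 into the saved weight."""
    return [w + 1.0 for w in stored]

# Toy activations and toy stored (zero-centered) weights, for illustration only.
x = [1.0, -2.0, 3.0, -4.0]
stored = [0.0, 0.1, -0.1, 0.05]

reference = zero_centered_rms_norm(x, stored)         # what the model expects
folded = plain_rms_norm(x, fold_norm_weight(stored))  # converted weights, plain norm
unfolded = plain_rms_norm(x, stored)                  # conversion step skipped

print("reference:", reference)
print("folded   :", folded)
print("unfolded :", unfolded)
```

With the weights folded, the plain norm matches the reference. Without folding, the near-zero stored weights scale every activation toward zero, which is consistent with the incoherent output described above.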

Variants (this org)

precision   size     repo
4-bit       ~20 GB   LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-4bit
6-bit       ~28 GB   LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-6bit
8-bit       ~36 GB   LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit

In this repo (the 8-bit variant), trunk weights are quantized to 8-bit; the MTP head stays in bf16.
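
As a rough sanity check on the listed sizes, the following back-of-the-envelope sketch estimates the quantized trunk size. It assumes a nominal 35e9 parameters, all of them quantized, and group-wise affine quantization with a group size of 64 and a 16-bit scale plus 16-bit bias per group (the common MLX default, assumed here rather than confirmed for these repos). The bf16 MTP head (~0.5 GB) and any unquantized tensors are not included.

```python
def quantized_size_gb(n_params, bits, group_size=64, scale_bits=16, bias_bits=16):
    """Estimate storage in GB (1e9 bytes) for group-wise affine quantization.

    Each group of `group_size` weights carries one scale and one bias,
    which adds (scale_bits + bias_bits) / group_size bits per weight.
    """
    bits_per_weight = bits + (scale_bits + bias_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 35e9  # nominal count taken from the model name; an assumption

for bits in (4, 6, 8):
    print(f"{bits}-bit: ~{quantized_size_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the estimates come out near 20, 28, and 37 GB, in the same range as the sizes listed above. The exact figures depend on the true parameter count and on which tensors are left unquantized.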

Performance (Radeon 8060S / gfx1151 APU, lemon-mlx-engine)

  • 4-bit decode ≈ 40 tok/s, 8-bit ≈ 30 tok/s (no-MTP, greedy, ~1k ctx).
  • MTP draft acceptance is high at low draft counts (~77% @ n_draft=2, 4-bit; ~70% @ n_draft=2, 8-bit) and falls as draft length grows.
  • Note: on this MoE-A3B, MTP is roughly throughput-neutral — each draft token activates its own top-8 experts so the verification pass doesn't amortize the way it does for dense models (the same ceiling llama.cpp reports, ~1.2× on this class of APU). The value here is a correct, complete MTP head for speculative decoding and research, not a large speedup on this hardware.
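
To make the acceptance and throughput figures concrete, here is a small sketch of the standard speculative-decoding arithmetic. It assumes that the quoted acceptance is a per-token rate, that each draft token is accepted independently, and that drafting stops at the first rejection. Those are simplifying assumptions, not measured properties of this model, and `step_cost` is a hypothetical ratio rather than a benchmark result.

```python
def expected_tokens_per_step(p_accept, n_draft):
    """Expected tokens emitted per draft+verify step.

    Draft token i is kept only if tokens 1..i were all accepted
    (probability p_accept**i). The verify pass always contributes one
    more token: the correction after a rejection, or a bonus token.
    """
    return 1.0 + sum(p_accept ** i for i in range(1, n_draft + 1))

def speedup(p_accept, n_draft, step_cost):
    """Throughput relative to plain decode.

    step_cost is the cost of one draft+verify step measured in units of
    one plain decode step (hypothetical input, hardware dependent).
    """
    return expected_tokens_per_step(p_accept, n_draft) / step_cost

for label, p in (("4-bit", 0.77), ("8-bit", 0.70)):
    tokens = expected_tokens_per_step(p, n_draft=2)
    # Break-even: MTP only helps if a step costs less than `tokens` decode steps.
    print(f"{label}: {tokens:.2f} tokens/step, break-even step cost {tokens:.2f}x")
```

Under this model, n_draft=2 yields roughly 2.2 to 2.4 tokens per step, so MTP is throughput-neutral when a draft+verify step costs about that many plain decode steps. That fits the observation above that verification on this MoE does not amortize the way it does on a dense model.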

Usage (lemon-mlx-engine)

# plain decode
chat LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit --use-mtp=false
# speculative decode with the inline MTP head (n_draft=2 is the sweet spot here)
chat LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit --use-mtp --n-draft 2

Requirements

These are lemon-mlx-engine models (combined trunk+MTP, Qwen3-Next handling baked in) and target AMD ROCm. The 6-bit variant in particular requires lemon-mlx-engine's fixed ROCm quantized-matmul kernel (stock builds without that fix mis-handle 6-bit packing and produce garbage; 4-bit and 8-bit are unaffected).
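
One general reason 6-bit is more error-prone than 4-bit or 8-bit is that 6 does not divide 32, so densely packed values can straddle 32-bit word boundaries. The sketch below demonstrates this with a generic LSB-first packer. It is an illustration only and is not claimed to match the packed layout used by MLX or by the lemon-mlx-engine kernel.

```python
def pack_bits(values, bits):
    """Pack unsigned integers densely, LSB-first, into little-endian bytes."""
    acc = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bits):
            raise ValueError(f"value {v} does not fit in {bits} bits")
        acc |= v << (i * bits)
    n_bytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(n_bytes, "little")

def unpack_bits(data, bits, count):
    """Inverse of pack_bits."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]

def straddling_indices(bits, count, word_bits=32):
    """Indices of values whose bits span two consecutive words."""
    return [
        i for i in range(count)
        if (i * bits) // word_bits != ((i + 1) * bits - 1) // word_bits
    ]

values = [i % 64 for i in range(16)]  # sixteen 6-bit values, 96 bits total
packed = pack_bits(values, bits=6)

print(len(packed), "bytes")                        # 16 * 6 / 8
print(unpack_bits(packed, 6, len(values)) == values)
for bits in (4, 6, 8):
    print(bits, "bit straddlers:", straddling_indices(bits, 16))
```

A kernel that reads one 32-bit word at a time has to stitch together the values that cross a word boundary, which is a case that never arises for 4-bit or 8-bit packing.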

Provenance

Converted with lemon-mlx-engine's convert tool directly from the official Qwen/Qwen3.6-35B-A3B bf16 checkpoint. License inherited from the base model.
