Ornith-1.0-35B-MTPLX

A multi-token-prediction (MTP) graft onto deepreinforce-ai/Ornith-1.0-35B, packaged for MTPLX native speculative decoding on Apple Silicon.

Ornith-1.0-35B (qwen35moe, 35B-A3B, Qwen3.5 base) is a strong agentic-coding MoE but ships without MTP heads. This build grafts the official 1-layer MoE-MTP from Qwen/Qwen3.5-35B-A3B (the upstream base of Ornith, so dimensions match exactly) and quantizes the MTP to 4-bit, which is what makes the speedup practical (see "Why 4-bit MTP" below).

Performance (M3 Max, measured)

Mode          tok/s   Speedup   MTP acceptance
AR (no MTP)    76.3   1.00×     n/a
MTP depth 1   103.6   1.36×     89.6%
MTP depth 2   114.9   1.50×     93.1% / 78.2%
MTP depth 3   116.4   1.53×     91.5% / 80.3% / 65.6%

verdict: mtp_depth_wins · MTPLX inspect tier: verified
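
As a quick sanity check on the table, the sketch below recomputes the Speedup column as each mode's tok/s divided by the AR baseline; the ratios agree with the reported values to within 0.01. It also estimates the expected number of tokens emitted per verification step, under the assumption (not stated in the card) that each acceptance figure is the rate at that draft position given that all earlier positions were accepted. The function names are illustrative and are not part of MTPLX.

```python
# Measured decode throughput (tok/s) on M3 Max, copied from the table.
measured = {
    "AR (no MTP)": 76.3,
    "MTP depth 1": 103.6,
    "MTP depth 2": 114.9,
    "MTP depth 3": 116.4,
}


def speedups(throughput, baseline="AR (no MTP)"):
    """Return each mode's throughput relative to the baseline mode."""
    base = throughput[baseline]
    return {mode: tps / base for mode, tps in throughput.items()}


def expected_tokens_per_step(acceptance):
    """Expected tokens per verify step: 1 + a1 + a1*a2 + ...

    ASSUMPTION: each rate is conditional on all earlier drafts being accepted.
    """
    total, survive = 1.0, 1.0
    for rate in acceptance:
        survive *= rate   # probability the draft chain survives to this depth
        total += survive
    return total


ratios = speedups(measured)
for mode, ratio in ratios.items():
    print(f"{mode}: {ratio:.2f}x")

depth3_tokens = expected_tokens_per_step([0.915, 0.803, 0.656])
print(f"depth 3 expected tokens/step: {depth3_tokens:.2f}")
```

Under that assumption, each additional draft position contributes less than the one before it, which is consistent with the small gain from depth 2 to depth 3.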

Why 4-bit MTP (not bf16)

A bf16 graft of this same MTP layer was a net ~20× slowdown despite 92% acceptance. The MTP is a full 256-expert MoE layer, and at bf16 its draft forward pass costs 67–143 ms/token because it does not hit the fast MoE kernel. Quantizing the MTP experts to 4-bit affine drops the draft cost to ~2.5 ms/token (27–51× cheaper) with negligible acceptance loss, turning the result into a real 1.53× speedup. The bottleneck was draft cost, not acceptance.
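
The point that draft cost, not acceptance, decides the outcome can be illustrated with a deliberately simplified depth-1 cost model. This is a rough sketch and not how MTPLX measures anything: it assumes one verification pass costs the same as one AR step (1000 / 76.3 ms), that each step adds exactly one draft forward pass, and that all other overhead is zero. It therefore does not reproduce the measured ~20× slowdown or the 1.36× depth-1 speedup exactly, only the direction of the effect.

```python
def depth1_speedup(ar_tok_s, draft_ms, acceptance):
    """Estimated depth-1 speculative speedup under a simplified cost model.

    ASSUMPTIONS: verify pass == one AR forward pass; one draft pass per step;
    no scheduling or sampling overhead.
    """
    ar_ms = 1000.0 / ar_tok_s            # time for one target forward pass
    step_ms = ar_ms + draft_ms           # verify + draft
    tokens_per_step = 1.0 + acceptance   # one guaranteed token + accepted draft
    return tokens_per_step * ar_ms / step_ms


# bf16 draft at the low end of the reported 67-143 ms range, 92% acceptance.
bf16 = depth1_speedup(ar_tok_s=76.3, draft_ms=67.0, acceptance=0.92)
# 4-bit draft at ~2.5 ms with the reported depth-1 acceptance of 89.6%.
q4 = depth1_speedup(ar_tok_s=76.3, draft_ms=2.5, acceptance=0.896)

print(f"bf16 draft:  {bf16:.2f}x")
print(f"4-bit draft: {q4:.2f}x")
```

Even with slightly lower acceptance, the 4-bit draft comes out ahead in this model because the draft pass shrinks from several times the cost of a target step to a fraction of it.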

Quantization

  • Body: 4-bit affine, group size 64
  • MTP sidecar: 4-bit affine, group size 64 (applied at load via mtplx_mtp_quantization in config.json)
  • Architecture: Qwen3_5MoeForConditionalGeneration / MTPLX arch_id qwen3-next-mtp
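
For readers unfamiliar with the scheme, here is a minimal pure-Python sketch of 4-bit affine quantization on one group of 64 weights. Each group stores a scale and a bias, and every weight becomes an integer code in 0..15 such that w ≈ code * scale + bias. This illustrates the general technique only; it is not the MLX or MTPLX kernel, and the effective-bits figure assumes a 16-bit scale and a 16-bit bias per group.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group: returns (codes, scale, bias)."""
    levels = (1 << bits) - 1                 # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes."""
    return [c * scale + bias for c in codes]


def effective_bits(bits=4, group_size=64, meta_bits=32):
    """Bits per weight including per-group metadata.

    ASSUMPTION: 16-bit scale + 16-bit bias per group (meta_bits=32).
    """
    return bits + meta_bits / group_size


group = [i / 63 - 0.5 for i in range(64)]    # toy group of 64 weights
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"max round-trip error: {max_err:.4f} (half a step is {scale / 2:.4f})")
print(f"effective bits/weight: {effective_bits()}")
```

The round-trip error is bounded by half a quantization step, which is why a smaller group size (a tighter min/max range per group) generally improves accuracy at the cost of more metadata.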

Usage (MTPLX)

mtplx tune  --model <path-to-this-model>      # confirms best MTP depth (D3, ~1.53×)
mtplx start --model <path-to-this-model>      # serve with MTP speculative decoding

Provenance & licensing

This derivative is released under MIT, preserving the base model's license. The grafted MTP tensors originate from the Apache-2.0 licensed Qwen3.5-35B-A3B; that license and its NOTICE apply to those tensors. No weights were retrained — this is a graft + quantization repackaging.

Graft notes (reproducibility)

  • MTP mtp.* tensors lifted from Qwen3.5-35B-A3B (785 tensors: mtp.fc + mtp.layers.0.* with 256 experts).
  • Qwen3.5 RMSNorm uses delta encoding → MTP sidecar carries mtplx_mtp_norm_encoding="delta".
  • Forged with MTPLX forge (mtp_policy=requantize, --allow-degraded-mtp); contract calibrated exact_agreement.
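
The norm-encoding note matters because a loader that misreads the convention silently produces wrong activations. The sketch below assumes "delta" means the stored RMSNorm weight is an offset from 1, so the effective gain is 1 + weight. That reading is an assumption, and the function names are hypothetical rather than MTPLX APIs.

```python
import math


def _rms(x, eps):
    """Root mean square of a vector, with eps for numerical stability."""
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)


def rms_norm_plain(x, weight, eps=1e-6):
    """Conventional RMSNorm: the stored weight is the gain itself."""
    r = _rms(x, eps)
    return [(v / r) * w for v, w in zip(x, weight)]


def rms_norm_delta(x, delta_weight, eps=1e-6):
    """RMSNorm with delta-encoded weights: gain = 1 + stored weight (assumed)."""
    r = _rms(x, eps)
    return [(v / r) * (1.0 + d) for v, d in zip(x, delta_weight)]


def delta_to_plain(delta_weight):
    """Convert delta-encoded weights to conventional gains."""
    return [1.0 + d for d in delta_weight]


x = [1.0, -2.0, 3.0, 0.5]
stored = [0.0, 0.1, -0.2, 0.05]          # delta-encoded weights

correct = rms_norm_delta(x, stored)
converted = rms_norm_plain(x, delta_to_plain(stored))
misread = rms_norm_plain(x, stored)      # wrong: treats deltas as gains

print("delta    :", [round(v, 4) for v in correct])
print("converted:", [round(v, 4) for v in converted])
print("misread  :", [round(v, 4) for v in misread])
```

Treating the deltas as plain gains collapses the activations toward zero, which is the kind of mismatch the sidecar flag is there to prevent.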