Qwen3.6-35B-A3B MTPLX Optimized Speed

Fast local 35B-A3B inference for Apple Silicon, packaged for MTPLX native Multi-Token-Prediction speculative decoding.

This is the speed-focused 35B checkpoint: a compact 4-bit MLX body with calibrated INT4 MTP heads, tuned so MTPLX can draft more confidently and verify less work per generated token.

Run It

brew install youssofal/mtplx/mtplx
mtplx start
mtplx run "hello" --model Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Speed

For an OpenAI-compatible local server:

mtplx serve --model Youssofal/Qwen3.6-35B-A3B-MTPLX-Optimized-Speed --profile sustained --max --port 8000 --no-stats-footer

Why This Exists

MTPLX uses the model's own MTP heads to generate draft tokens, then verifies them with the main model. When the draft heads are well-matched, you get higher throughput without using a separate drafter model.

This checkpoint is optimized for that path. MTPLX reads mtplx_runtime.json and selects the measured speed defaults automatically.

Recommended Runtime Defaults

Setting Value
Backend qwen3-next-mtp
Default depth D1
Verifier strategy target_prefix
Target sampler temp=0.60, top_p=0.95, top_k=20
Draft sampler temp=0.60, top_p=0.95, top_k=20
Profile sustained
Benchmark fan mode max

Performance

Measured in MTPLX Sustained Max on Apple Silicon with reasoning enabled and a 32k token response allowance. All recorded runs finished naturally without a length stop.

Mode TPS Verify time Acceptance
AR baseline 94.46 - -
D1 promoted default 138.39 69.30s 0.8858
D2 comparison 135.66 49.23s 0.8701, 0.6409
D3 comparison 107.67 46.45s 0.8291, 0.5414, 0.2783

D1 is the promoted default because it gives the best real-use balance: strong TPS, high acceptance, and lower verify cost than the earlier 35B speed baseline.

Model Build

Component Format
Main body 4-bit MLX affine, group size 64
Router and gate tensors 8-bit where recorded by config
MTP numbered-expert weights calibrated INT4, group size 32
Norms, scales, biases, plain tensors BF16

This is not a full-precision checkpoint. It is built for fast local use on Apple Silicon through MTPLX.

Files

  • model-*.safetensors: MLX 4-bit body shards
  • mtp.safetensors: calibrated INT4 MTP sidecar
  • mtplx_runtime.json: MTPLX runtime contract and measured defaults
  • MTPLX_PUBLISH_MANIFEST.json: file sizes and benchmark summary
  • tokenizer and config files for local loading
Downloads last month
899
Safetensors
Model size
6B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support