Qwen3.6-35B-A3B: Core AI (gather_qmm kernel, 2.1× faster)

Apple Core AI (.aimodel) conversion of Qwen/Qwen3.6-35B-A3B (text decoder): Qwen3.5's hybrid GatedDeltaNet + gated-attention body with a 256-expert top-8 sparse MoE (+ shared expert). 35B total / ~3B active per token.

Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo (full card: zoo/qwen3.6.md).

The gather_qmm kernel: 30.9 → 64.9 tok/s (2.1×)

Apple's GatherMM composite gathers the routed experts and then runs a dense matmul that reads the weights of all 256 experts on every token, so decode is bound by over-reading weights and reaches only 30.9 tok/s. This bundle instead uses a custom coreai_torch.TorchMetalKernel that takes the routed indices as a kernel input and reads only the weight slabs of the 8 routed experts (8 of 256). Decode therefore runs at active-parameter (~3B) bandwidth: 64.9 tok/s, a 2.1× speedup.
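The difference between the two read patterns described above can be sketched in plain Python. This is a toy model of the access pattern only, not Apple's GatherMM and not the Metal kernel in this bundle: the layer sizes, the helper names (`moe_dense_read`, `moe_gather_read`), and the uniform gate values are all assumptions made for illustration. Only the 256-expert, top-8 shape comes from the card.

```python
import random

NUM_EXPERTS, TOP_K = 256, 8   # expert count and top-k from the card
D_IN, D_OUT = 4, 3            # toy sizes, assumed for illustration only

random.seed(0)
# One weight slab per expert: weights[e] is a D_OUT x D_IN matrix.
weights = [[[random.uniform(-1, 1) for _ in range(D_IN)] for _ in range(D_OUT)]
           for _ in range(NUM_EXPERTS)]

def matvec(w, x):
    """Multiply one expert slab by the token vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def moe_dense_read(x, routed, gates):
    """Touch every expert slab, then keep only the routed outputs."""
    all_out = [matvec(weights[e], x) for e in range(NUM_EXPERTS)]
    y = [0.0] * D_OUT
    for e, g in zip(routed, gates):
        for j in range(D_OUT):
            y[j] += g * all_out[e][j]
    return y, NUM_EXPERTS          # number of slabs read

def moe_gather_read(x, routed, gates):
    """Index by the routed ids first, so only the routed slabs are read."""
    y = [0.0] * D_OUT
    for e, g in zip(routed, gates):
        out = matvec(weights[e], x)
        for j in range(D_OUT):
            y[j] += g * out[j]
    return y, len(routed)          # number of slabs read

x = [0.5, -1.0, 2.0, 0.25]
routed = [3, 17, 42, 99, 128, 200, 250, 255]   # hypothetical router output
gates = [1.0 / TOP_K] * TOP_K                  # hypothetical uniform gates
y_dense, n_dense = moe_dense_read(x, routed, gates)
y_gather, n_gather = moe_gather_read(x, routed, gates)
print(n_gather / n_dense)   # fraction of expert slabs read per token
```

Both functions return the same output vector; only the number of slabs touched differs. The end-to-end speedup reported in the card is smaller than the slab ratio, which is expected if the expert weights are not the only memory traffic per token.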

Quality is clean and unchanged. The kernel reads the sym8 scheme, which is the same symmetric-linear int8 recipe (per-K-block-32) that the standard int8 bundle uses, through a bit-exact gather. It introduces 0 flips out of 18 vs fp16 (the shipped GatherMM int8 scored 14/16 vs the bf16 oracle; this matches it). The result is a pure speed win at the same quality.
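For readers unfamiliar with the term, a symmetric-linear int8 scheme with one scale per block of 32 values along K can be sketched as below. This is a minimal illustration, not the bundle's converter: the function names, the max-abs scale rule, and the clamp to [-127, 127] are assumptions, and the real recipe may differ in rounding or range.

```python
import random

BLOCK = 32  # per-K-block size named in the card

def quantize_sym8(row):
    """Quantize one weight row along K; returns (int8 codes, per-block scales)."""
    codes, scales = [], []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        # Symmetric: no zero point, the scale maps max |w| to 127 (assumed rule).
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in block)
    return codes, scales

def dequantize_sym8(codes, scales):
    """Recover approximate weights: code times the scale of its block."""
    return [q * scales[i // BLOCK] for i, q in enumerate(codes)]

random.seed(0)
row = [random.uniform(-1, 1) for _ in range(64)]   # K = 64, so two blocks
codes, scales = quantize_sym8(row)
restored = dequantize_sym8(codes, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Because gathering only selects which stored codes and scales are read, it does not change their values, which is consistent with the card's statement that the gather is bit-exact relative to the standard int8 bundle.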

| bundle | size | decode tok/s | quality |
| --- | --- | --- | --- |
| gpu-pipelined/qwen3_6_35b_a3b_decode_sym8_gather/ | 35 GB | 64.9 | clean (0 flips/18 vs fp16) ✅ |

Mac-only (35 GB int8 is far past the iPhone limit; this is the 64/128 GB-Mac flagship).

Run

COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/qwen3_6_35b_a3b_decode_sym8_gather -p 128 -g 256 -n 3

The decode graph's input_ids is static [1,1]; prefill runs as S=1 pipelined steps. Convert your own with conversion/export_qwen3_6_moe_metal_decode_pipelined.py.
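Running prefill as single-token steps means the prompt goes through the same [1,1] decode call as generation, one token at a time. The loop below is a hedged sketch of that control flow only: `toy_decode_step` is a hypothetical stand-in with an arbitrary next-token rule, not the Core AI runtime API, and the list used as state merely plays the role of the cached attention and recurrent state.

```python
def toy_decode_step(token_id, state):
    """Hypothetical stand-in for one [1,1] decode call: returns (next_token, new_state)."""
    new_state = state + [token_id]              # stands in for the cached model state
    next_token = (sum(new_state) + 1) % 1000    # arbitrary deterministic rule
    return next_token, new_state

def generate(prompt_ids, max_new_tokens):
    """Prefill with S=1 steps, then decode; expects a non-empty prompt."""
    state, next_token = [], None
    # Prefill: feed the prompt one token at a time through the same step.
    for tok in prompt_ids:
        next_token, state = toy_decode_step(tok, state)
    # Decode: emit the prediction, then feed it back in as the next input.
    out = []
    for _ in range(max_new_tokens):
        out.append(next_token)
        next_token, state = toy_decode_step(next_token, state)
    return out, len(state)   # generated tokens and total number of step calls

tokens, steps = generate([1, 2, 3], 4)
```

With a static [1,1] input there is one step per prompt token plus one per generated token, so prompt processing cost grows linearly with prompt length at decode-step speed.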

License

Apache-2.0 (upstream Qwen license). Conversion + gather_qmm kernel: community.
