Qwen3.6-35B-A3B: Core AI (gather_qmm kernel, 2.1× faster)
Apple Core AI (.aimodel) conversion of Qwen/Qwen3.6-35B-A3B
(text decoder): Qwen3.5's hybrid GatedDeltaNet + gated-attention body with a 256-expert top-8
sparse MoE (+ shared expert). 35B total / ~3B active per token.
Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo
(full card: zoo/qwen3.6.md).
The gather_qmm kernel: 30.9 → 64.9 tok/s (2.1×)
Apple's GatherMM composite gathers the routed experts, then runs a dense matmul that reads all
256 experts' weights on every token, so decode is over-read-bound at 30.9 tok/s. This bundle uses a
custom coreai_torch.TorchMetalKernel that takes the routed indices as a kernel input and reads only
the 8 routed experts' weight slabs (8 of 256, about 3% of the expert weights). Decode therefore runs
at active-parameter (~3B) bandwidth: 64.9 tok/s, a 2.1× speedup.
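The difference in memory traffic can be illustrated with a small NumPy sketch. This is not the Metal kernel or the GatherMM composite; it is a toy model under stated assumptions (tiny layer sizes, float32 weights, random gate values, and the hypothetical function names `moe_dense_read` and `moe_gather_read`) that only shows why indexing the routed slabs first touches 8/256 of the expert weights while producing the same output.

```python
import numpy as np

NUM_EXPERTS, TOP_K = 256, 8   # expert count and top-k from the card
D_IN, D_OUT = 64, 32          # toy sizes, chosen only for illustration

rng = np.random.default_rng(0)
# One weight slab per expert: shape (experts, D_OUT, D_IN).
weights = rng.standard_normal((NUM_EXPERTS, D_OUT, D_IN)).astype(np.float32)
x = rng.standard_normal(D_IN).astype(np.float32)

# Router output for one token: routed expert indices and normalized gates.
routed = rng.choice(NUM_EXPERTS, size=TOP_K, replace=False)
gates = rng.random(TOP_K).astype(np.float32)
gates /= gates.sum()


def moe_dense_read(weights, x, routed, gates):
    """Multiply against every expert's slab, then keep the routed outputs."""
    all_out = weights @ x                       # (256, D_OUT): reads all slabs
    return gates @ all_out[routed], weights.nbytes


def moe_gather_read(weights, x, routed, gates):
    """Index the routed slabs first, so only 8 of 256 slabs are read."""
    slabs = weights[routed]                     # (8, D_OUT, D_IN)
    return gates @ (slabs @ x), slabs.nbytes


y_dense, bytes_dense = moe_dense_read(weights, x, routed, gates)
y_gather, bytes_gather = moe_gather_read(weights, x, routed, gates)

print("same output:", np.allclose(y_dense, y_gather, atol=1e-4))
print("fraction of expert bytes read:", bytes_gather / bytes_dense)
```

In this toy the byte ratio is exactly 8/256. The measured end-to-end gain on the card (2.1×) is smaller than the ratio of expert bytes read, which is expected since a decode step also does work outside the routed expert matmuls.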
Quality is clean and unchanged. The kernel reads the sym8 scheme, the same
symmetric-linear int8 (per-K-block-32) recipe that the standard int8 bundle uses, via a bit-exact
gather: 0 introduced flips out of 18 vs fp16 (the shipped GatherMM int8 was 14/16 vs the bf16 oracle;
this matches it). So this is a pure speed win at the same quality.
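For readers unfamiliar with the term, a symmetric-linear int8 scheme with one scale per block of 32 values along K can be sketched as below. This is an illustrative reconstruction, not the bundle's actual quantizer: the [-127, 127] clipping range, the float32 scale dtype, and the function names `quantize_sym8` and `dequantize_sym8` are assumptions. The last lines also show what a "bit-exact gather" means in this toy: gathering int8 rows and their scales by index gives exactly the values you would get by dequantizing everything and indexing afterwards.

```python
import numpy as np

BLOCK = 32  # per-K-block size named on the card


def quantize_sym8(w, block=BLOCK):
    """Symmetric int8: one scale per `block` consecutive K-values, no zero point."""
    out_dim, k = w.shape
    assert k % block == 0, "K must be a multiple of the block size"
    blocks = w.reshape(out_dim, k // block, block)
    max_abs = np.abs(blocks).max(axis=-1, keepdims=True)
    # Assumed range [-127, 127]; all-zero blocks get a dummy scale of 1.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(blocks / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_sym8(q, scale):
    """Inverse mapping: int8 codes times their block scale, flattened back to K."""
    out_dim, n_blocks, block = q.shape
    return (q.astype(np.float32) * scale).reshape(out_dim, n_blocks * block)


rng = np.random.default_rng(0)
w = rng.standard_normal((16, 128)).astype(np.float32)
q, scale = quantize_sym8(w)
w_hat = dequantize_sym8(q, scale)
max_err = np.abs(w - w_hat).max()
print("max abs error:", max_err, "largest half-step:", scale.max() / 2)

# Gathering quantized rows by index is exact: same codes, same scales.
rows = np.array([3, 7, 11])
gathered = dequantize_sym8(q[rows], scale[rows])
print("gather is bit-exact:", np.array_equal(gathered, w_hat[rows]))
```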
| bundle | size | decode tok/s | quality |
|---|---|---|---|
| gpu-pipelined/qwen3_6_35b_a3b_decode_sym8_gather/ | 35 GB | 64.9 | clean (0 flips/18 vs fp16) |
Mac-only (35 GB int8 is far past the iPhone limit; this is the 64/128 GB-Mac flagship).
Run
COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/qwen3_6_35b_a3b_decode_sym8_gather -p 128 -g 256 -n 3
The decode graph's input_ids is static [1,1]; prefill runs as S=1 pipelined steps. Convert your
own with conversion/export_qwen3_6_moe_metal_decode_pipelined.py.
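Because the decode graph accepts a single token per call, prefill amounts to feeding the prompt one token at a time through the same step that generation uses. The control flow can be sketched as follows. Everything here is hypothetical: `toy_decode_step` is a stand-in with an arbitrary next-token rule, the list-based `State` replaces the real KV cache and recurrent state, and the sketch runs the steps sequentially rather than modeling any pipelining.

```python
from typing import Callable, List, Tuple

State = List[int]  # stand-in for the KV cache / recurrent state
StepFn = Callable[[int, State], Tuple[int, State]]


def toy_decode_step(token_id: int, state: State) -> Tuple[int, State]:
    """Hypothetical [1,1] decode call: consume one token, return a next token."""
    new_state = state + [token_id]
    next_token = (sum(new_state) + 1) % 100  # arbitrary deterministic rule
    return next_token, new_state


def generate(prompt: List[int], max_new_tokens: int,
             step: StepFn = toy_decode_step) -> Tuple[List[int], State]:
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    state: State = []
    next_token = 0
    # Prefill: a prompt of length P costs P single-token (S=1) steps.
    for tok in prompt:
        next_token, state = step(tok, state)
    # Decode: same step function, now fed its own previous output.
    out: List[int] = []
    for _ in range(max_new_tokens):
        out.append(next_token)
        next_token, state = step(next_token, state)
    return out, state


tokens, final_state = generate([1, 2, 3], max_new_tokens=2)
print(tokens, len(final_state))
```

One consequence of this design is that prefill cost grows with prompt length at decode-step granularity, so prompt throughput and generation throughput are produced by the same graph.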
License
Apache-2.0 (upstream Qwen license). Conversion + gather_qmm kernel: community.
Model tree for mlboydaisuke/Qwen3.6-35B-A3B-CoreAI
Base model
Qwen/Qwen3.6-35B-A3B