Qwen3.6-35B-A3B — Core AI (`gather_qmm` kernel, 2.1× faster)

Apple Core AI (.aimodel) conversion of Qwen/Qwen3.6-35B-A3B (text decoder): Qwen3.5's hybrid GatedDeltaNet + gated-attention body with a 256-expert top-8 sparse MoE (+ shared expert). 35B total / ~3B active per token.

Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo (full card: zoo/qwen3.6.md).

The `gather_qmm` kernel — 30.9 → 64.9 tok/s (2.1×)

Apple's GatherMM composite gathers the routed experts and then runs a dense matmul that reads all 256 experts' weights on every token, so decode is bound by that over-read at 30.9 tok/s. This bundle instead uses a custom `coreai_torch.TorchMetalKernel` that takes the routed indices as a kernel input and reads only the 8 routed experts' weight slabs (8 of 256). Decode then runs at active-parameter (~3B) bandwidth: 64.9 tok/s, a 2.1× speedup.
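
The difference between the two read patterns can be illustrated with a small pure-Python sketch. This is not the Metal kernel and not Apple's GatherMM implementation; the function names, the toy sizes (16 experts, top-2) and the `slabs_read` counter are all assumptions made for illustration. It only shows that indexing the weights by the routed ids gives the same result while touching far fewer weight slabs.

```python
import random

def matvec(w, x):
    """Multiply an N x K weight slab (list of rows) by a length-K vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def moe_read_all(x, experts, routed, gates):
    """Toy model of the over-read: compute with every expert, keep the routed ones."""
    outs = [matvec(w, x) for w in experts]   # touches all E weight slabs
    slabs_read = len(experts)
    y = [0.0] * len(outs[0])
    for e, g in zip(routed, gates):
        y = [yi + g * oi for yi, oi in zip(y, outs[e])]
    return y, slabs_read

def moe_read_routed(x, experts, routed, gates):
    """Toy model of the gather: index the weights by the routed expert ids only."""
    y = [0.0] * len(experts[0])
    slabs_read = 0
    for e, g in zip(routed, gates):
        out = matvec(experts[e], x)          # touches one slab per routed expert
        slabs_read += 1
        y = [yi + g * oi for yi, oi in zip(y, out)]
    return y, slabs_read

# Hypothetical toy sizes: 16 experts, 3x4 slabs, top-2 routing.
rng = random.Random(0)
E, N, K = 16, 3, 4
experts = [[[rng.uniform(-1, 1) for _ in range(K)] for _ in range(N)] for _ in range(E)]
x = [rng.uniform(-1, 1) for _ in range(K)]
routed, gates = [3, 11], [0.75, 0.25]

y_all, reads_all = moe_read_all(x, experts, routed, gates)
y_routed, reads_routed = moe_read_routed(x, experts, routed, gates)
print(reads_all, reads_routed, y_all == y_routed)
```

At this model's scale the same ratio is 8 of 256 slabs, i.e. 1/32 of the routed-expert weights read per token.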

Quality is clean and unchanged. The kernel reads the sym8 scheme, the same symmetric-linear int8 (per-K-block-32) recipe that the standard int8 bundle uses, through a bit-exact gather: 0 introduced flips out of 18 vs fp16 (the shipped GatherMM int8 scored 14/16 vs the bf16 oracle; this matches it). This is therefore a pure speed win at the same quality.
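
As a rough illustration of what a symmetric-linear int8 scheme with one scale per block of 32 values looks like, here is a generic sketch. It assumes that "per-K-block-32" means one scale per 32 consecutive elements along the reduction (K) dimension; the rounding and clamping details, and the names `quantize_sym8` and `dequantize_sym8`, are assumptions and may differ from the recipe the bundle actually uses.

```python
import random

def quantize_sym8(row, block=32):
    """Symmetric int8: per block, scale = max|v| / 127 and q = round(v / scale)."""
    q, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0   # avoid dividing by zero
        scales.append(scale)
        q.extend(max(-127, min(127, round(v / scale))) for v in chunk)
    return q, scales

def dequantize_sym8(q, scales, block=32):
    """Recover approximate values; there is no zero-point in a symmetric scheme."""
    return [qi * scales[i // block] for i, qi in enumerate(q)]

rng = random.Random(0)
row = [rng.uniform(-2, 2) for _ in range(64)]       # two blocks of 32 along K
q, scales = quantize_sym8(row)
restored = dequantize_sym8(q, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(scales), max_err)
```

Because the gather only selects which experts' quantized values and scales are read, it does not change any of the numbers that enter the matmul.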

| bundle | size | decode tok/s | quality |
|---|---|---|---|
| `gpu-pipelined/qwen3_6_35b_a3b_decode_sym8_gather/` | 35 GB | 64.9 | clean (0 flips/18 vs fp16) ✅ |

Mac-only (35 GB int8 is far past the iPhone limit; this is the 64/128 GB-Mac flagship).

Run

COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/qwen3_6_35b_a3b_decode_sym8_gather -p 128 -g 256 -n 3

The decode graph's input_ids is static [1,1]; prefill runs as S=1 pipelined steps. Convert your own with conversion/export_qwen3_6_moe_metal_decode_pipelined.py.
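
With a static [1,1] input, prefill amounts to feeding the prompt one token per step and carrying the state forward. The loop below is a minimal sketch of that control flow only: `step`, `toy_step` and the list-based state are hypothetical stand-ins, not the bundle's runtime API, and the pipelining of steps is not modeled.

```python
from typing import Callable, List, Tuple

State = List[int]  # stand-in for the KV cache / recurrent state
StepFn = Callable[[int, State], Tuple[List[float], State]]

def prefill_then_decode(step: StepFn, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Run prefill as single-token steps, then decode greedily."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    state: State = []
    logits: List[float] = []
    for tok in prompt:                       # prefill: one S=1 step per prompt token
        logits, state = step(tok, state)
    generated = []
    for _ in range(max_new_tokens):          # decode: same S=1 step, fed its own output
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        generated.append(next_tok)
        logits, state = step(next_tok, state)
    return generated

def toy_step(token: int, state: State) -> Tuple[List[float], State]:
    """Hypothetical model: predicts (sum of tokens seen so far) mod 8."""
    new_state = state + [token]
    target = sum(new_state) % 8
    return [1.0 if i == target else 0.0 for i in range(8)], new_state

tokens = prefill_then_decode(toy_step, [1, 2, 3], max_new_tokens=3)
print(tokens)
```

The practical consequence is that prompt processing costs one graph invocation per prompt token, the same as decode.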

License

Apache-2.0 (upstream Qwen license). Conversion + gather_qmm kernel: community.

