Qwen3.6-27B — Core AI (Apple) bundle

The dense Mac-class companion to Qwen3.6-35B-A3B-CoreAI, converted for Apple's Core AI runtime (iOS/macOS 26+ successor to Core ML). Source: Qwen/Qwen3.6-27B (text decoder).

Where the 35B-A3B is a sparse MoE, the 27B is the same Qwen3.5 hybrid decoder run dense — no experts, no router, just the proven token mixers at scale. 64 layers on a 3:1 interleave of GatedDeltaNet linear-attention mixers and gated full attention:

  • full attention: head_dim 256, GQA 24 query / 4 KV heads, partial mRoPE θ=1e7, swish output gate;
  • linear (GatedDeltaNet): 48 value heads over 16 key heads (GVA — each k/q head shared across three value heads, vs the 35B's two);
  • every FFN a dense MLP(17408) (no MoE); untied 248320-vocab lm_head.
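
The layer schedule and head grouping implied by these numbers can be worked out in a few lines. This is an illustrative sketch, not code from the conversion recipe: the constant names are hypothetical, and placing the full-attention layer in the last slot of each group of four is an assumption (the text only states the 3:1 ratio).

```python
# Illustrative only: derive layer counts and head grouping from the figures
# quoted in the list above. Names here are hypothetical, not the overlay's.
NUM_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4  # 3 linear mixers : 1 full-attention layer

# Assumption: the full-attention layer is the last one in each group of four.
layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

num_full = layer_types.count("full_attention")
num_linear = layer_types.count("linear_attention")

# GQA in the full-attention layers: query heads sharing each KV head.
q_per_kv = 24 // 4
# GVA in the GatedDeltaNet layers: value heads sharing each key head.
v_per_k = 48 // 16
# Width of the concatenated query projection output (heads x head_dim).
full_attn_q_width = 24 * 256

print(layer_types[:8])
print(num_linear, num_full, q_per_kv, v_per_k, full_attn_q_width)
```

Under these assumptions the stack has 48 GatedDeltaNet layers and 16 full-attention layers, with 6 query heads per KV head and 3 value heads per key head.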

27B parameters, all dense, so the entire model is read per token. Unlike the 35B-A3B (≈3B active), there is no sparsity to hide behind: this is a true 27B-class decode, with the quality of a large dense model and the decode speed that reading every weight per token implies on a Mac's memory bandwidth.

Bundle

gpu-pipelined/qwen3_6_27b_decode_int8hu_block32_sym/ — a ready-to-run Core AI LanguageBundle (.aimodel + metadata.json + tokenizer), 28 GB, decode-only loop-free for Apple's pipelined GPU engine. int8 linear per-block-32 weights + an absmax int8 untied head (int8hu --head-sym).
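
For readers unfamiliar with the scheme, symmetric absmax quantization over blocks of 32 weights can be sketched as follows. This is a minimal pure-Python illustration of the general technique, not the converter's implementation: the function names are hypothetical, and the rounding mode, the [-127, 127] clamp, and the scale storage format are assumptions.

```python
from typing import List, Tuple

BLOCK = 32  # weights per quantization block, as in "per-block-32"


def quantize_blockwise_sym(
    weights: List[float], block: int = BLOCK
) -> Tuple[List[int], List[float]]:
    """Symmetric absmax int8: one scale per block, no zero point."""
    if len(weights) % block != 0:
        raise ValueError("length must be a multiple of the block size")
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        absmax = max(abs(w) for w in chunk)
        # Map the largest magnitude in the block to 127; guard all-zero blocks.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(w / scale))) for w in chunk)
    return q, scales


def dequantize_blockwise(
    q: List[int], scales: List[float], block: int = BLOCK
) -> List[float]:
    """Reconstruct approximate weights from int8 codes and per-block scales."""
    return [q[i] * scales[i // block] for i in range(len(q))]


# Toy weights in [-1, 1]; two blocks of 32.
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
q, scales = quantize_blockwise_sym(weights)
restored = dequantize_blockwise(q, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(scales), max_err)
```

The reconstruction error of each weight is bounded by half of its block's scale, which is why smaller blocks track outliers more closely at the cost of storing more scales.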

Measured (macOS 27 beta, M4 Max 128 GB, llm-benchmark, COREAI_CHUNK_THRESHOLD=1)

metric     value
decode     15.9 tok/s
prefill    15.8 tok/s (pipelined S=1)
bundle     28 GB
numerics   int8 == full precision at every confident position (teacher-forced vs bf16 HF oracle)

Numerics in full — 27B fp32 would need ~111 GB RAM, so the oracle is the checkpoint's native bf16; the gate is teacher-forced single-step argmax under an oracle-margin≥0.1 rule. The result is cleaner than the 35B-A3B's: int8 adds zero confident-margin flips over full precision. Both int8hu and an fp16 control score 15/16 vs the bf16 oracle and fail the same position (margin 0.50), where fp16 flips byte-identically to int8 — a bf16-oracle-resolution artifact, not an int8 defect. The only int8-vs-fp16 difference anywhere is one sub-0.1-margin tie.
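
The gate described above can be sketched as follows. This is a hedged illustration, not the zoo's evaluation script: the function names are hypothetical, the margin is assumed to be the gap between the oracle's top-1 and top-2 logits (the source does not say whether it is measured on logits or probabilities), and the logits are toy values.

```python
from typing import Dict, List, Sequence


def argmax(xs: Sequence[float]) -> int:
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)


def oracle_margin(logits: Sequence[float]) -> float:
    """Gap between the oracle's best and second-best score (assumed definition)."""
    top = sorted(logits, reverse=True)
    return top[0] - top[1]


def gate(
    oracle_logits: List[Sequence[float]],
    candidate_logits: List[Sequence[float]],
    min_margin: float = 0.1,
) -> Dict[str, object]:
    """Compare single-step argmax per position, skipping near-ties in the oracle.

    Both models are assumed to have been fed the same (teacher-forced) prefix
    at every position, so each position is an independent one-step comparison.
    """
    confident = 0
    agree = 0
    flips: List[int] = []
    for pos, (o, c) in enumerate(zip(oracle_logits, candidate_logits)):
        if oracle_margin(o) < min_margin:
            continue  # oracle itself is not confident here; do not score it
        confident += 1
        if argmax(o) == argmax(c):
            agree += 1
        else:
            flips.append(pos)
    return {"confident": confident, "agree": agree, "flips": flips}


# Toy example: position 1 is a near-tie in the oracle and is excluded.
oracle = [[2.0, 1.0, 0.0], [1.05, 1.0, 0.0], [0.5, 1.0, 0.0]]
candidate = [[1.9, 1.1, 0.0], [0.9, 1.0, 0.0], [1.2, 1.0, 0.0]]
result = gate(oracle, candidate)
print(result)
```

On the toy data, two positions count as confident and one of them flips; the sub-threshold position does not affect the score even though the candidate disagrees there.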

Speed is bandwidth-bound, as a dense 27B at int8 must be: ~28 GB/token → 15.9 tok/s is ~87 % of the M4 Max memory-bandwidth ceiling. (The 35B-A3B decodes faster than this despite more parameters, because only ~3B are active per token — that is the MoE's whole point.)
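
The bandwidth arithmetic can be checked with a short calculation. Two things here are assumptions rather than statements from the source: the peak figure of 546 GB/s (the commonly quoted unified-memory bandwidth of the top M4 Max configuration), and whether "28 GB" means decimal gigabytes or GiB. The sketch computes both readings; the quoted ~87 % is consistent with the GiB reading.

```python
# Back-of-envelope check of decode throughput against memory bandwidth.
# Assumption: 546 GB/s peak bandwidth for this M4 Max configuration.
PEAK_BW_GB_S = 546.0
DECODE_TOK_S = 15.9   # measured decode rate from the table above
BUNDLE_SIZE = 28.0    # quoted bundle size; unit (GB vs GiB) is ambiguous

GIB_IN_GB = 2**30 / 1e9  # one GiB expressed in decimal gigabytes


def utilisation(gb_per_token: float, tok_s: float, peak_gb_s: float) -> float:
    """Fraction of peak bandwidth used if every token reads gb_per_token."""
    return gb_per_token * tok_s / peak_gb_s


# Reading 1: 28 GB means 28e9 bytes read per token.
util_decimal = utilisation(BUNDLE_SIZE, DECODE_TOK_S, PEAK_BW_GB_S)

# Reading 2: 28 GB means 28 GiB read per token.
util_gib = utilisation(BUNDLE_SIZE * GIB_IN_GB, DECODE_TOK_S, PEAK_BW_GB_S)

# Upper bound on decode rate if weights were streamed at full peak bandwidth.
ceiling_tok_s = PEAK_BW_GB_S / (BUNDLE_SIZE * GIB_IN_GB)

print(f"decimal GB: {util_decimal:.1%}, GiB: {util_gib:.1%}, "
      f"ceiling: {ceiling_tok_s:.1f} tok/s")
```

Either way, the measured rate sits within roughly 20 % of what streaming the full weights at peak bandwidth would allow, which supports the bandwidth-bound reading.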

int4 is a size/speed option, not the quality ship. A linear int4 bundle (int4lin, ~14 GB, ~2× decode) also gates 15/16 but pays a real cost: it flips a high-confidence position that fp16 and int8 both get right, and its per-position cosine is systematically lower. A mixed-precision middle ground (MLP int4 / attention·GatedDeltaNet·head int8) was tested and rejected: keeping the mixers at int8 repairs int4's flip (confirming that the attention/GDN path, not the FFN, is the 4-bit-sensitive part), but the int4 MLP then introduces its own confident flip that edge-layer int8 cannot fix. So there is no quality-preserving speedup between int8 (clean, 15.9 tok/s) and int4 (borderline, ~30 tok/s); nothing useful sits in between.

Mac-only: at 28 GB this is a 64/128 GB-Mac model (far past the iPhone memory limit).

How to run

This is a Core AI bundle for Apple's pipelined LLM engine (llm-benchmark / llm-runner from apple/coreai-models, plus the community pipelined extra-states patch). The conversion recipe and the full write-up live in the community zoo: github.com/john-rocky/coreai-model-zoo (zoo/qwen3.6-27b.md). The decoder reuses the shared qwen3_5.py overlay directly — no MoE files.

COREAI_CHUNK_THRESHOLD=1 llm-benchmark \
    --model gpu-pipelined/qwen3_6_27b_decode_int8hu_block32_sym -p 64 -g 128 -n 3