Qwen3.5-2B β€” Apple Core AI (.aimodel)

Qwen3.5-2B (GDN hybrid: 18 linear-attention + 6 full-attention layers) converted to Apple Core AI for iOS 27 / macOS 27 (beta), riding Apple's coreai-pipelined GPU engine via the decode-only loop-free export β€” async encode, on-GPU argmax sampling, on-device KV growth, zero custom kernels.

surface (ship bundle) prefill (S=1) decode
M4 Max (release llm-benchmark, p=128 g=256) 161.2 160.8 tok/s
iPhone 17 Pro (one-shot runner, 2 runs Γ— 2 trials) 29.7–30.3 28–30 tok/s β€” β‰₯ the CoreML qwen3.5-2B port (~27)

Numerics: 16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded decode step (the zoo ship gate), greedy rollouts token-identical to the fp16-head bundle, and the iPhone sequences are 24/24 token-identical to the Mac GPU on both fixed prompts.

Bundles

  • gpu-pipelined/qwen3_5_2b_decode_int8hu_perchan_sym/ β€” the ship config (2.9 GB): transformer int8 linear per-block-32 + untied lm_head in per-block-32 absmax int8 (int8hu --head-sym). The head trick is what unlocks the speed: the 248 K-vocab fp16 head was ~1.0 GB of the ~2.4 GB per-token read. Crucial detail: the head must be quantized with plain absmax symmetric β€” the default symmetric_with_clipping clips outlier head rows and flips oracle top-1s (full story in the zoo's pipelined-engine notes). Naming note (2026-06-11): the directory says _perchan_sym, but its head is per-block-32 β€” the export script of the day parsed the granularity flag without applying it (since fixed); byte-identical bundle sizes confirmed it. All numbers were measured on exactly these bytes and stand. Genuinely per-channel (axis-0) int8 weights are broken on the current beta GPU delegate (garbage logits), so per-block-32 + symmetric IS the correct ship shape. The dir name is kept to avoid breaking download paths.
  • gpu-pipelined/qwen3_5_2b_decode_int8lin/ β€” fp16-head variant (2.4 GB): 127 tok/s Mac / 19–21 iPhone. Smaller; keep if you want the head at full precision.

Both are full LanguageBundles (metadata.json + tokenizer/ + .aimodel), input_ids STATIC [1,1] (loop-free single-step GDN), position_ids + KV seq dynamic β†’ EngineFactory classifies them dynamic β†’ pipelined engine.

Run (macOS)

Needs the engine patch stack from the zoo (apps/coreai-shared-product.patch β†’ apps/coreai-pipelined-extra-states.patch; Apple's repo is issues-only, so capabilities ship as patches), then:

COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8hu_perchan_sym -p 128 -g 256 -n 3
  • COREAI_CHUNK_THRESHOLD=1 before engine creation β€” prefill runs as pipelined S=1 steps (prompt tok/s β‰ˆ decode tok/s).
  • Never call engine.warmup() β€” it warms query length 256 and the static [1,1] graph rejects it. A 1-token generate after load is the warmup (llm-runner needs --warmup exact --warmup-length 1).
  • Benchmark Release builds only (a Debug engine measures ~3Γ— slow).

iPhone

The ship bundle decodes 28–30 tok/s on iPhone 17 Pro with exact numerics. Know before you ship:

  • Requires the com.apple.developer.kernel.increased-memory-limit entitlement β€” cold GPU specialization dies with std::bad_alloc at the default jetsam limit without it.
  • Cold specialization 22.3 s (then ~5.6 s warm loads, content-keyed cache). Keep β‰₯4 GB free disk: the spec cache is ~3 GB, and a failed cold spec leaves partial caches that make later attempts fail with NSPOSIXErrorDomain code=2 at engine create β€” uninstall the app to reclaim.
  • For smaller phones / tighter RAM, the 0.8B pipelined bundle does 50+ tok/s in 1 GB.

Reproduce

Conversion script (self-contained) + method page in the zoo: conversion/export_qwen3_5_decode_pipelined.py (int8hu --head-sym --hf-id Qwen/Qwen3.5-2B) Β· knowledge/pipelined-engine.md

Downloads last month
17
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for mlboydaisuke/qwen3.5-2B-CoreAI

Finetuned
Qwen/Qwen3.5-2B
Finetuned
(199)
this model