Qwen3-VL 2B β€” Core AI (.aimodel)

The first vision-language model on Apple's Core AI framework (iOS 27 / macOS 27): Qwen/Qwen3-VL-2B-Instruct converted to .aimodel, running image+text β†’ text fully on the GPU via Apple's coreai-pipelined engine β€” zero custom kernels.

Part of the CoreAI-Model-Zoo; full card with the conversion design: zoo/qwen3-vl.md.

CoreAIChat Qwen3-VL demo

Measured

platform prefill tok/s decode tok/s numerics
M4 Max (macOS 27 beta) 191.0 187.6 full multimodal oracle gates vs fp32-HF PASS
iPhone 17 Pro (iOS 27 beta, settled) 33.9 33.3 text + image prompts 24/24 Γ— 8 runs, token-identical to Mac (~92% of the naive BW ceiling)

Vision encode: ~60-80 ms/image (Mac GPU). Device cold load 12.3 s (on-device GPU specialization, no AOT), warm 0.6–5 s. The 2.3 GB decoder wants the increased-memory entitlement on iPhone.

Files

path what size
gpu-pipelined/qwen3_vl_2b_instruct_decode_int8hu_s1/ text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) 2.3 GB
gpu-pipelined/qwen3_vl_2b_instruct_vision/ fixed-grid vision encoder (448Γ—448 β†’ 196 tokens + DeepStack), fp16 0.77 GB
gpu-pipelined/qwen3_vl_2b_instruct_decode_int8lin_s1/ decoder alt: tied fp16 head (slower, smaller-RAM-spike option) 2.0 GB

How it works (short version)

The text-only pipelined engine carries the VLM through an id-space trick β€” no engine code changes beyond the published static-inputs patch:

  • the vision encoder runs once per image; its embeddings ride 4 static graph inputs (rewritable owned MTLBuffers, ~3 MB),
  • the prompt's <|image_pad|> ids become extension ids vocab + slot; the graph selects text-table vs image-embed rows per token and applies the three DeepStack adds the same way,
  • interleaved M-RoPE is derived in-graph from (ids, position) alone β€” image tokens self-locate, text tokens use a host-set shift; with zero embeds the same bundle is a plain Qwen3 text LLM.

Numerics are gated the zoo way: fp32-HF oracle β†’ torch ladder (position formula exact vs get_rope_index, 28/28 layers) β†’ .aimodel GPU gates β†’ engine ≑ python 24/24 β†’ device 24/24.

Run it

The zoo's apps/CoreAIChat (iOS) has a Qwen3-VL mode with a photo picker and downloads this repo in-app. For the run contract (S=1 prefill, COREAI_CHUNK_THRESHOLD=1, never engine.warmup()), see knowledge/pipelined-engine.md.

Conversion is reproducible from the zoo: conversion/export_qwen3_vl_pipelined.py int8hu.

License

Apache-2.0 (inherited from Qwen3-VL-2B-Instruct). Conversion code BSD-3-Clause (zoo repo).

Downloads last month
2
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for mlboydaisuke/Qwen3-VL-2B-CoreAI

Finetuned
(222)
this model