Qwen3-VL 4B: Core AI (.aimodel)
Qwen/Qwen3-VL-4B-Instruct converted to Apple Core AI (.aimodel, iOS 27 /
macOS 27): image+text → text fully on the GPU via Apple's coreai-pipelined
engine, zero custom kernels. This is the 4B sibling of the
Qwen3-VL 2B port: it
drops onto the same recipe with zero code changes (the model overlay and
exporter are fully config-driven).
Part of the CoreAI-Model-Zoo; full card with the conversion design: zoo/qwen3-vl.md.
Measured
| platform | prefill tok/s | decode tok/s | numerics |
|---|---|---|---|
| M4 Max (macOS 27 beta) | 93.3 | 92.2 | torch ladder vs fp32-HF (positions exact, vision cos 1.000, 36/36 layers cos 1.000, decode 16/16) + engine ≡ python 24/24 on the 211-tok multimodal prompt |
| iPhone 17 Pro (iOS 27 beta) | 10–15 | 14.0 cool → ~8.5 sustained | nat 24/24 + multimodal oracle 24/24 × 3 runs, token-identical to Mac |
Decode is bandwidth-bound: the 4.7 GB int8hu decoder reads ~4.7 GB/token, so it runs at roughly half the 2B's rate. On iPhone the read is heavy enough to thermally throttle: ~14 tok/s from a cool start, settling to ~8.5 under sustained decode. Device cold load takes 52.7 s (on-device GPU specialization, no AOT) and warm load 8–9 s; the app needs the increased-memory entitlement (4.7 GB class).
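The bandwidth-bound claim can be sanity-checked with one multiplication: if each decoded token streams roughly the whole 4.7 GB of weights, the effective read bandwidth is about decode tok/s times 4.7 GB. The sketch below uses only the figures from the table above; the helper name is illustrative, and the estimate ignores KV-cache and activation traffic.

```python
# Back-of-envelope check, assuming every decoded token reads the full
# 4.7 GB decoder once (KV-cache and activation traffic are ignored).
WEIGHTS_GB = 4.7

def effective_bandwidth_gbs(decode_tok_s: float, weights_gb: float = WEIGHTS_GB) -> float:
    """Approximate weight-read bandwidth in GB/s implied by a decode rate."""
    return decode_tok_s * weights_gb

# Decode rates taken from the "Measured" table.
measured = {
    "M4 Max": 92.2,
    "iPhone 17 Pro (cool)": 14.0,
    "iPhone 17 Pro (sustained)": 8.5,
}

for name, tok_s in measured.items():
    print(f"{name}: ~{effective_bandwidth_gbs(tok_s):.0f} GB/s")
```

Under this assumption the M4 Max figure works out to about 433 GB/s and the cool iPhone figure to about 66 GB/s, which is the scale at which memory bandwidth rather than compute would be expected to set the decode rate.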
Files
| path | what | size |
|---|---|---|
| gpu-pipelined/qwen3_vl_4b_instruct_decode_int8hu_s1/ | text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) | 4.7 GB |
| gpu-pipelined/qwen3_vl_4b_instruct_vision/ | fixed-grid vision encoder (448×448 → 196 tokens + DeepStack), fp16 | 0.79 GB |
How it works (short version)
The text-only pipelined engine carries the VLM through an id-space trick, with no engine code changes beyond the published static-inputs patch:
- the vision encoder runs once per image; its embeddings ride 4 static graph inputs (rewritable owned MTLBuffers),
- the prompt's <|image_pad|> ids become extension ids vocab + slot; the graph selects text-table vs image-embed rows per token and applies the three DeepStack adds the same way,
- interleaved M-RoPE is derived in-graph from (ids, position) alone: image tokens self-locate, text tokens use a host-set shift; with zero embeds the same bundle is a plain Qwen3 text LLM.
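The id-space trick in the list above can be sketched in a few lines. This is a toy illustration with plain Python lists, not the zoo's exporter or graph code: the function names, the toy vocabulary size, and the assumption that slots are assigned in prompt order are all hypothetical.

```python
# Toy sketch of the id-space trick; names and sizes are illustrative only.

def to_extension_ids(prompt_ids, image_pad_id, vocab_size):
    """Replace each image-pad id with vocab_size + slot (slots in prompt order)."""
    ext_ids, slot = [], 0
    for tok in prompt_ids:
        if tok == image_pad_id:
            ext_ids.append(vocab_size + slot)
            slot += 1
        else:
            ext_ids.append(tok)
    return ext_ids

def select_rows(ext_ids, text_table, image_embeds, vocab_size):
    """Per token, pick a text-table row (id < vocab_size) or an image-embed row."""
    rows = []
    for tok in ext_ids:
        if tok >= vocab_size:
            rows.append(image_embeds[tok - vocab_size])
        else:
            rows.append(text_table[tok])
    return rows

VOCAB = 6        # toy vocabulary size (assumption)
IMAGE_PAD = 5    # toy id standing in for <|image_pad|>
text_table = [[float(i)] * 2 for i in range(VOCAB)]
image_embeds = [[100.0, 100.0], [101.0, 101.0]]  # stand-in for vision encoder output

prompt = [1, IMAGE_PAD, IMAGE_PAD, 3]
ext = to_extension_ids(prompt, IMAGE_PAD, VOCAB)
print(ext)
print(select_rows(ext, text_table, image_embeds, VOCAB))
```

In this sketch a text-only prompt never produces an id at or above the vocabulary size, so the image-embed rows are never read, which is consistent with the card's point that the same bundle can serve as a plain text LLM.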
Numerics are gated the zoo way: fp32-HF oracle → torch ladder (position
formula exact vs get_rope_index, 36/36 layers) → .aimodel GPU → engine ≡
python 24/24 → device 24/24.
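The two kinds of check named in this ladder, a per-layer cosine similarity and an exact token-for-token match, can be sketched as below. This is a generic illustration, not the zoo's test harness; the function names, the toy data, and the 0.999 threshold are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def layers_passing(ref_layers, test_layers, threshold=0.999):
    """Count layers whose output matches the reference above the threshold."""
    return sum(cosine(r, t) >= threshold for r, t in zip(ref_layers, test_layers))

def tokens_matching(ref_tokens, test_tokens):
    """Count positions where two decodes emit the same token id."""
    return sum(r == t for r, t in zip(ref_tokens, test_tokens))

# Toy stand-ins for per-layer activations from a reference and a port.
ref_layers = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
test_layers = [[1.0, 2.0, 3.001], [0.5, -1.0, 2.0]]
print(f"layers {layers_passing(ref_layers, test_layers)}/{len(ref_layers)}")

# Toy stand-ins for token ids decoded by two backends.
ref_tokens = [11, 42, 7, 7]
test_tokens = [11, 42, 7, 7]
print(f"tokens {tokens_matching(ref_tokens, test_tokens)}/{len(ref_tokens)}")
```

Read this way, a result such as 36/36 would mean every layer cleared the cosine bar, and 24/24 would mean all 24 compared tokens were identical between the two backends.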
Run it
See the zoo's apps/CoreAIChat (iOS) Qwen3-VL mode and the run contract
(S=1 prefill, COREAI_CHUNK_THRESHOLD=1, never engine.warmup()) in
knowledge/pipelined-engine.md.
Conversion is reproducible from the zoo:
conversion/export_qwen3_vl_pipelined.py int8hu --hf-id Qwen/Qwen3-VL-4B-Instruct.
License
Apache-2.0 (inherited from Qwen3-VL-4B-Instruct). Conversion code BSD-3-Clause (zoo repo).