Qwen3-VL 8B β Core AI (.aimodel)
Qwen/Qwen3-VL-8B-Instruct converted to Apple Core AI (.aimodel,
macOS 27): image+text β text fully on the GPU via Apple's coreai-pipelined
engine, zero custom kernels. The 8B sibling of the
Qwen3-VL 2B port β
same recipe, with one one-line loader change for its untied LM head.
Mac-only. The 8.7 GB int8hu decoder exceeds the iPhone increased-memory jetsam ceiling (~6.4 GB class). For on-device iPhone use, see the 4B or 2B ports.
Part of the CoreAI-Model-Zoo; full card with the conversion design: zoo/qwen3-vl.md.
Measured
| platform | prefill tok/s | decode tok/s | numerics |
|---|---|---|---|
| M4 Max (macOS 27 beta) | 54.4 | 54.3 | torch ladder vs fp32-HF incl. untied head + depth-27 ViT (vision cos 1.0001, 36/36 layers cos 1.000, decode 16/16) + engine β‘ python 24/24 on the 211-tok multimodal prompt |
Decode is bandwidth-bound: the 8.7 GB int8hu decoder reads ~8.7 GB/token. Vision encode runs once per image. Cold GPU specialization ~16.5 s, warm load a few seconds.
Files
| path | what | size |
|---|---|---|
gpu-pipelined/qwen3_vl_8b_instruct_decode_int8hu_s1/ |
text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) | 8.7 GB |
gpu-pipelined/qwen3_vl_8b_instruct_vision/ |
fixed-grid vision encoder (448Γ448 β 196 tokens + DeepStack), fp16 | 1.1 GB |
How it works (short version)
The text-only pipelined engine carries the VLM through an id-space trick β no engine code changes beyond the published static-inputs patch:
- the vision encoder runs once per image; its embeddings ride 4 static
graph inputs (rewritable owned
MTLBuffers), - the prompt's
<|image_pad|>ids become extension idsvocab + slot; the graph selects text-table vs image-embed rows per token and applies the three DeepStack adds the same way, - interleaved M-RoPE is derived in-graph from (ids, position) alone β image tokens self-locate, text tokens use a host-set shift; with zero embeds the same bundle is a plain Qwen3 text LLM.
The 8B differs from 2B/4B only in configuration: its LM head is untied
(a separate lm_head.weight, quantized int8 absmax like the body) and its
ViT is larger (depth 27, hidden 1152) β both absorbed by the config-driven
overlay. Numerics are gated the zoo way: fp32-HF oracle β torch ladder
(36/36 layers) β engine β‘ python 24/24.
Run it
Conversion is reproducible from the zoo:
conversion/export_qwen3_vl_pipelined.py int8hu --hf-id Qwen/Qwen3-VL-8B-Instruct.
For the run contract (S=1 prefill, COREAI_CHUNK_THRESHOLD=1), see
knowledge/pipelined-engine.md.
License
Apache-2.0 (inherited from Qwen3-VL-8B-Instruct). Conversion code BSD-3-Clause (zoo repo).
- Downloads last month
- 3
Model tree for mlboydaisuke/Qwen3-VL-8B-CoreAI
Base model
Qwen/Qwen3-VL-8B-Instruct