Qwen3-VL 8B — Core AI (`.aimodel`)

Qwen/Qwen3-VL-8B-Instruct converted to Apple Core AI (.aimodel, macOS 27): image+text → text fully on the GPU via Apple's coreai-pipelined engine, zero custom kernels. The 8B sibling of the Qwen3-VL 2B port — same recipe, with one one-line loader change for its untied LM head.

Mac-only. The 8.7 GB int8hu decoder exceeds the iPhone increased-memory jetsam ceiling (~6.4 GB class). For on-device iPhone use, see the 4B or 2B ports.

Part of the CoreAI-Model-Zoo; full card with the conversion design: zoo/qwen3-vl.md.

Measured

platform	prefill tok/s	decode tok/s	numerics
M4 Max (macOS 27 beta)	54.4	54.3	torch ladder vs fp32-HF incl. untied head + depth-27 ViT (vision cos 1.0001, 36/36 layers cos 1.000, decode 16/16) + engine ≡ python 24/24 on the 211-tok multimodal prompt

Decode is bandwidth-bound: the 8.7 GB int8hu decoder reads ~8.7 GB/token. Vision encode runs once per image. Cold GPU specialization ~16.5 s, warm load a few seconds.

Files

path	what	size
`gpu-pipelined/qwen3_vl_8b_instruct_decode_int8hu_s1/`	text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included)	8.7 GB
`gpu-pipelined/qwen3_vl_8b_instruct_vision/`	fixed-grid vision encoder (448×448 → 196 tokens + DeepStack), fp16	1.1 GB

How it works (short version)

The text-only pipelined engine carries the VLM through an id-space trick — no engine code changes beyond the published static-inputs patch:

the vision encoder runs once per image; its embeddings ride 4 static graph inputs (rewritable owned MTLBuffers),
the prompt's <|image_pad|> ids become extension ids vocab + slot; the graph selects text-table vs image-embed rows per token and applies the three DeepStack adds the same way,
interleaved M-RoPE is derived in-graph from (ids, position) alone — image tokens self-locate, text tokens use a host-set shift; with zero embeds the same bundle is a plain Qwen3 text LLM.

The 8B differs from 2B/4B only in configuration: its LM head is untied (a separate lm_head.weight, quantized int8 absmax like the body) and its ViT is larger (depth 27, hidden 1152) — both absorbed by the config-driven overlay. Numerics are gated the zoo way: fp32-HF oracle → torch ladder (36/36 layers) → engine ≡ python 24/24.

Run it

Conversion is reproducible from the zoo: conversion/export_qwen3_vl_pipelined.py int8hu --hf-id Qwen/Qwen3-VL-8B-Instruct. For the run contract (S=1 prefill, COREAI_CHUNK_THRESHOLD=1), see knowledge/pipelined-engine.md.

License

Apache-2.0 (inherited from Qwen3-VL-8B-Instruct). Conversion code BSD-3-Clause (zoo repo).

Downloads last month: 3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mlboydaisuke/Qwen3-VL-8B-CoreAI

Base model

Qwen/Qwen3-VL-8B-Instruct

Finetuned

(300)

this model