Holo2-4B — Core AI (on-device, iPhone) · GUI-grounding VLM

Hcompany/Holo2-4B converted to Apple Core AI for on-device inference, served by the CoreAIChat app.

Holo2 is H Company's computer-use / GUI-grounding vision-language model: given a screenshot and an instruction ("click the submit button"), it locates the target UI element and predicts the click coordinates. It is built on the Qwen3-VL-4B backbone, so it reuses the Core AI zoo's existing Qwen3-VL pipeline. It is the zoo's first GUI-grounding / computer-use model.

Contents (`gpu-pipelined/`)

holo2_4b_decode_int8lin_s1/: the decode bundle (static query length 1, int8 linear body quantized per block of 32). It runs on Apple's coreai-pipelined GPU engine and specializes on-device, so no ahead-of-time (AOT) compilation is needed. ~4.4 GB.
holo2_4b_vision/ — the fixed-grid vision encoder .aimodel (fp16): patches [784,1536] -> (image_embeds [196,2560], deepstack [3,196,2560]). Run once per image. ~0.8 GB.
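The vision encoder shapes above can be checked with a little arithmetic. The sketch below is illustrative only: the patch size (16), temporal patch size (2), 2x2 spatial merge, and the 448x448 input resolution are assumptions based on typical Qwen3-VL settings and are not stated in this card; only the widths 1536 and 2560 and the 3 deepstack levels come from the card.

```python
# Assumed Qwen3-VL-style vision settings (NOT stated in this card).
PATCH_SIZE = 16          # pixels per patch side (assumption)
TEMPORAL_PATCH_SIZE = 2  # frames stacked per patch (assumption)
CHANNELS = 3             # RGB (assumption)
SPATIAL_MERGE = 2        # 2x2 patches merged into one token (assumption)

# Values taken from the card.
HIDDEN = 2560            # embedding width of image_embeds / deepstack
DEEPSTACK_LEVELS = 3     # leading dimension of the deepstack output


def vision_shapes(height, width):
    """Return (patches, image_embeds, deepstack) shapes for a fixed-size input."""
    grid_h, grid_w = height // PATCH_SIZE, width // PATCH_SIZE
    num_patches = grid_h * grid_w
    patch_dim = CHANNELS * TEMPORAL_PATCH_SIZE * PATCH_SIZE * PATCH_SIZE
    num_tokens = num_patches // (SPATIAL_MERGE ** 2)
    return (
        (num_patches, patch_dim),
        (num_tokens, HIDDEN),
        (DEEPSTACK_LEVELS, num_tokens, HIDDEN),
    )


# A hypothetical 448x448 input reproduces the shapes listed in the card.
print(vision_shapes(448, 448))
```

Under these assumptions a 448x448 image yields a 28x28 patch grid (784 patches of width 1536), which merges to 14x14 = 196 tokens, consistent with the shapes listed for holo2_4b_vision/.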

Parity (vs fp32 HF oracle, Core AI GPU engine)

Vision: image-embeds cosine similarity 0.999983, deepstack cosine similarity 0.999989.
Decoder (int8lin): teacher-forced S=1 sweep passes 4/4; 16/16 decode steps are token-exact; HF-seeded decode matches. All checks PASS.
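For readers unfamiliar with these metrics, here is a minimal sketch of the two kinds of check reported above: cosine similarity between flattened reference and converted outputs, and a token-exact comparison of decoded ids. The function names and sample values are hypothetical; this is not the harness used to produce the numbers in this card.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def token_exact(ref_tokens, test_tokens):
    """True only if both decodes produced the same ids at every step."""
    return len(ref_tokens) == len(test_tokens) and all(
        r == t for r, t in zip(ref_tokens, test_tokens)
    )


# Hypothetical values standing in for fp32 reference vs. converted outputs.
ref = [0.5, -1.0, 2.0, 0.25]
out = [0.5001, -0.9998, 2.0003, 0.2499]
print(cosine(ref, out) > 0.9999)
print(token_exact([11, 42, 7], [11, 42, 7]))
```

Cosine similarity suits the fp16 vision outputs, where small numeric drift is expected, while the quantized decoder is held to the stricter token-exact criterion.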

Use

Install CoreAIChat, pick Holo2 4B, attach a screenshot, and ask where an element is or what to click. The model grounds the instruction to the image.
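If you want to act on a predicted click, the coordinates have to be mapped back to the original screenshot. The sketch below assumes the model reports coordinates normalized to a 0-1000 range, a convention common in this model family; this card does not state the output format, so treat the scale and the helper name `to_pixels` as hypothetical and check the base model's documentation.

```python
def to_pixels(x_norm, y_norm, width, height, scale=1000):
    """Map normalized (x, y) to pixel coordinates in a width x height screenshot.

    `scale` is the assumed upper bound of the normalized range (hypothetical).
    Results are clamped so they always fall inside the image.
    """
    x = round(x_norm / scale * width)
    y = round(y_norm / scale * height)
    return (min(max(x, 0), width - 1), min(max(y, 0), height - 1))


# Hypothetical prediction on a 1170x2532 screenshot.
print(to_pixels(500, 250, 1170, 2532))  # (585, 633)
```

Clamping keeps a prediction at the extreme edge of the range (for example 1000, 1000) inside the image bounds.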

License

Apache-2.0, inherited from the base model Hcompany/Holo2-4B. See LICENSE.

Model tree for mlboydaisuke/Holo2-4B-CoreAI

Base model: Qwen/Qwen3-VL-4B-Thinking
Finetuned: Hcompany/Holo2-4B