duvo-eye-1.5
duvo-eye-1.5 is a 35B-A3B (≈3B active) Vision-Language Model for single-step GUI element grounding: given a screenshot and a natural-language description of a target, it returns one click position {"x", "y"} in the [0, 1000] normalized range. It is the grounding component of a computer-use stack — a planner decides what to interact with, duvo-eye-1.5 resolves that description to where. It is not an agent: no planning, navigation, multi-step execution, or function calling, and it cannot abstain when the target is absent.
duvo-eye-1.5 is duvo-eye-1 (v1) further trained with GRPO reinforcement learning under a point-in-bbox reward, then merged back to full bf16 weights (v1 + the GRPO LoRA folded in). v1 was itself a LoRA SFT of Hcompany/Holo-3.1-35B-A3B on Duvo's private SynthUI corpus. The RL stage delivers a small, clean, across-the-board gain over v1 — higher on all three public grounding benchmarks we track, with no regressions and 0% parse failures. It refines grounding precision rather than adding capability; it is not a large jump and not state of the art, and we say so plainly below.
| Field | Value |
|---|---|
| Lineage | duvoai/duvo-eye-1 (GRPO RL) ← LoRA SFT of Hcompany/Holo-3.1-35B-A3B |
| Architecture | Qwen3_5MoeForConditionalGeneration (qwen3_5_moe) — 35B-A3B MoE, ~3B active params |
| Task | image-text-to-text (single-step GUI grounding) |
| Output | {"x": int, "y": int} normalized to [0, 1000]; rescale by the original screenshot size to get pixels |
| Weights | full merged model, bf16, ~66 GB |
| Decoding | greedy, enable_thinking=False, max_new_tokens=64 |
| Language | English (inherited from the base) |
| License | Apache 2.0 |
⚠️ Disable thinking for grounding — this is not a reasoning model
duvo-eye-1.5 is a direct grounder: it is meant to emit the coordinate immediately, with no chain-of-thought. It inherits the Holo-3.1 / Qwen3.5 chat template, whose thinking mode is ON by default, so you must call apply_chat_template(..., enable_thinking=False). That injects an empty <think></think> block and the model outputs the JSON point directly, exactly how it was trained and evaluated. Left in the default thinking mode, it falls back to base-style reasoning and will frequently not produce a parseable coordinate within a short token budget. Thinking does not improve grounding here; it must be turned off. See Usage.
Results
All numbers come from one fixed, validated harness with enable_thinking=False and deterministic (greedy) decoding. The harness is calibrated against v1's published result: it reproduces duvo-eye-1's published ScreenSpot-Pro 72.9 (measured here as 72.99% on the full 1,581 samples), so the v1 → v1.5 deltas below are a trustworthy, apples-to-apples A/B. Figures are as of 2026-06.
| Benchmark | duvo-eye-1 (v1) | duvo-eye-1.5 | Δ |
|---|---|---|---|
| ScreenSpot-Pro (1,581) | 72.99% | 73.31% | +0.32 pp |
| OSWorld-G (510) | 80.2% | 80.78% | +0.58 pp |
| UI-I2E-Bench (1,477) | 84.2% | 84.90% | +0.70 pp |
Parse failures were 0% on every board for both models. The v1 column is v1 re-measured under this same harness (the A/B reference). For OSWorld-G, that re-measurement (80.2) runs about two points above v1's separately published 78.0 (510 samples) due to harness settings; what is comparable here is the within-harness Δ, not the cross-harness absolute.
Honest framing. This is a real but modest improvement — a clean, monotone uplift from RL grounding refinement on an already-strong model. The headline gains are fractions of a point per board. It is not SOTA and not a step change; if you already run duvo-eye-1, expect a small, safe upgrade, not a new tier of capability.
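To put "fractions of a point" in concrete terms, the percentages in the table can be converted back into approximate sample counts. The sketch below is plain arithmetic on the published figures. The counts are implied by rounding and are not reported in this card, so treat them as estimates (a one-decimal percentage can correspond to more than one whole count).

```python
# Benchmark sizes and scores copied from the results table:
# name -> (number of samples, v1 accuracy %, v1.5 accuracy %)
BOARDS = {
    "ScreenSpot-Pro": (1581, 72.99, 73.31),
    "OSWorld-G": (510, 80.2, 80.78),
    "UI-I2E-Bench": (1477, 84.2, 84.90),
}


def implied_correct(n_samples: int, accuracy_pct: float) -> int:
    """Nearest whole number of correct samples implied by a rounded percentage."""
    return round(n_samples * accuracy_pct / 100)


def implied_gain(board: str) -> int:
    """Approximate number of extra samples v1.5 gets right on one board."""
    n_samples, v1_pct, v15_pct = BOARDS[board]
    return implied_correct(n_samples, v15_pct) - implied_correct(n_samples, v1_pct)


if __name__ == "__main__":
    for name in BOARDS:
        print(name, implied_gain(name))
```

Under that rounding assumption, the gains correspond to roughly 5, 3, and 10 additional correct samples on the three boards.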
By target type, the model is much stronger on text targets (~82%) than on icon targets (~60%). That icon / tiny-target gap is the main remaining weakness, and the RL stage, trained at ~1M px while ScreenSpot-Pro screenshots are 4K, does not close it (see Limitations).
Positioning
The credible pitch is efficiency, reliability, and single-model grounding quality — not "beats frontier." At ~3B active parameters, the v1 lineage sits among the very top single-forward-pass models on ScreenSpot-Pro: v1's 72.9 was verified as #2 of 86 entries on the official leaderboard (2026-06-13), behind only one 8B model and ahead of every larger single model; v1.5's 73.31 nudges that up. The handful of entries scoring higher are multi-step scaffolds and ensembles (iterative zoom, agentic refinement; top ≈80.9) — a separate, more expensive inference class that composes on top of any grounder. duvo-eye-1.5 is a strong, cheap single-model grounder; it is not an agent and does not beat frontier models on agentic benchmarks. For the full landscape (ScreenSpot-v2, UI-I2E, UI-Vision, WebClick, Showdown, and the verified-vs-self-reported breakdown), see the duvo-eye-1 card — those boards were measured on the v1 lineage and were not re-run for v1.5.
Usage
duvo-eye-1.5 is a standard transformers image-text-to-text model. Disable thinking, use the exact grounding prompt below, and parse the JSON point. The returned {x, y} are normalized to [0, 1000]; scale by the original screenshot dimensions, not the downscaled model input.
```python
import json
import re

from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "duvoai/duvo-eye-1.5"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

image = Image.open("screenshot.png").convert("RGB")
target = "the Save button in the top toolbar"
prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: {"x": int 0-1000, "y": int 0-1000}\n'
    " Your target is:\n" + target
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=False,  # REQUIRED: emit the coordinate directly, not reasoning
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, not the prompt
output = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]

# Parse {"x": int 0-1000, "y": int 0-1000}
point = json.loads(re.search(r"\{.*?\}", output, re.DOTALL).group(0))
x_norm, y_norm = point["x"], point["y"]

# Scale to original pixels
w, h = image.size
x_px = round(x_norm / 1000 * w)
y_px = round(y_norm / 1000 * h)
print(x_px, y_px)
```
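In a pipeline it can be useful to make the parse step tolerant of unexpected output, for example when thinking was left on by mistake. The helper below is a minimal sketch, not part of the released code: the name parse_click and the choice to return None on failure and to clamp to the last pixel are assumptions made for illustration.

```python
import json
import re
from typing import Optional, Tuple

# Same non-greedy pattern as the usage snippet: the first {...} span in the output.
_POINT_RE = re.compile(r"\{.*?\}", re.DOTALL)


def parse_click(output: str, width: int, height: int) -> Optional[Tuple[int, int]]:
    """Return (x_px, y_px) in original-image pixels, or None if no valid point is found.

    `width` and `height` are the ORIGINAL screenshot dimensions,
    not the downscaled model input.
    """
    match = _POINT_RE.search(output)
    if match is None:
        return None
    try:
        point = json.loads(match.group(0))
        x_norm, y_norm = int(point["x"]), int(point["y"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing keys, or non-numeric values.
        return None
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        return None
    # Clamp so that a normalized value of 1000 still maps to a valid pixel index.
    x_px = min(round(x_norm / 1000 * width), width - 1)
    y_px = min(round(y_norm / 1000 * height), height - 1)
    return x_px, y_px
```

A caller would pass the decoded model output together with image.size, and treat None as a parse failure to log or retry.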
Notes:
- Always pass enable_thinking=False. This injects an empty <think></think> block so the model emits the JSON coordinate directly. Without it the model produces reasoning text and may not reach a parseable point.
- Use greedy / deterministic decoding (do_sample=False) for reproducible grounding.
- For high-resolution screenshots (4K, professional software), raise the processor's max_pixels so the input is not over-downscaled. Resolution is the single biggest lever on dense, small-icon UIs.
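The effect of the pixel budget on small targets can be estimated with simple arithmetic. The sketch below assumes an aspect-preserving resize that shrinks the image until its area fits max_pixels. Real processors also round to patch-size multiples, so the numbers are approximate, and the 1,000,000 px budget and 24 px icon are illustrative values only.

```python
import math


def downscale_factor(width: int, height: int, max_pixels: int) -> float:
    """Linear scale applied when an image is shrunk to fit a pixel budget.

    Returns 1.0 when the image already fits (no upscaling is modeled here).
    """
    pixels = width * height
    if pixels <= max_pixels:
        return 1.0
    return math.sqrt(max_pixels / pixels)


def effective_size(target_px: float, width: int, height: int, max_pixels: int) -> float:
    """Approximate on-input size (in pixels) of a UI element after downscaling."""
    return target_px * downscale_factor(width, height, max_pixels)


if __name__ == "__main__":
    # A 24 px icon on a 3840x2160 screenshot under a ~1M px budget.
    print(round(effective_size(24, 3840, 2160, 1_000_000), 2))
```

Under these assumptions a 4K screenshot is scaled by about 0.35 on each axis, so a 24 px icon shrinks to roughly 8 px on the model input. That is consistent with the card's point that resolution matters most on dense, small-icon UIs.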
Serving
bf16 weights are ~66 GB:
- vLLM TP=1 on one 141 GB H200, or TP=2 on 2×80 GB.
- Tune max_pixels for input resolution; for 4K professional-software screenshots, raise it so fine icons aren't over-downscaled. The [0, 1000] output is resolution-independent, so always rescale by the original screenshot size.
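When serving through an OpenAI-compatible endpoint, thinking still has to be disabled per request. The sketch below only builds the request body. It assumes a vLLM server that honors the chat_template_kwargs field (a vLLM extension to the OpenAI schema; check your server version). The function name build_grounding_request is hypothetical, and the prompt text is copied from the Usage section.

```python
import base64
import json

# Grounding prompt copied from the Usage section; the target is appended at the end.
GROUNDING_PROMPT = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: {"x": int 0-1000, "y": int 0-1000}\n'
    " Your target is:\n"
)


def build_grounding_request(png_bytes: bytes, target: str,
                            model: str = "duvoai/duvo-eye-1.5") -> dict:
    """Build a chat-completions request body for one grounding call."""
    image_b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                    {"type": "text", "text": GROUNDING_PROMPT + target},
                ],
            }
        ],
        "temperature": 0.0,  # greedy decoding
        "max_tokens": 64,
        # Assumption: the server forwards this to the chat template.
        "chat_template_kwargs": {"enable_thinking": False},
    }


if __name__ == "__main__":
    body = build_grounding_request(b"<png bytes here>", "the Save button in the top toolbar")
    print(json.dumps(body)[:80])
```

The resulting dictionary can be sent as JSON to the server's /v1/chat/completions route with any HTTP client. The response text is then parsed and rescaled exactly as in the Usage section.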
Training
duvo-eye-1.5 = duvo-eye-1 + a GRPO reinforcement-learning stage, merged.
Method. GRPO (via TRL) with an attention-only LoRA (rank 16, on q/k/v/o), beta = 0, 200 steps on 4×H100. Rollouts were generated with enable_thinking=False (v1 already grounds the RL data at ~80% greedy, which gives a dense reward signal), then the trained LoRA was folded into v1 to produce the released full model.
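For readers unfamiliar with GRPO, its core step is a group-relative advantage: several rollouts are sampled per prompt and each reward is normalized against its own group. The sketch below shows the common textbook formulation in plain Python. It is not the TRL implementation, the eps value is an assumption, and with beta = 0 there is no KL penalty term to add.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize the rewards of one prompt's rollouts against their group statistics.

    Uses the population standard deviation; library implementations may
    differ in this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, four rollouts: two hits, one near miss, one far miss.
    print(group_advantages([1.0, 1.0, 0.09, 0.0]))
    # A group with identical rewards yields all-zero advantages.
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

A group in which every rollout earns the same reward contributes no learning signal, so the useful prompts are those where rollouts disagree.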
Reward. A point-in-bbox reward plus a small format reward:
- 1.0 if the predicted point falls inside the ground-truth bounding box;
- otherwise 0.25 · exp(−dist / 150), distance shaping toward the box;
- a small format reward for emitting a parseable, in-range {x, y}.
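The reward described above can be sketched as a small function. Several details are assumptions rather than facts from this card: dist is taken as the Euclidean distance from the point to the nearest edge of the box in [0, 1000] units, the format reward is added on top with a hypothetical value of 0.1, and an unparseable or out-of-range prediction (passed as None) earns nothing.

```python
import math
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1000]


def distance_to_box(x: float, y: float, box: Box) -> float:
    """Euclidean distance from a point to the nearest point of the box (0 if inside)."""
    x_min, y_min, x_max, y_max = box
    dx = max(x_min - x, 0.0, x - x_max)
    dy = max(y_min - y, 0.0, y - y_max)
    return math.hypot(dx, dy)


def grounding_reward(point: Optional[Tuple[float, float]], box: Box,
                     format_bonus: float = 0.1) -> float:
    """Point-in-bbox reward with distance shaping, plus a format bonus.

    `point` is None when the output was unparseable or out of range.
    `format_bonus` is a HYPOTHETICAL value; the card only says "small".
    """
    if point is None:
        return 0.0
    dist = distance_to_box(point[0], point[1], box)
    if dist == 0.0:
        accuracy = 1.0  # inside (or on the edge of) the ground-truth box
    else:
        accuracy = 0.25 * math.exp(-dist / 150)  # shaping toward the box
    return accuracy + format_bonus
```

With this shaping, a miss that lands 150 normalized units from the box earns about 0.09 before the format bonus, so even the closest miss (at most 0.25) stays well below a hit.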
Data. ~6.5k grounding prompts mined from ServiceNow/GroundCUA (open-source professional-desktop GUIs), trained at ~1M px with bounding boxes normalized to [0, 1000]. The train/eval resolution gap (1M px training vs 4K test screenshots) is why icon precision did not improve much — higher-resolution RL is the next lever.
For the v1 SFT recipe (LoRA on Holo-3.1-35B-A3B over the private duvoai/SynthUI corpus), see the duvo-eye-1 card.
Intended use
- GUI element grounding in a computer-use / desktop-automation pipeline: map a textual target description to a single click point on a screenshot.
- As the grounding stage behind a separate planner (which decides what to click) and, optionally, a verifier / test-time-scaling layer on top.
- Enterprise back-office and professional-software UIs (the v1 lineage was tuned on synthetic back-office UIs; the RL stage used open-source professional-desktop screenshots).
Limitations
- Icons and tiny targets are the weak spot. The model is much stronger on text targets (~82%) than on icon targets (~60%); the RL stage does not close this gap. Dense professional UIs with small icon controls remain the hardest case.
- Disable thinking: it is not a reasoning model. You must call apply_chat_template(..., enable_thinking=False) for direct grounding. The inherited template defaults to thinking ON; left on, the model emits reasoning instead of a coordinate and often fails to produce a parseable point within a short token budget. Thinking does not improve grounding here.
- It always returns a coordinate and cannot abstain. There is no mechanism to say "the target is not present." If the described element is absent, the model still emits a (wrong) click point. Handle absence at the pipeline level.
- Grounder, not agent. No planning, navigation, multi-step execution, or function calling. It resolves one description to one click and nothing more.
- Modest gain over v1. Expect a small, clean uplift (fractions of a point per board), not a step change.
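Because the model cannot abstain, one way to handle absence at the pipeline level is to wrap the grounder with a separate check that may reject its answer. The sketch below is illustrative only: the Grounder and Verifier callables and the ground_with_guard name are hypothetical, and how a verifier decides (a second model, an accessibility-tree lookup, a crop classifier) is left open.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]
# Hypothetical interfaces: the grounder always answers, the verifier may reject.
Grounder = Callable[[str], Point]
Verifier = Callable[[str, Point], bool]


def ground_with_guard(target: str, grounder: Grounder,
                      verifier: Optional[Verifier] = None) -> Optional[Point]:
    """Return a click point, or None when the verifier rejects the grounder's answer."""
    point = grounder(target)
    if verifier is not None and not verifier(target, point):
        return None  # treat as "target not present"; let the planner decide what to do
    return point


if __name__ == "__main__":
    # Stub components standing in for the real model and a real check.
    always_center: Grounder = lambda target: (960, 540)
    only_save: Verifier = lambda target, point: "Save" in target
    print(ground_with_guard("the Save button", always_center, only_save))
    print(ground_with_guard("the Delete button", always_center, only_save))
```

Returning None hands the decision back to the planner, which keeps the grounder itself a single-purpose component as described under Intended use.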
Citation
@misc{duvo2026eye15,
title = {duvo-eye-1.5: a GRPO-refined GUI grounding model},
author = {Duvo AI},
year = {2026},
url = {https://huggingface.co/duvoai/duvo-eye-1.5},
}
Built on Hcompany/Holo-3.1-35B-A3B (H Company) and refined with ServiceNow/GroundCUA. License: Apache 2.0, same as the base.