Instructions for using duvoai/duvo-eye-1.5 with libraries and local apps.

Libraries

How to use duvoai/duvo-eye-1.5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="duvoai/duvo-eye-1.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("duvoai/duvo-eye-1.5")
model = AutoModelForMultimodalLM.from_pretrained("duvoai/duvo-eye-1.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use duvoai/duvo-eye-1.5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "duvoai/duvo-eye-1.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "duvoai/duvo-eye-1.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/duvoai/duvo-eye-1.5

SGLang

How to use duvoai/duvo-eye-1.5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "duvoai/duvo-eye-1.5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "duvoai/duvo-eye-1.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "duvoai/duvo-eye-1.5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "duvoai/duvo-eye-1.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use duvoai/duvo-eye-1.5 with Docker Model Runner:
```
docker model run hf.co/duvoai/duvo-eye-1.5
```

duvo-eye-1.5

duvo-eye-1.5 is a 35B-A3B (≈3B active) Vision-Language Model for single-step GUI element grounding: given a screenshot and a natural-language description of a target, it returns one click position {"x", "y"} in the [0, 1000] normalized range. It is the grounding component of a computer-use stack — a planner decides what to interact with, duvo-eye-1.5 resolves that description to where. It is not an agent: no planning, navigation, multi-step execution, or function calling, and it cannot abstain when the target is absent.

duvo-eye-1.5 is duvo-eye-1 (v1) further trained with GRPO reinforcement learning under a point-in-bbox reward, then merged back to full bf16 weights (v1 + the GRPO LoRA folded in). v1 was itself a LoRA SFT of Hcompany/Holo-3.1-35B-A3B on Duvo's private SynthUI corpus. The RL stage delivers a small, clean, across-the-board gain over v1 — higher on all three public grounding benchmarks we track, with no regressions and 0% parse failures. It refines grounding precision rather than adding capability; it is not a large jump and not state of the art, and we say so plainly below.

| Field | Value |
| --- | --- |
| Lineage | duvoai/duvo-eye-1 (GRPO RL) ← LoRA SFT of Hcompany/Holo-3.1-35B-A3B |
| Architecture | `Qwen3_5MoeForConditionalGeneration` (`qwen3_5_moe`): 35B-A3B MoE, ~3B active params |
| Task | `image-text-to-text` (single-step GUI grounding) |
| Output | `{"x": int, "y": int}` in `[0, 1000]`, scaled to original screenshot pixels |
| Weights | full merged model, bf16, ~66 GB |
| Decoding | greedy, `enable_thinking=False`, `max_new_tokens=64` |
| Language | English (inherited from the base) |
| License | Apache 2.0 |

⚠️ Disable thinking for grounding — this is not a reasoning model

duvo-eye-1.5 is a direct grounder: it is meant to emit the coordinate immediately, with no chain-of-thought. It inherits the Holo-3.1 / Qwen3.5 chat template, whose thinking mode is ON by default, so you must call apply_chat_template(..., enable_thinking=False). That injects an empty <think></think> block and the model outputs the JSON point directly — exactly how it was trained and evaluated. Left in the default thinking mode it falls back to base-style reasoning and will frequently not produce a parseable coordinate within a short token budget. Thinking does not improve grounding here; it must be turned off. See Usage.

Results

All numbers come from one fixed, validated harness with enable_thinking=False and deterministic (greedy) decoding. The harness is calibrated against v1's published result: it reproduces duvo-eye-1's published ScreenSpot-Pro score of 72.9 to within 0.1 pp (72.99% measured on the full 1,581 samples), so the v1 → v1.5 deltas below are a like-for-like A/B comparison. Figures are as of 2026-06.

| Benchmark (samples) | duvo-eye-1 (v1) | duvo-eye-1.5 | Δ |
| --- | --- | --- | --- |
| ScreenSpot-Pro (1,581) | 72.99% | 73.31% | +0.32 pp |
| OSWorld-G (510) | 80.2% | 80.78% | +0.58 pp |
| UI-I2E-Bench (1,477) | 84.2% | 84.90% | +0.70 pp |

Parse failures were 0% on every board for both models. The v1 column is v1 re-measured under this same harness (the A/B reference); for OSWorld-G that re-measurement (80.2) runs a couple of points above v1's separately-published 78.0/510 due to harness settings — what is comparable here is the within-harness Δ, not the cross-harness absolute.

Honest framing. This is a real but modest improvement — a clean, monotone uplift from RL grounding refinement on an already-strong model. The headline gains are fractions of a point per board. It is not SOTA and not a step change; if you already run duvo-eye-1, expect a small, safe upgrade, not a new tier of capability.
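To make "fractions of a point" concrete, the reported accuracies can be converted back into approximate counts of correct predictions. This is a back-of-envelope sketch, not part of the evaluation harness: it assumes each accuracy is simply correct / total with one prediction per sample, and the function names are illustrative.

```python
# Reported benchmark sizes and accuracies (percent), copied from the results table.
RESULTS = {
    "ScreenSpot-Pro": (1581, 72.99, 73.31),
    "OSWorld-G": (510, 80.2, 80.78),
    "UI-I2E-Bench": (1477, 84.2, 84.90),
}


def implied_correct(n_samples: int, accuracy_pct: float) -> int:
    """Nearest whole number of correct samples consistent with a reported accuracy."""
    return round(n_samples * accuracy_pct / 100)


def implied_gain(name: str) -> int:
    """Approximate number of additional samples v1.5 gets right compared with v1."""
    n_samples, v1_pct, v15_pct = RESULTS[name]
    return implied_correct(n_samples, v15_pct) - implied_correct(n_samples, v1_pct)


for name in RESULTS:
    print(name, implied_gain(name))
```

Under that assumption the deltas correspond to roughly 5, 3, and 10 additional correct predictions on the three benchmarks, which is consistent with describing the upgrade as small.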

By target type, the model is much stronger on text targets (~82%) than on icon targets (~60%). That icon / tiny-target gap is the main remaining weakness. The RL stage was trained at ~1M px while ScreenSpot-Pro screenshots are 4K, so it does not close the gap (see Limitations).

Positioning

The credible pitch is efficiency, reliability, and single-model grounding quality — not "beats frontier." At ~3B active parameters, the v1 lineage sits among the very top single-forward-pass models on ScreenSpot-Pro: v1's 72.9 was verified as #2 of 86 entries on the official leaderboard (2026-06-13), behind only one 8B model and ahead of every larger single model; v1.5's 73.31 nudges that up. The handful of entries scoring higher are multi-step scaffolds and ensembles (iterative zoom, agentic refinement; top ≈80.9) — a separate, more expensive inference class that composes on top of any grounder. duvo-eye-1.5 is a strong, cheap single-model grounder; it is not an agent and does not beat frontier models on agentic benchmarks. For the full landscape (ScreenSpot-v2, UI-I2E, UI-Vision, WebClick, Showdown, and the verified-vs-self-reported breakdown), see the duvo-eye-1 card — those boards were measured on the v1 lineage and were not re-run for v1.5.

Usage

duvo-eye-1.5 is a standard transformers image-text-to-text model. Disable thinking, use the exact grounding prompt below, and parse the JSON point. The returned {x, y} are normalized to [0, 1000]; scale by the original screenshot dimensions, not the downscaled model input.

import json
import re
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "duvoai/duvo-eye-1.5"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

image = Image.open("screenshot.png").convert("RGB")
target = "the Save button in the top toolbar"

prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: {"x": int 0-1000, "y": int 0-1000}\n'
    " Your target is:\n" + target
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=False,  # REQUIRED: emit the coordinate directly, not reasoning
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=64, do_sample=False)
output = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]

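# Parse {"x": int 0-1000, "y": int 0-1000}; fail loudly if no JSON object was emitted
match = re.search(r"\{.*?\}", output, re.DOTALL)
if match is None:
    raise ValueError(f"No JSON point in model output: {output!r}")
point = json.loads(match.group(0))
x_norm, y_norm = point["x"], point["y"]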

# Scale to original pixels
w, h = image.size
x_px = round(x_norm / 1000 * w)
y_px = round(y_norm / 1000 * h)
print(x_px, y_px)

Notes:

Always pass enable_thinking=False. This injects an empty <think></think> block so the model emits the JSON coordinate directly. Without it the model produces reasoning text and may not reach a parseable point.
Use greedy / deterministic decoding (do_sample=False) for reproducible grounding.
For high-resolution screenshots (4K, professional software), raise the processor's max_pixels so the input is not over-downscaled — resolution is the single biggest lever on dense, small-icon UIs.
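In a pipeline it can help to separate parsing from scaling and to return None instead of raising when the output is unusable. The helpers below are a minimal sketch under that assumption; parse_point and to_pixels are illustrative names, not part of the model's API, and clamping to the last valid pixel index is a design choice rather than something the card prescribes.

```python
import json
import re
from typing import Optional, Tuple


def parse_point(output: str) -> Optional[Tuple[int, int]]:
    """Return (x, y) in [0, 1000], or None if the text holds no valid point."""
    match = re.search(r"\{.*?\}", output, re.DOTALL)
    if match is None:
        return None
    try:
        point = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    x, y = point.get("x"), point.get("y")
    for value in (x, y):
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 1000:
            return None
    return round(x), round(y)


def to_pixels(point: Tuple[int, int], width: int, height: int) -> Tuple[int, int]:
    """Scale a [0, 1000] point to pixel coordinates of the ORIGINAL screenshot."""
    x, y = point
    # Clamp so that x == 1000 maps to the last valid pixel index, not one past it.
    x_px = min(round(x / 1000 * width), width - 1)
    y_px = min(round(y / 1000 * height), height - 1)
    return x_px, y_px


print(to_pixels(parse_point('{"x": 500, "y": 250}'), 3840, 2160))
```

A None result is a signal to retry or to check that enable_thinking=False was actually applied, since reasoning text is the most likely cause of an unparseable output.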

Serving

bf16 weights are ~66 GB:

Serve with vLLM at TP=1 on a single 141 GB H200, or at TP=2 across two 80 GB GPUs.
Tune max_pixels for input resolution; for 4K professional-software screenshots, raise it so fine icons aren't over-downscaled. The [0, 1000] output is resolution-independent — always rescale by the original screenshot size.
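The generic curl examples for vLLM and SGLang earlier on this page use a free-form prompt and leave thinking on. A request that matches the Usage section would carry the grounding prompt and disable thinking through the chat template. The sketch below only builds the JSON body; it assumes the server accepts a chat_template_kwargs field in the request (vLLM's OpenAI-compatible server documents this as an extension), so check it against the server version you run. The function name and the example URL are placeholders.

```python
import json

# Grounding prompt copied from the Usage section; the target is appended at the end.
GROUNDING_PROMPT = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: {"x": int 0-1000, "y": int 0-1000}\n'
    " Your target is:\n"
)


def build_grounding_request(image_url: str, target: str,
                            model: str = "duvoai/duvo-eye-1.5") -> dict:
    """Build a chat-completions body for one grounding call with thinking disabled."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": GROUNDING_PROMPT + target},
                ],
            }
        ],
        "max_tokens": 64,   # mirrors max_new_tokens=64 in the Usage section
        "temperature": 0,   # greedy decoding
        # Assumption: forwarded by the server to the chat template.
        "chat_template_kwargs": {"enable_thinking": False},
    }


payload = build_grounding_request(
    "https://example.com/screenshot.png",  # placeholder URL
    "the Save button in the top toolbar",
)
print(json.dumps(payload, indent=2))
```

The printed body can be sent to /v1/chat/completions with any HTTP client; the returned point still needs rescaling by the original screenshot size.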

Training

duvo-eye-1.5 = duvo-eye-1 + a GRPO reinforcement-learning stage, merged.

Method. GRPO (via TRL) with an attention-only LoRA (rank 16, on q/k/v/o), beta = 0, 200 steps on 4×H100. Rollouts were generated with enable_thinking=False. v1 already grounds ~80% of the RL prompts correctly under greedy decoding, which gives a dense reward signal. The trained LoRA was then folded into v1 to produce the released full model.
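For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts sampled for the same prompt; with beta = 0 there is no KL penalty term on top. The snippet below is a sketch of that standard group-relative normalization, not the training code used here. The group size, the eps value, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Normalize the rewards of one prompt's rollouts to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where three rollouts hit the box and one misses carries a learning signal.
mixed = group_relative_advantages([1.0, 1.0, 1.0, 0.0])

# A group where every rollout gets the same reward contributes nothing.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

Only groups with a mix of outcomes produce non-zero advantages, which is one reason distance shaping on misses is useful: it keeps rewards within a group from collapsing to identical values.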

Reward. A point-in-bbox reward plus a small format reward:

1.0 if the predicted point falls inside the ground-truth bounding box;
otherwise 0.25 · exp(−dist / 150) distance shaping toward the box;
a small format reward for emitting a parseable, in-range {x, y}.
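The reward described above can be sketched as a small function. Several details are not stated in this card and are assumptions here: distance is taken as Euclidean distance from the point to the nearest edge of the box, coordinates are in the same [0, 1000] normalized space as the boxes, the format reward is additive, and its size (format_bonus) is a made-up placeholder.

```python
import math


def distance_to_box(x: float, y: float, box: tuple) -> float:
    """Euclidean distance from (x, y) to the box (x0, y0, x1, y1); 0.0 if inside."""
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return math.hypot(dx, dy)


def grounding_reward(x: float, y: float, box: tuple,
                     format_ok: bool = True, format_bonus: float = 0.1) -> float:
    """Point-in-bbox reward with distance shaping, plus a small format bonus."""
    if not format_ok:
        return 0.0  # unparseable or out-of-range output earns nothing
    dist = distance_to_box(x, y, box)
    if dist == 0.0:
        accuracy = 1.0                            # point is inside the box
    else:
        accuracy = 0.25 * math.exp(-dist / 150)   # shaped reward for near misses
    return accuracy + format_bonus                # format_bonus is a placeholder value


box = (400, 400, 600, 600)
print(grounding_reward(500, 500, box))   # inside the box
print(grounding_reward(750, 500, box))   # 150 units to the right of the box
```

Because the shaping term is capped at 0.25, a near miss can never outscore a hit, so the reward stays aligned with the point-in-box metric used for evaluation.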

Data. ~6.5k grounding prompts mined from ServiceNow/GroundCUA (open-source professional-desktop GUIs), trained at ~1M px with bounding boxes normalized to [0, 1000]. The train/eval resolution gap (1M px training vs 4K test screenshots) is why icon precision did not improve much — higher-resolution RL is the next lever.
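The resolution gap can be put in numbers. The sketch below assumes "4K" means 3840×2160 and that an image is downscaled uniformly to fit a pixel budget; real processors also round to patch-size multiples, which is ignored here. The 24 px icon is a hypothetical example size.

```python
import math


def downscale_factor(width: int, height: int, pixel_budget: int) -> float:
    """Linear scale factor that fits width x height into pixel_budget (never upscales)."""
    return min(1.0, math.sqrt(pixel_budget / (width * height)))


scale = downscale_factor(3840, 2160, 1_000_000)
icon_px = 24  # hypothetical icon edge length on the original screenshot
print(round(scale, 3), round(icon_px * scale, 1))  # 0.347 8.3
```

At a ~1M px budget a 4K screenshot shrinks to about a third of its linear size, so a 24 px icon is seen at roughly 8 px. That is consistent with icons remaining the weak spot and with the advice to raise max_pixels at inference.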

For the v1 SFT recipe (LoRA on Holo-3.1-35B-A3B over the private duvoai/SynthUI corpus), see the duvo-eye-1 card.

Intended use

GUI element grounding in a computer-use / desktop-automation pipeline: map a textual target description to a single click point on a screenshot.
As the grounding stage behind a separate planner (which decides what to click) and, optionally, a verifier / test-time-scaling layer on top.
Enterprise back-office and professional-software UIs (the v1 lineage was tuned on synthetic back-office UIs; the RL stage used open-source professional-desktop screenshots).

Limitations

Icons and tiny targets are the weak spot. The model is much stronger on text targets (~82%) than on icon targets (~60%); the RL stage does not close this gap. Dense professional UIs with small icon controls remain the hardest case.
Disable thinking — it is not a reasoning model. You must call apply_chat_template(..., enable_thinking=False) for direct grounding. The inherited template defaults to thinking ON; left on, the model emits reasoning instead of a coordinate and often fails to produce a parseable point within a short token budget. Thinking does not improve grounding here.
It always returns a coordinate and cannot abstain. There is no mechanism to say "the target is not present." If the described element is absent, the model still emits a (wrong) click point. Handle absence at the pipeline level.
Grounder, not agent. No planning, navigation, multi-step execution, or function calling. It resolves one description to one click and nothing more.
Modest gain over v1. Expect a small, clean uplift (fractions of a point per board), not a step change.
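Handling absence at the pipeline level usually means wrapping the grounder with a separate check. The wrapper below is a structural sketch only: ground and verify are hypothetical callables supplied by the caller (for example the Usage code and a verifier of your own), and nothing here is shipped with the model.

```python
from typing import Any, Callable, Optional, Tuple

Point = Tuple[int, int]


def ground_or_abstain(
    ground: Callable[[Any, str], Optional[Point]],
    verify: Callable[[Any, str, Point], bool],
    screenshot: Any,
    target: str,
) -> Optional[Point]:
    """Return a click point only if an external verifier accepts it, else None."""
    point = ground(screenshot, target)
    if point is None:
        return None  # the grounder output could not be parsed
    return point if verify(screenshot, target, point) else None


# Stub callables standing in for the model and a verifier.
def fake_ground(screenshot: Any, target: str) -> Optional[Point]:
    return (120, 48)


def reject_everything(screenshot: Any, target: str, point: Point) -> bool:
    return False


print(ground_or_abstain(fake_ground, reject_everything, None, "the Save button"))
```

The verifier carries the abstention decision, so its quality bounds how reliably the pipeline detects absent targets; the grounder itself is unchanged.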

Citation

@misc{duvo2026eye15,
  title  = {duvo-eye-1.5: a GRPO-refined GUI grounding model},
  author = {Duvo AI},
  year   = {2026},
  url    = {https://huggingface.co/duvoai/duvo-eye-1.5},
}

Built on Hcompany/Holo-3.1-35B-A3B (H Company) and refined with ServiceNow/GroundCUA. License: Apache 2.0, same as the base.

