duvo-eye-1.5

duvo-eye-1.5 is a 35B-A3B (≈3B active) Vision-Language Model for single-step GUI element grounding: given a screenshot and a natural-language description of a target, it returns one click position {"x", "y"} in the [0, 1000] normalized range. It is the grounding component of a computer-use stack: a planner decides what to interact with, and duvo-eye-1.5 resolves that description to where on screen. It is not an agent. It does no planning, navigation, multi-step execution, or function calling, and it cannot abstain when the target is absent.

duvo-eye-1.5 is duvo-eye-1 (v1) further trained with GRPO reinforcement learning under a point-in-bbox reward, then merged back to full bf16 weights (v1 + the GRPO LoRA folded in). v1 was itself a LoRA SFT of Hcompany/Holo-3.1-35B-A3B on Duvo's private SynthUI corpus. The RL stage delivers a small, clean, across-the-board gain over v1 — higher on all three public grounding benchmarks we track, with no regressions and 0% parse failures. It refines grounding precision rather than adding capability; it is not a large jump and not state of the art, and we say so plainly below.

| Field | Value |
|---|---|
| Lineage | duvoai/duvo-eye-1 (GRPO RL) ← LoRA SFT of Hcompany/Holo-3.1-35B-A3B |
| Architecture | Qwen3_5MoeForConditionalGeneration (qwen3_5_moe); 35B-A3B MoE, ~3B active params |
| Task | image-text-to-text (single-step GUI grounding) |
| Output | {"x": int, "y": int} normalized to [0, 1000]; the caller rescales to original screenshot pixels |
| Weights | full merged model, bf16, ~66 GB |
| Decoding | greedy, enable_thinking=False, max_new_tokens=64 |
| Language | English (inherited from the base) |
| License | Apache 2.0 |

⚠️ Disable thinking for grounding — this is not a reasoning model

duvo-eye-1.5 is a direct grounder: it is meant to emit the coordinate immediately, with no chain-of-thought. It inherits the Holo-3.1 / Qwen3.5 chat template, whose thinking mode is ON by default, so you must call apply_chat_template(..., enable_thinking=False). That injects an empty <think></think> block and the model outputs the JSON point directly — exactly how it was trained and evaluated. Left in the default thinking mode it falls back to base-style reasoning and will frequently not produce a parseable coordinate within a short token budget. Thinking does not improve grounding here; it must be turned off. See Usage.

Results

All numbers come from one fixed, validated harness with enable_thinking=False and deterministic (greedy) decoding. The harness is calibrated against v1's published result: it reproduces duvo-eye-1's published ScreenSpot-Pro 72.9 to within 0.1 pp (72.99% on the full 1,581 samples), so the v1 → v1.5 deltas below are a like-for-like A/B. Figures are as of 2026-06.

| Benchmark (samples) | duvo-eye-1 (v1) | duvo-eye-1.5 | Δ |
|---|---|---|---|
| ScreenSpot-Pro (1,581) | 72.99% | 73.31% | +0.32 pp |
| OSWorld-G (510) | 80.2% | 80.78% | +0.58 pp |
| UI-I2E-Bench (1,477) | 84.2% | 84.90% | +0.70 pp |

Parse failures were 0% on every board for both models. The v1 column is v1 re-measured under this same harness (the A/B reference). For OSWorld-G that re-measurement (80.2) runs about 2 points above v1's separately published 78.0 (510 samples) because of harness settings; what is comparable here is the within-harness Δ, not the cross-harness absolute.
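For readers unfamiliar with the metric, here is a minimal sketch of point-in-bbox scoring. It is not the actual harness: the function name, the data layout, and the convention that an unparseable output is recorded as None are all assumptions made for illustration, and both points and boxes are assumed to live in the same coordinate space.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def score_grounding(
    predictions: Sequence[Optional[Point]],
    boxes: Sequence[Box],
) -> Tuple[float, float]:
    """Return (accuracy %, parse-failure %) for point-in-bbox grounding.

    A prediction of None stands for an output that could not be parsed;
    it counts both as a miss and as a parse failure.
    """
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not boxes:
        raise ValueError("need at least one sample")

    hits = 0
    parse_failures = 0
    for point, (x0, y0, x1, y1) in zip(predictions, boxes):
        if point is None:
            parse_failures += 1
            continue
        x, y = point
        if x0 <= x <= x1 and y0 <= y <= y1:
            hits += 1

    n = len(boxes)
    return 100.0 * hits / n, 100.0 * parse_failures / n


# Toy data: two hits, one miss, one unparseable output.
acc, fail = score_grounding(
    [(500, 500), (10, 10), (900, 900), None],
    [(400, 400, 600, 600), (0, 0, 20, 20), (0, 0, 100, 100), (0, 0, 10, 10)],
)
print(f"accuracy={acc:.1f}% parse_failures={fail:.1f}%")
```

Counting parse failures separately from misses keeps a formatting regression from being mistaken for a grounding regression.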

Honest framing. This is a real but modest improvement — a clean, monotone uplift from RL grounding refinement on an already-strong model. The headline gains are fractions of a point per board. It is not SOTA and not a step change; if you already run duvo-eye-1, expect a small, safe upgrade, not a new tier of capability.

By target type, the model is much stronger on text targets (82%) than on icon targets (60%). That icon / tiny-target gap is the main remaining weakness, and the RL stage — trained at ~1M px while ScreenSpot-Pro screenshots are 4K — does not close it (see Limitations).

Positioning

The credible pitch is efficiency, reliability, and single-model grounding quality — not "beats frontier." At ~3B active parameters, the v1 lineage sits among the very top single-forward-pass models on ScreenSpot-Pro: v1's 72.9 was verified as #2 of 86 entries on the official leaderboard (2026-06-13), behind only one 8B model and ahead of every larger single model; v1.5's 73.31 nudges that up. The handful of entries scoring higher are multi-step scaffolds and ensembles (iterative zoom, agentic refinement; top ≈80.9) — a separate, more expensive inference class that composes on top of any grounder. duvo-eye-1.5 is a strong, cheap single-model grounder; it is not an agent and does not beat frontier models on agentic benchmarks. For the full landscape (ScreenSpot-v2, UI-I2E, UI-Vision, WebClick, Showdown, and the verified-vs-self-reported breakdown), see the duvo-eye-1 card — those boards were measured on the v1 lineage and were not re-run for v1.5.

Usage

duvo-eye-1.5 is a standard transformers image-text-to-text model. Disable thinking, use the exact grounding prompt below, and parse the JSON point. The returned {x, y} are normalized to [0, 1000]; scale by the original screenshot dimensions, not the downscaled model input.

import json
import re
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "duvoai/duvo-eye-1.5"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

image = Image.open("screenshot.png").convert("RGB")
target = "the Save button in the top toolbar"

prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: {"x": int 0-1000, "y": int 0-1000}\n'
    " Your target is:\n" + target
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=False,  # REQUIRED: emit the coordinate directly, not reasoning
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=64, do_sample=False)
output = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]

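# Parse {"x": int 0-1000, "y": int 0-1000}; fail loudly if no JSON object is present
match = re.search(r"\{.*?\}", output, re.DOTALL)
if match is None:
    raise ValueError(f"no JSON point in model output: {output!r}")
point = json.loads(match.group(0))
x_norm, y_norm = point["x"], point["y"]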

# Scale to original pixels
w, h = image.size
x_px = round(x_norm / 1000 * w)
y_px = round(y_norm / 1000 * h)
print(x_px, y_px)

Notes:

  • Always pass enable_thinking=False. This injects an empty <think></think> block so the model emits the JSON coordinate directly. Without it the model produces reasoning text and may not reach a parseable point.
  • Use greedy / deterministic decoding (do_sample=False) for reproducible grounding.
  • For high-resolution screenshots (4K, professional software), raise the processor's max_pixels so the input is not over-downscaled — resolution is the single biggest lever on dense, small-icon UIs.
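To make the resolution point concrete, the sketch below estimates how much a screenshot shrinks under a pixel budget. It assumes the processor resizes uniformly (preserving aspect ratio) until the image fits within max_pixels; real processors also round to patch multiples, which is ignored here. The helper name and the 16 px icon size are illustrative.

```python
import math


def downscale_factor(width: int, height: int, max_pixels: int) -> float:
    """Linear scale factor needed so that width * height fits in max_pixels.

    Assumes uniform, aspect-preserving resizing; returns 1.0 when the image
    already fits within the budget.
    """
    pixels = width * height
    if pixels <= max_pixels:
        return 1.0
    return math.sqrt(max_pixels / pixels)


# A 4K screenshot under a ~1M px budget (illustrative numbers).
factor = downscale_factor(3840, 2160, 1_000_000)
icon_px = 16  # hypothetical on-screen icon size
print(f"scale factor: {factor:.3f}")
print(f"{icon_px} px icon becomes ~{icon_px * factor:.1f} px")
```

Under these assumptions a 4K frame is shrunk to roughly a third of its linear size, so small icons are left with only a few pixels per side.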

Serving

bf16 weights are ~66 GB:

  • vLLM with TP=1 on one 141 GB H200, or TP=2 on two 80 GB GPUs.
  • Tune max_pixels for input resolution; for 4K professional-software screenshots, raise it so fine icons aren't over-downscaled. The [0, 1000] output is resolution-independent — always rescale by the original screenshot size.
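When the model is served behind an OpenAI-compatible chat endpoint, thinking still has to be disabled on every request. The sketch below only builds the request body and sends nothing. It assumes the server accepts a top-level chat_template_kwargs field and forwards it to the chat template, which is how vLLM exposes template switches for Qwen-style models. The helper name and the placeholder image bytes are hypothetical.

```python
import base64


def build_grounding_request(image_bytes: bytes, target: str, model: str) -> dict:
    """Build a chat-completions request body for a single grounding call."""
    prompt = (
        "Localize an element on the GUI image according to the provided target "
        "and output a click position.\n"
        ' * You must output a valid JSON following the format: {"x": int 0-1000, "y": int 0-1000}\n'
        " Your target is:\n" + target
    )
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "temperature": 0.0,  # deterministic (greedy) decoding
        "max_tokens": 64,
        # Assumption: the server passes this through to apply_chat_template.
        "chat_template_kwargs": {"enable_thinking": False},
    }


body = build_grounding_request(
    b"<png bytes here>",  # placeholder; pass the real screenshot bytes
    "the Save button in the top toolbar",
    "duvoai/duvo-eye-1.5",
)
print(sorted(body))
```

The returned point is still normalized to [0, 1000], so the client rescales it by the original screenshot size exactly as in the local example.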

Training

duvo-eye-1.5 = duvo-eye-1 + a GRPO reinforcement-learning stage, merged.

Method. GRPO (via TRL) with an attention-only LoRA (rank 16, on q/k/v/o), beta = 0 (no KL penalty term), 200 steps on 4×H100. Rollouts were generated with enable_thinking=False. Under greedy decoding v1 already grounds ~80% of the RL prompts correctly, which gives a dense reward signal. The trained LoRA was then folded into v1 to produce the released full model.
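Folding a LoRA into the base weights means adding the low-rank update to each adapted projection once, so that inference needs no adapter. The toy sketch below shows the standard merge W' = W + (alpha / r) · B · A on a 2×2 matrix in plain Python. The alpha value and the matrices are made up for illustration, and this is not necessarily how the released checkpoint was produced; in practice a library utility would apply the same update to every q/k/v/o projection.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy 2x2 weight with a rank-1 adapter (alpha is an assumed hyperparameter).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (out_features, rank)
lora_a = [[0.5, -0.5]]    # shape (rank, in_features)
merged = merge_lora(w, lora_b, lora_a, alpha=2.0, rank=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```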

Reward. A point-in-bbox reward plus a small format reward:

  • 1.0 if the predicted point falls inside the ground-truth bounding box;
  • otherwise 0.25 · exp(−dist / 150) distance shaping toward the box;
  • a small format reward for emitting a parseable, in-range {x, y}.
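The reward described above can be sketched as follows. Several details are assumptions, because the card does not specify them: dist is taken as the Euclidean distance from the point to the nearest edge of the box, measured in [0, 1000] normalized units; the format reward is modeled as an additive bonus with a hypothetical value of 0.1; and an unparseable output is represented as None and earns nothing.

```python
import math
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def distance_to_box(x: float, y: float, box: Box) -> float:
    """Euclidean distance from (x, y) to the nearest point of the box (0 if inside)."""
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return math.hypot(dx, dy)


def grounding_reward(
    point: Optional[Tuple[float, float]],
    box: Box,
    format_bonus: float = 0.1,  # hypothetical; the card only says "small"
) -> float:
    """Point-in-bbox reward with distance shaping and a format bonus."""
    if point is None:  # unparseable output earns nothing
        return 0.0
    x, y = point
    in_range = 0 <= x <= 1000 and 0 <= y <= 1000
    dist = distance_to_box(x, y, box)
    # Full credit inside the box, otherwise exponentially decaying partial credit.
    accuracy = 1.0 if dist == 0.0 else 0.25 * math.exp(-dist / 150.0)
    return accuracy + (format_bonus if in_range else 0.0)


box = (400, 400, 600, 600)
print(f"hit:         {grounding_reward((500, 500), box):.3f}")
print(f"150 away:    {grounding_reward((750, 500), box):.3f}")
print(f"unparseable: {grounding_reward(None, box):.3f}")
```

With these assumptions, a miss 150 units from the box earns about 0.09 in shaping reward, far below the 1.0 for a hit, so shaping guides near misses without rivaling a correct click.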

Data. ~6.5k grounding prompts mined from ServiceNow/GroundCUA (open-source professional-desktop GUIs), trained at ~1M px with bounding boxes normalized to [0, 1000]. The train/eval resolution gap (1M px training vs 4K test screenshots) is why icon precision did not improve much — higher-resolution RL is the next lever.

For the v1 SFT recipe (LoRA on Holo-3.1-35B-A3B over the private duvoai/SynthUI corpus), see the duvo-eye-1 card.

Intended use

  • GUI element grounding in a computer-use / desktop-automation pipeline: map a textual target description to a single click point on a screenshot.
  • As the grounding stage behind a separate planner (which decides what to click) and, optionally, a verifier / test-time-scaling layer on top.
  • Enterprise back-office and professional-software UIs (the v1 lineage was tuned on synthetic back-office UIs; the RL stage used open-source professional-desktop screenshots).

Limitations

  • Icons and tiny targets are the weak spot. The model is much stronger on text targets (~82%) than on icon targets (~60%); the RL stage does not close this gap. Dense professional UIs with small icon controls remain the hardest case.
  • Disable thinking — it is not a reasoning model. You must call apply_chat_template(..., enable_thinking=False) for direct grounding. The inherited template defaults to thinking ON; left on, the model emits reasoning instead of a coordinate and often fails to produce a parseable point within a short token budget. Thinking does not improve grounding here.
  • It always returns a coordinate and cannot abstain. There is no mechanism to say "the target is not present." If the described element is absent, the model still emits a (wrong) click point. Handle absence at the pipeline level.
  • Grounder, not agent. No planning, navigation, multi-step execution, or function calling. It resolves one description to one click and nothing more.
  • Modest gain over v1. Expect a small, clean uplift (fractions of a point per board), not a step change.

Citation

@misc{duvo2026eye15,
  title  = {duvo-eye-1.5: a GRPO-refined GUI grounding model},
  author = {Duvo AI},
  year   = {2026},
  url    = {https://huggingface.co/duvoai/duvo-eye-1.5},
}

Built on Hcompany/Holo-3.1-35B-A3B (H Company) and refined with ServiceNow/GroundCUA. License: Apache 2.0, same as the base.
