duvo-eye-1: benchmark evidence

Per-sample predictions and eval harness for duvoai/duvo-eye-1 public-benchmark results. Every claimed number can be independently rescored from the .predictions.jsonl files (one line per sample: raw model output, parsed point, ground truth, hit/miss).
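As a rough illustration of what rescoring could look like, the sketch below recomputes accuracy from one predictions file. It assumes each JSONL record carries the hit/miss flag under a key named `hit`, stored as a JSON boolean or 0/1; that key name and encoding are guesses, not taken from the files, so check a record first and adjust `hit_key`.

```python
import json


def rescore_lines(lines, hit_key="hit"):
    """Count hits over an iterable of JSONL lines.

    hit_key is a hypothetical field name; the real files may use another.
    Returns (hits, total, accuracy_percent).
    """
    hits = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        total += 1
        # Only an explicit true / 1 counts as a hit; anything else is a miss.
        hits += record.get(hit_key) in (True, 1)
    accuracy = 100.0 * hits / total if total else 0.0
    return hits, total, accuracy


def rescore_file(path, hit_key="hit"):
    """Same as rescore_lines, reading from a .predictions.jsonl path."""
    with open(path, encoding="utf-8") as fh:
        return rescore_lines(fh, hit_key=hit_key)
```

Calling `rescore_file` on any of the .predictions.jsonl files should give a hit count and percentage to compare against the table; recomputing the hit flag itself from the parsed point and ground truth is a stricter check.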

Protocol: single-shot, temperature 0, max_tokens 64, thinking disabled, guided JSON {"x": int, "y": int} in [0,1000] scaled to original pixels, point-in-region scoring (unparseable output = miss). Harness: bench_eval.py (included).
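The scoring step of this protocol can be sketched as follows. This is not the code in bench_eval.py: the function names are hypothetical, the target region is assumed to be an axis-aligned box `(left, top, right, bottom)` in original pixels (some benchmarks may use other region shapes), the scaling is assumed to be a plain `value / 1000 * dimension`, and treating out-of-range coordinates as unparseable is also an assumption.

```python
import json


def parse_point(raw_output):
    """Parse '{"x": int, "y": int}'; return None when the output is unusable."""
    try:
        obj = json.loads(raw_output)
        x, y = obj["x"], obj["y"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    for value in (x, y):
        # Reject non-integers (including booleans) and values outside [0, 1000].
        if isinstance(value, bool) or not isinstance(value, int):
            return None
        if not 0 <= value <= 1000:
            return None
    return x, y


def to_pixels(point, width, height):
    """Scale a [0, 1000] point to original-image pixel coordinates."""
    x, y = point
    return x * width / 1000.0, y * height / 1000.0


def is_hit(raw_output, bbox, width, height):
    """Point-in-region scoring; unparseable output counts as a miss."""
    point = parse_point(raw_output)
    if point is None:
        return False
    px, py = to_pixels(point, width, height)
    left, top, right, bottom = bbox
    return left <= px <= right and top <= py <= bottom
```

For instance, on a 1920x1080 screenshot the output `{"x": 500, "y": 500}` maps to pixel (960, 540), which is a hit for a box spanning (900, 500) to (1000, 600).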

| Benchmark | duvo-eye-1 | Holo-3.1-35B-A3B (base, same harness) | Files |
|---|---|---|---|
| ScreenSpot-Pro (1,581, max_pixels 8M)² | 72.9 | 56.1* | screenspot-pro/ |
| ScreenSpot-v2 (1,272) | 95.1 | n/a | screenspot-v2/ |
| OSWorld-G (510 standard subset) | 78.0 | 64.9 | osworld-g/ |
| OSWorld-G (full 564, refusals as misses) | 70.6 | n/a | osworld-g/ (bench_oswg564_*) |
| OSWorld-G refined instructions (564 / 510) | 75.0 / 82.9 | n/a | osworld-g-refined/ |
| WebClick (1,639) | 93.6 | n/a | webclick/ |
| UI-I2E-Bench (1,477) | 84.2 | n/a | ui-i2e-bench/ |
| UI-Vision element grounding (5,479; macro-avg) | 64.4 | n/a | ui-vision/ |
| Showdown-Clicks dev (557)¹ | 78.8 | n/a | showdown-clicks/ |

* Base ScreenSpot-Pro was measured under the identical harness and verified at eval time, but its per-sample predictions were not retained; H Company's published number for the base model is 71.5 (different protocol, see their model card).

¹ Scored by Showdown's own official metric (predicted point inside the ground-truth bounding box, is_in_bbox), so directly comparable to their leaderboard. n=557 dev split (95% CI ≈ ±3.5pp).

² Also reproduced under the benchmark's official eval_screenspot_pro.py harness: 72.87 (1152/1581), matching our 72.93 within one sample (screenspot-pro/duvo_eye_1_official_harness.json).
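The arithmetic behind two of these notes can be checked with a short sketch. It assumes the Showdown-Clicks interval is a normal-approximation (Wald) binomial interval, which the source does not state; the function names are illustrative only.

```python
import math


def accuracy_pct(hits, n):
    """Accuracy as a percentage."""
    return 100.0 * hits / n


def wald_halfwidth_pp(acc_pct, n, z=1.96):
    """Half-width, in percentage points, of a normal-approximation CI."""
    p = acc_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # ScreenSpot-Pro: official harness count vs. one additional hit.
    print(round(accuracy_pct(1152, 1581), 2))  # 72.87
    print(round(accuracy_pct(1153, 1581), 2))  # 72.93
    # Showdown-Clicks dev: 78.8% on n=557.
    print(round(wald_halfwidth_pp(78.8, 557), 2))
```

Under this approximation the Showdown-Clicks half-width comes out at about 3.4pp, in line with the ≈ ±3.5pp quoted; a Wilson or exact interval would differ slightly. The ScreenSpot-Pro lines show that 72.87 and 72.93 correspond to 1152 and 1153 hits out of 1,581, which is a one-sample difference.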

Refusal note: guided decoding always emits a coordinate, so the model cannot earn OSWorld-G refusal credit; full-564 numbers score all 54 refusal items as misses.
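A small sketch of how the full-564 figures relate to the 510-item ones when all 54 refusal items are misses. The hit counts are inferred here from the rounded percentages rather than read from the prediction files, and the sketch assumes the 510 shared items score the same in both runs; treat it as a consistency check, not as the harness logic.

```python
def implied_hits(acc_pct, n):
    """Nearest whole hit count consistent with a rounded percentage."""
    return round(acc_pct / 100.0 * n)


def full_set_accuracy(subset_acc_pct, n_subset, n_refusals):
    """Accuracy over subset plus refusal items, every refusal scored as a miss."""
    hits = implied_hits(subset_acc_pct, n_subset)
    return 100.0 * hits / (n_subset + n_refusals)


if __name__ == "__main__":
    # 510 standard items + 54 refusal items = 564 total.
    for label, acc in [("standard", 78.0), ("refined", 82.9)]:
        print(label, round(full_set_accuracy(acc, 510, 54), 1))
    # prints: standard 70.6, then refined 75.0
```

Both results agree with the full-564 entries reported for OSWorld-G (70.6) and OSWorld-G refined instructions (75.0).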

Submissions: ScreenSpot-Pro PR #29 · OSWorld-G PR #24
