duvo-eye-1 — benchmark evidence

Per-sample predictions and eval harness for duvoai/duvo-eye-1 public-benchmark results. Every claimed number can be independently rescored from the .predictions.jsonl files (one line per sample: raw model output, parsed point, ground truth, hit/miss).
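As a rough illustration of what rescoring from a predictions file involves, the sketch below counts hits over JSONL lines. The field name `hit` is an assumption made for this example; check the actual key names in the `.predictions.jsonl` files (or in `bench_eval.py`) before relying on it.

```python
import json
from typing import Iterable, Tuple


def rescore(lines: Iterable[str], hit_key: str = "hit") -> Tuple[int, int, float]:
    """Return (hits, total, accuracy_pct) from JSONL lines.

    `hit_key` is a hypothetical field name; adjust it to the real schema.
    """
    hits = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        hits += bool(record[hit_key])
    accuracy_pct = 100.0 * hits / total if total else 0.0
    return hits, total, accuracy_pct
```

To use it on a file, pass an open file handle, for example `rescore(open(path, encoding="utf-8"))`, where `path` points at one of the prediction files.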

Protocol: single-shot, temperature 0, max_tokens 64, thinking disabled, guided JSON {"x": int, "y": int} in [0,1000] scaled to original pixels, point-in-region scoring (unparseable output = miss). Harness: bench_eval.py (included).
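The scoring path described in the protocol can be sketched as follows. This is not the code in `bench_eval.py`; the function names, the `x / 1000 * width` scaling convention, and the use of an axis-aligned `(left, top, right, bottom)` box as the target region are all assumptions for illustration. Benchmarks with polygon regions would need a different containment test.

```python
import json
from typing import Optional, Tuple


def parse_point(raw: str) -> Optional[Tuple[int, int]]:
    """Parse {"x": int, "y": int} in [0, 1000]; return None if unparseable."""
    try:
        obj = json.loads(raw)
        x, y = obj["x"], obj["y"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if isinstance(x, bool) or isinstance(y, bool):
        return None  # JSON true/false are ints in Python; reject them
    if not (isinstance(x, int) and isinstance(y, int)):
        return None
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        return None
    return x, y


def to_pixels(point: Tuple[int, int], width: int, height: int) -> Tuple[float, float]:
    """Scale a [0, 1000] point to original-image pixels (assumed convention)."""
    x, y = point
    return x / 1000 * width, y / 1000 * height


def is_hit(raw: str, width: int, height: int,
           bbox: Tuple[float, float, float, float]) -> bool:
    """Point-in-region check; unparseable output counts as a miss."""
    point = parse_point(raw)
    if point is None:
        return False
    px, py = to_pixels(point, width, height)
    left, top, right, bottom = bbox
    return left <= px <= right and top <= py <= bottom
```

For example, on a 1920x1080 screenshot the output `{"x": 500, "y": 500}` maps to pixel (960, 540) under this convention.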

| Benchmark | duvo-eye-1 | Holo-3.1-35B-A3B (base, same harness) | Files |
|---|---|---|---|
| ScreenSpot-Pro (1,581, max_pixels 8M)² | 72.9 | 56.1* | `screenspot-pro/` |
| ScreenSpot-v2 (1,272) | 95.1 | — | `screenspot-v2/` |
| OSWorld-G (510 standard subset) | 78.0 | 64.9 | `osworld-g/` |
| OSWorld-G (full 564, refusals as misses) | 70.6 | — | `osworld-g/` (`bench_oswg564_*`) |
| OSWorld-G refined instructions (564 / 510) | 75.0 / 82.9 | — | `osworld-g-refined/` |
| WebClick (1,639) | 93.6 | — | `webclick/` |
| UI-I2E-Bench (1,477) | 84.2 | — | `ui-i2e-bench/` |
| UI-Vision element grounding (5,479; macro-avg) | 64.4 | — | `ui-vision/` |
| Showdown-Clicks dev (557)¹ | 78.8 | — | `showdown-clicks/` |

* Base ScreenSpot-Pro was measured under the identical harness and verified at eval time, but its per-sample predictions were not retained; H Company's published number for the base model is 71.5 (different protocol, see their model card).

¹ Scored with Showdown's own official metric (is_in_bbox: the predicted point must fall inside the ground-truth bounding box), so the result is directly comparable to their leaderboard. n=557 dev split (95% CI ≈ ±3.5pp).
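The confidence interval quoted above can be sanity-checked with a normal-approximation (Wald) interval. Using that formula is an assumption; the source does not say which interval was used. With 78.8% accuracy on 557 samples it gives a half-width of roughly 3.4pp, in line with the rounded figure.

```python
import math


def wald_half_width_pp(accuracy_pct: float, n: int, z: float = 1.96) -> float:
    """Half-width, in percentage points, of a normal-approximation CI."""
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


half_width = wald_half_width_pp(78.8, 557)
print(f"±{half_width:.1f}pp")
```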

² Also reproduced under the benchmark's official eval_screenspot_pro.py harness: 72.87 (1152/1581), matching our 72.93 within one sample (screenspot-pro/duvo_eye_1_official_harness.json).
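The "within one sample" claim is simple arithmetic on the 1,581-sample set, and can be checked by converting each percentage back to a hit count. The helper name below is illustrative only.

```python
def hits_from_pct(accuracy_pct: float, n: int) -> int:
    """Recover the integer hit count implied by a rounded accuracy."""
    return round(accuracy_pct / 100.0 * n)


n = 1581
official = hits_from_pct(72.87, n)   # official-harness score
ours = hits_from_pct(72.93, n)       # score under bench_eval.py
difference = ours - official         # number of samples that differ in outcome
```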

Refusal note: guided decoding always emits a coordinate, so the model cannot earn OSWorld-G refusal credit; full-564 numbers score all 54 refusal items as misses.
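The relation between the 510-subset and full-564 rows follows directly from this note: the hit count stays the same and the 54 refusal items are added to the denominator. A minimal check, with the hit count recovered from the rounded subset accuracy (the function name is illustrative):

```python
from typing import Tuple


def full_set_accuracy(subset_acc_pct: float, subset_n: int,
                      refusal_n: int) -> Tuple[int, float]:
    """Return (hits, full-set accuracy in %) when every refusal item is a miss."""
    hits = round(subset_acc_pct / 100.0 * subset_n)
    return hits, 100.0 * hits / (subset_n + refusal_n)


standard_hits, standard_full = full_set_accuracy(78.0, 510, 54)
refined_hits, refined_full = full_set_accuracy(82.9, 510, 54)
print(f"{standard_full:.1f} {refined_full:.1f}")
```

Both results agree with the full-564 figures in the table after rounding to one decimal place.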

Submissions: ScreenSpot-Pro PR #29 · OSWorld-G PR #24
