duvo-eye-1: benchmark evidence
Per-sample predictions and eval harness for duvoai/duvo-eye-1 public-benchmark results. Every claimed number can be independently rescored from the .predictions.jsonl files (one line per sample: raw model output, parsed point, ground truth, hit/miss).
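As a rough illustration of that rescoring step, the sketch below counts hits in one predictions file. The field name `hit` and the helper name `rescore` are assumptions made for illustration; check the actual key names in the `.predictions.jsonl` files (or in bench_eval.py) before relying on it.

```python
import json


def rescore(path, hit_key="hit"):
    """Return (hits, total, accuracy_percent) for one predictions file.

    Assumes one JSON object per line and a boolean hit/miss field whose
    name is given by hit_key (hypothetical; adjust to the real schema).
    """
    hits = 0
    total = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            hits += bool(record.get(hit_key, False))
    accuracy = 100.0 * hits / total if total else 0.0
    return hits, total, accuracy
```

Calling `rescore` on a benchmark's predictions file and rounding the third value to one decimal should give a figure comparable to the reported score, provided the assumed field name matches.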
Protocol: single-shot, temperature 0, max_tokens 64, thinking disabled, guided JSON {"x": int, "y": int} in [0,1000] scaled to original pixels, point-in-region scoring (unparseable output = miss). Harness: bench_eval.py (included).
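The scoring rule in the protocol can be sketched as follows. This is a minimal illustration, not the code in bench_eval.py: the function names are hypothetical, the target region is assumed to be an axis-aligned box `(left, top, right, bottom)` in original pixels with inclusive edges, and out-of-range coordinates are assumed to count as unparseable. Benchmarks with polygon regions would need a different containment test.

```python
import json


def parse_point(raw_output):
    """Parse guided-JSON output; return (x, y) in [0, 1000] or None."""
    try:
        data = json.loads(raw_output)
        x, y = data["x"], data["y"]
    except (ValueError, KeyError, TypeError):
        return None
    for value in (x, y):
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            return None
        if not 0 <= value <= 1000:
            return None
    return x, y


def to_pixels(point, width, height):
    """Scale a [0, 1000] point to the original image size in pixels."""
    x, y = point
    return x * width / 1000.0, y * height / 1000.0


def is_hit(raw_output, bbox, width, height):
    """Point-in-region scoring; unparseable output counts as a miss."""
    point = parse_point(raw_output)
    if point is None:
        return False
    px, py = to_pixels(point, width, height)
    left, top, right, bottom = bbox
    return left <= px <= right and top <= py <= bottom
```

For example, on a 1920x1080 screenshot the output `{"x": 500, "y": 500}` maps to pixel (960, 540) and is a hit for any box containing that point.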
| Benchmark | duvo-eye-1 | Holo-3.1-35B-A3B (base, same harness) | Files |
|---|---|---|---|
| ScreenSpot-Pro (1,581, max_pixels 8M)² | 72.9 | 56.1* | screenspot-pro/ |
| ScreenSpot-v2 (1,272) | 95.1 | - | screenspot-v2/ |
| OSWorld-G (510 standard subset) | 78.0 | 64.9 | osworld-g/ |
| OSWorld-G (full 564, refusals as misses) | 70.6 | - | osworld-g/ (bench_oswg564_*) |
| OSWorld-G refined instructions (564 / 510) | 75.0 / 82.9 | - | osworld-g-refined/ |
| WebClick (1,639) | 93.6 | - | webclick/ |
| UI-I2E-Bench (1,477) | 84.2 | - | ui-i2e-bench/ |
| UI-Vision element grounding (5,479; macro-avg) | 64.4 | - | ui-vision/ |
| Showdown-Clicks dev (557)¹ | 78.8 | - | showdown-clicks/ |
* Base ScreenSpot-Pro was measured under the identical harness and verified at eval time, but its per-sample predictions were not retained; H Company's published number for the base model is 71.5 (different protocol, see their model card).
¹ Scored by Showdown's own official metric (predicted point inside the ground-truth bounding box, is_in_bbox), so it is directly comparable to their leaderboard. n=557 dev split (95% CI ≈ ±3.5pp).
² Also reproduced under the benchmark's official eval_screenspot_pro.py harness: 72.87 (1152/1581), matching our 72.93 within one sample (screenspot-pro/duvo_eye_1_official_harness.json).
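The arithmetic behind the last two footnotes can be checked with a short sketch. The confidence interval here uses the normal (Wald) approximation, which is an assumption; the footnote does not say which interval was used. The hit count of 1153 for our harness is likewise inferred from 72.93% of 1,581, not stated in the source.

```python
import math


def accuracy_pct(hits, total):
    """Accuracy as a percentage."""
    return 100.0 * hits / total


def ci_half_width_pp(accuracy_percent, n, z=1.96):
    """Half-width of a normal-approximation CI, in percentage points."""
    p = accuracy_percent / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


# Showdown-Clicks dev: 78.8% on n=557
showdown_half_width = ci_half_width_pp(78.8, 557)

# ScreenSpot-Pro: official harness vs. our harness (1153 is inferred)
official = round(accuracy_pct(1152, 1581), 2)
ours = round(accuracy_pct(1153, 1581), 2)
```

Under these assumptions the Showdown half-width comes out at roughly 3.4 percentage points, in line with the rounded ±3.5pp quoted above, and the two ScreenSpot-Pro figures differ by exactly one sample.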
Refusal note: guided decoding always emits a coordinate, so the model cannot earn OSWorld-G refusal credit; full-564 numbers score all 54 refusal items as misses.
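The relationship between the 510-item and 564-item OSWorld-G rows follows directly from this note: the hit count stays the same and only the denominator grows by the 54 refusal items. A minimal sketch, where the function name is hypothetical and the hit count is recovered by rounding from the reported percentage:

```python
def full_set_accuracy(subset_accuracy_pct, subset_n=510, refusal_n=54):
    """Accuracy on the full set when every refusal item is a miss."""
    # Recover the integer hit count from the reported subset percentage.
    hits = round(subset_accuracy_pct / 100.0 * subset_n)
    return 100.0 * hits / (subset_n + refusal_n)


standard_full = round(full_set_accuracy(78.0), 1)  # standard instructions
refined_full = round(full_set_accuracy(82.9), 1)   # refined instructions
```

With the table's subset scores as input, this reproduces the full-564 figures of 70.6 and 75.0.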