Enhance model card: Add pipeline tag, library name, abstract, and evaluation results (#5)
- Enhance model card: Add pipeline tag, library name, abstract, and evaluation results (cd5156b12794925b077da71b19f38428a829cbda)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
|
@@ -1,16 +1,25 @@
|
|
| 1 |
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
language:
|
| 4 |
-
- en
|
| 5 |
base_model:
|
| 6 |
- Qwen/Qwen2.5-VL-3B-Instruct
|
| 7 |
tags:
|
| 8 |
- GUI
|
| 9 |
- multimodal
|
| 10 |
---
|
|
|
|
| 11 |
ZonUI-3B — A lightweight, resolution-aware GUI grounding model trained with only 24K samples on a single RTX 4090.
|
| 12 |
- **Repository:** https://github.com/Han1018/ZonUI-3B
|
| 13 |
-
- **Paper:** https://arxiv.org/abs/2506.23491
|
| 14 |
|
| 15 |
## ⭐ Quick start
|
| 16 |
|
|
@@ -135,4 +144,58 @@ try:
|
|
| 135 |
display(result_image)
|
| 136 |
except Exception as e:
|
| 137 |
print(f"Error parsing coordinates: {e}")
|
| 138 |
```
|
|
|
|
| 1 |
---
|
| 2 |
base_model:
|
| 3 |
- Qwen/Qwen2.5-VL-3B-Instruct
|
| 4 |
+
language:
|
| 5 |
+
- en
|
| 6 |
+
license: apache-2.0
|
| 7 |
tags:
|
| 8 |
- GUI
|
| 9 |
- multimodal
|
| 10 |
+
pipeline_tag: image-text-to-text
|
| 11 |
+
library_name: transformers
|
| 12 |
---
|
| 13 |
+
|
| 14 |
ZonUI-3B — A lightweight, resolution-aware GUI grounding model trained with only 24K samples on a single RTX 4090.
|
| 15 |
+
|
| 16 |
+
This model was presented in the paper [ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding](https://huggingface.co/papers/2506.23491).
|
| 17 |
+
|
| 18 |
+
## Abstract
|
| 19 |
+
In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples drawn from mobile, desktop, and web GUI screenshots, which addresses data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, in which initial cross-platform training establishes robust GUI understanding and subsequent fine-tuning on high-resolution data improves model adaptability; and (iii) data curation and redundancy reduction strategies, which show that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, shows that ZonUI-3B achieves 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. ZonUI-3B is available from the repository linked below.
|
| 20 |
+
|
| 21 |
- **Repository:** https://github.com/Han1018/ZonUI-3B
|
| 22 |
+
- **Paper (arXiv):** https://arxiv.org/abs/2506.23491
|
| 23 |
|
| 24 |
## ⭐ Quick start
|
| 25 |
|
|
|
|
| 144 |
display(result_image)
|
| 145 |
except Exception as e:
|
| 146 |
print(f"Error parsing coordinates: {e}")
|
| 147 |
+
```
|
| 148 |
+
|
| 149 |
+
## 🎉 Main Results
|
| 150 |
+
|
| 151 |
+
### ScreenSpot
|
| 152 |
+
|
| 153 |
+
| Grounding Model | Avg Score | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon |
|
| 154 |
+
|--------------------------|--------|-------------|-------------|---------------|----------------|-----------|-----------|
|
| 155 |
+
| **General Models** | | | | | | | |
|
| 156 |
+
| Qwen2.5-VL-3B | 55.5 | - | - | - | - | - | - |
|
| 157 |
+
| InternVL3-8B | 79.5 | - | - | - | - | - | - |
|
| 158 |
+
| Claude3.5 Sonnet | 83.0 | - | - | - | - | - | - |
|
| 159 |
+
| Gemini-2 Flash | 84.0 | - | - | - | - | - | - |
|
| 160 |
+
| Qwen2.5-VL-7B | 84.7 | - | - | - | - | - | - |
|
| 161 |
+
| **GUI-specific Models** | | | | | | | |
|
| 162 |
+
| CogAgent-18B | 47.4 | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 |
|
| 163 |
+
| SeeClick-9.6B | 53.4 | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 |
|
| 164 |
+
| OmniParser | 73.0 | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 |
|
| 165 |
+
| UGround-7B | 73.3 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 |
|
| 166 |
+
| ShowUI-2B | 75.0 | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 |
|
| 167 |
+
| UI-TARS-2B | 82.3 | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 |
|
| 168 |
+
| OS-Atlas-7B | 82.5 | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 |
|
| 169 |
+
| Aguvis-7B | 84.4 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 |
|
| 170 |
+
| **ZonUI-3B** | **84.9** | **96.3** | **81.6** | **93.8** | **74.2** | 89.5 | 74.2 |
|
| 171 |
+
|
| 172 |
+
|
| 173 |
+
### ScreenSpot-v2
|
| 174 |
+
|
| 175 |
+
| Grounding Model | Avg Score | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon |
|
| 176 |
+
|--------------------------|--------|-------------|-------------|---------------|----------------|-----------|-----------|
|
| 177 |
+
| **General Models** | | | | | | | |
|
| 178 |
+
| InternVL3-8B | 81.4 | - | - | - | - | - | - |
|
| 179 |
+
| **GUI-specific Models** | | | | | | | |
|
| 180 |
+
| SeeClick-9.6B | 55.1 | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 |
|
| 181 |
+
| UGround-7B | 76.3 | 84.5 | 61.6 | 85.1 | 61.4 | 84.6 | 71.9 |
|
| 182 |
+
| ShowUI-2B | 77.3 | 92.1 | 75.4 | 78.9 | 59.3 | 84.2 | 61.1 |
|
| 183 |
+
| OS-Atlas-7B | 84.1 | 95.1 | 75.8 | 90.7 | 63.5 | 90.6 | 77.3 |
|
| 184 |
+
| UI-TARS-2B | 84.7 | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 |
|
| 185 |
+
| **ZonUI-3B** | **86.4** | **97.9** | **84.8** | **93.8** | **75.0** | **91.0** | 75.8 |
|
| 186 |
+
|
| 187 |
+
## 🫶 Acknowledgement
|
| 188 |
+
We would like to acknowledge [ShowUI](https://github.com/showlab/ShowUI) for making their code and data publicly available, which was instrumental to our development.
|
| 189 |
+
|
| 190 |
+
## ✍️ BibTeX
|
| 191 |
+
If our work contributes to your research, we would appreciate it if you could cite our paper.
|
| 192 |
+
|
| 193 |
+
```
|
| 194 |
+
@misc{hsieh2025zonui3b,
|
| 195 |
+
title = {ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding},
|
| 196 |
+
author = {Hsieh, ZongHan and Wei, Tzer-Jen and Yang, ShengJing},
|
| 197 |
+
year = {2025},
|
| 198 |
+
howpublished = {\url{https://arxiv.org/abs/2506.23491}},
|
| 199 |
+
note = {arXiv:2506.23491 [cs.CV], version 2, last revised 1 Jul 2025}
|
| 200 |
+
}
|
| 201 |
```
|
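The ScreenSpot and ScreenSpot-v2 tables added in this diff report an "Avg Score" next to six per-category accuracies, but the model card does not say how the average is computed. The sketch below is a quick consistency check for the ZonUI-3B rows, under the assumption that the average is the unweighted mean of the six categories. That assumption is not stated in the source, and the benchmark authors may weight categories differently. The names `ZONUI_3B_ROWS`, `unweighted_mean`, and `matches_reported` are hypothetical helpers introduced only for this illustration.

```python
# Consistency check on the "Avg Score" column for the ZonUI-3B rows.
# ASSUMPTION (not stated in the model card): Avg Score is the unweighted
# mean of the six per-category accuracies.

# Column order used in both results tables.
CATEGORIES = [
    "Mobile-Text", "Mobile-Icon",
    "Desktop-Text", "Desktop-Icon",
    "Web-Text", "Web-Icon",
]

# Values copied from the ZonUI-3B rows of the two tables (percent).
ZONUI_3B_ROWS = {
    "ScreenSpot": {
        "scores": [96.3, 81.6, 93.8, 74.2, 89.5, 74.2],
        "reported_avg": 84.9,
    },
    "ScreenSpot-v2": {
        "scores": [97.9, 84.8, 93.8, 75.0, 91.0, 75.8],
        "reported_avg": 86.4,
    },
}


def unweighted_mean(scores):
    """Arithmetic mean of the per-category accuracies, in percent."""
    return sum(scores) / len(scores)


def matches_reported(scores, reported_avg, tolerance=0.05):
    """True if the mean agrees with the reported value to one decimal place."""
    # The small epsilon absorbs floating-point error at the rounding boundary.
    return abs(unweighted_mean(scores) - reported_avg) <= tolerance + 1e-9


if __name__ == "__main__":
    for benchmark, row in ZONUI_3B_ROWS.items():
        mean = unweighted_mean(row["scores"])
        ok = matches_reported(row["scores"], row["reported_avg"])
        print(f"{benchmark}: mean={mean:.2f}, "
              f"reported={row['reported_avg']}, consistent={ok}")
```

Under this assumption both ZonUI-3B rows agree with their reported averages to one decimal place. The same helpers can be pointed at any other row that lists all six category scores; rows with only an average (the "General Models") cannot be checked this way.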