Enhance model card: Add pipeline tag, library name, abstract, and evaluation results (#5)

- Enhance model card: Add pipeline tag, library name, abstract, and evaluation results (cd5156b12794925b077da71b19f38428a829cbda)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)

README.md +67 -4

README.md CHANGED

@@ -1,16 +1,25 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - Qwen/Qwen2.5-VL-3B-Instruct
 tags:
 - GUI
 - multimodal
 ---
 ZonUI-3B — A lightweight, resolution-aware GUI grounding model trained with only 24K samples on a single RTX 4090.
 - **Repository:** https://github.com/Han1018/ZonUI-3B
-- **Paper:** https://arxiv.org/abs/2506.23491
 ## ⭐ Quick start
@@ -135,4 +144,58 @@ try:
     display(result_image)
 except Exception as e:
     print(f"Error parsing coordinates: {e}")
 ```

 ---
 base_model:
 - Qwen/Qwen2.5-VL-3B-Instruct
+language:
+- en
+license: apache-2.0
 tags:
 - GUI
 - multimodal
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 ZonUI-3B — A lightweight, resolution-aware GUI grounding model trained with only 24K samples on a single RTX 4090.
+This model was presented in the paper [ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding](https://huggingface.co/papers/2506.23491).
+## Abstract
+In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples from diverse sources, including mobile, desktop, and web GUI screenshots, to address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights ZonUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. ZonUI-3B is available at the repository linked below.
 - **Repository:** https://github.com/Han1018/ZonUI-3B
+- **Paper (arXiv):** https://arxiv.org/abs/2506.23491
 ## ⭐ Quick start
     display(result_image)
 except Exception as e:
     print(f"Error parsing coordinates: {e}")
+```
+## 🎉 Main Results
+### ScreenSpot
+| Grounding Model          | Avg Score  | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon |
+|--------------------------|--------|-------------|-------------|---------------|----------------|-----------|-----------|
+| **General Models**       |        |             |             |               |                |           |           |
+| Qwen2.5-VL-3B            | 55.5   | -           | -           | -             | -              | -         | -         |
+| InternVL3-8B             | 79.5   | -           | -           | -             | -              | -         | -         |
+| Claude3.5 Sonnet         | 83.0   | -           | -           | -             | -              | -         | -         |
+| Gemini-2 Flash           | 84.0   | -           | -           | -             | -              | -         | -         |
+| Qwen2.5-VL-7B            | 84.7   | -           | -           | -             | -              | -         | -         |
+| **GUI-specific Models**  |        |             |             |               |                |           |           |
+| CogAgent-18B             | 47.4   | 67.0        | 24.0        | 74.2          | 20.0           | 70.4      | 28.6      |
+| SeeClick-9.6B            | 53.4   | 78.0        | 52.0        | 72.2          | 30.0           | 55.7      | 32.5      |
+| OmniParser               | 73.0   | 93.9        | 57.0        | 91.3          | 63.6           | 81.3      | 51.0      |
+| UGround-7B               | 73.3   | 82.8        | 60.3        | 82.5          | 63.6           | 80.4      | 70.4      |
+| ShowUI-2B                | 75.0   | 91.6        | 69.0        | 81.8          | 59.0           | 83.0      | 65.5      |
+| UI-TARS-2B               | 82.3   | 93.0        | 75.5        | 90.7          | 68.6           | 84.3      | 74.8      |
+| OS-Atlas-7B              | 82.5   | 93.0        | 72.9        | 91.8          | 62.9           | 90.9      | 74.3      |
+| Aguvis-7B                | 84.4   | 95.6        | 77.7        | 93.8          | 67.1           | 88.3      | 75.2      |
+| **ZonUI-3B**          | **84.9** | **96.3**    | **81.6**    | **93.8**      | **74.2**       | 89.5      | 74.2      |
+### ScreenSpot-v2
+| Grounding Model          | Avg Score  | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon |
+|--------------------------|--------|-------------|-------------|---------------|----------------|-----------|-----------|
+| **General Models**       |        |             |             |               |                |           |           |
+| InternVL3-8B             | 81.4   | -           | -           | -             | -              | -         | -         |
+| **GUI-specific Models**  |        |             |             |               |                |           |           |
+| SeeClick-9.6B            | 55.1   | 78.4        | 50.7        | 70.1          | 29.3           | 55.2      | 32.5      |
+| UGround-7B               | 76.3   | 84.5        | 61.6        | 85.1          | 61.4           | 84.6      | 71.9      |
+| ShowUI-2B                | 77.3   | 92.1        | 75.4        | 78.9          | 59.3           | 84.2      | 61.1      |
+| OS-Atlas-7B              | 84.1   | 95.1        | 75.8        | 90.7          | 63.5           | 90.6      | 77.3      |
+| UI-TARS-2B               | 84.7   | 95.2        | 79.1        | 90.7          | 68.6           | 87.2      | 78.3      |
+| **ZonUI-3B**        | **86.4** | **97.9**    | **84.8**    | **93.8**      | **75.0**       | **91.0**  | 75.8      |
+## 🫶 Acknowledgement
+We would like to acknowledge [ShowUI](https://github.com/showlab/ShowUI) for making their code and data publicly available, which was instrumental to our development.
+## ✍️ BibTeX
+If our work contributes to your research, we would appreciate it if you could cite our paper.
+```
+@misc{hsieh2025zonui3b,
+  title        = {ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding},
+  author       = {Hsieh, ZongHan and Wei, Tzer-Jen and Yang, ShengJing},
+  year         = {2025},
+  howpublished = {\url{https://arxiv.org/abs/2506.23491}},
+  note         = {arXiv:2506.23491 [cs.CV], version 2, last revised 1 Jul 2025}
+}
 ```
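
As a quick sanity check on the Main Results tables, the sketch below recomputes the ZonUI-3B average as the plain unweighted mean of the six per-category scores. The variable and function names are illustrative and not part of the model card, and the card does not say how the `Avg Score` column was computed, so treating it as an unweighted mean is an assumption.

```python
# Per-category scores for the ZonUI-3B rows, copied from the Main Results tables.
# Column order: Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
ZONUI_SCORES = {
    "ScreenSpot": [96.3, 81.6, 93.8, 74.2, 89.5, 74.2],
    "ScreenSpot-v2": [97.9, 84.8, 93.8, 75.0, 91.0, 75.8],
}


def unweighted_mean(scores):
    """Return the arithmetic mean of a non-empty list of scores.

    Assumption: every category counts equally, regardless of how many
    samples it contains in the benchmark.
    """
    return sum(scores) / len(scores)


for benchmark, scores in ZONUI_SCORES.items():
    # Round to one decimal place to match the precision used in the tables.
    print(f"{benchmark}: {unweighted_mean(scores):.1f}")
```

Under that assumption, the ZonUI-3B rows round to the reported 84.9 and 86.4. Some other rows (UI-TARS-2B on ScreenSpot, for example) do not reproduce this way, so those averages may have been weighted differently, perhaps by sample count.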