eric1993 committed
Commit e196881 · verified · 1 Parent(s): 408d14b

Update README.md

Files changed (1)
  1. README.md +148 -3
README.md CHANGED
@@ -1,3 +1,148 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ ---
4
+ <h1 align="center">
5
+ <img src="static/images/BA_icon.jpg" width="100" alt="AndroidControl-Curated Logo" />
6
+ <br>
7
+ AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification
8
+ </h1>
9
+
10
+ <p align="center">
11
+ <a href="https://arxiv.org/abs/2510.18488v1"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b?style=flat&logo=arxiv&logoColor=white" alt="arXiv Paper"></a>
12
+ <a href="https://huggingface.co/datasets/batwBMW/AndroidControl_Curated"><img src="https://img.shields.io/badge/🤗%20HuggingFace-Dataset-ff9800?style=flat" alt="Hugging Face Dataset"></a>
13
+ <a href="https://huggingface.co/batwBMW/Magma-R1"><img src="https://img.shields.io/badge/🤗%20HuggingFace-Models-ff9800?style=flat" alt="Hugging Face Model"></a>
14
+ </p>
15
+
16
+ <br>
17
+ <p align="center">
18
+ <strong>This is the official repository for the paper <a href="https://github.com/batechworks/AndroidControl_Curated">AndroidControl-Curated</a>.</strong>
19
+ </p>
20
+
21
+ ## 🌟 Overview
22
+
23
+ On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their adoption is hindered by a perception of poor performance: ~3B-parameter models score as low as 60% on benchmarks like AndroidControl, far from viable for real-world use.
24
+
25
+ Our research reveals that the issue lies not only with the models but also with the benchmarks themselves. We identified notable shortcomings in AndroidControl, including ambiguities and factual errors, which cause agent capabilities to be systematically underrated. To address this critical oversight, we enhanced AndroidControl into **AndroidControl-Curated**, a refined version of the benchmark improved through a rigorous purification pipeline.
26
+
27
+ On this enhanced benchmark, state-of-the-art models achieve success rates nearing 80% on complex tasks, reflecting that on-device GUI agents are actually closer to practical deployment than previously thought. We also trained our new SOTA model, **Magma-R1**, on just 2,400 curated samples, which matches the performance of previous models trained on over 31,000 samples.
28
+
29
+ <div align="center">
30
+ <img src="static/images/method_1021_1355-compress.png" width="90%" alt="Method Overview">
31
+ <p><i>Overview of our integrated pipeline for Magma-R1 training and AndroidControl-Curated creation.</i></p>
32
+ </div>
33
+
34
+ ## 🔥 News
35
+ - 🔥 ***`2025/10/21`*** Our paper "[AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification](https://arxiv.org/abs/2510.18488v1)" has been released.
36
+
37
+ ## 🚀 Updates
38
+ - ***`2025/10/21`*** The source code for `AndroidControl-Curated` and `Magma-R1` has been released.
39
+
40
+ ## 📊 Results
41
+
42
+ ### Table 1. Performance comparison of GUI agents on AndroidControl-Curated
43
+ *Grounding Accuracy (GA) for all models is evaluated using our proposed E_bbox. The best results are in **bold**, and the second best are <u>underlined</u>. "-" indicates results to be added.*
44
+
45
+ | Model | AndroidControl-Curated-Easy ||| AndroidControl-Curated-Hard |||
46
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: |
47
+ || Type (%) | Grounding (%) | SR (%) | Type (%) | Grounding (%) | SR (%) |
48
+ | ***Proprietary Models*** | | | | | | |
49
+ | GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 |
50
+ | ***Open-source Models*** | | | | | | |
51
+ | OS-Atlas-4B | **91.9** | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 |
52
+ | UI-R1 | 62.2 | 93.6 | 58.9 | 54.4 | 79.3 | 43.6 |
53
+ | GUI-R1-3B | 69.5 | <u>94.7</u> | 67.1 | 63.1 | 80.3 | 54.4 |
54
+ | GUI-R1-7B | 74.9 | **95.9** | 72.7 | 66.5 | 82.6 | 57.5 |
55
+ | Infi-GUI-R1 (trained on 31k origin data) | 90.2 | 93.7 | <u>87.2</u> | 78.5 | 72.8 | 70.7 |
56
+ | Qwen3-VL-30B | 82.8 | 80.7 | 70.5 | <u>85.9</u> | 78.9 | 70.0 |
57
+ | Qwen3-VL-235B | 85.1 | 82.9 | 74.5 | **88.2** | <u>83.6</u> | **76.5** |
58
+ | ***Ours*** | | | | | | |
59
+ | Magma-R1 | <u>91.3</u> | 94.2 | **88.0** | 84.2 | **84.8** | <u>75.3</u> |
60
+
61
+ ### Table 2. Ablation analysis of the benchmark purification process on the Hard subset
62
+ *SR Impr. (G) shows the SR gain from AndroidControl to AndroidControl-Curated-Box. SR Impr. (T) shows the SR gain from AndroidControl-Curated-Box to the final AndroidControl-Curated. Best results are in **bold**, second best are <u>underlined</u>.*
63
+
64
+ | Model | AndroidControl ||| AndroidControl-Curated-Box |||| AndroidControl-Curated ||||
65
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
66
+ || Type (%) | Grounding (%) | SR (%) | Type (%) | Grounding (%) | SR (%) | SR Impr. (G) | Type (%) | Grounding (%) | SR (%) | SR Impr. (T) |
67
+ | GUI-R1-3B | 57.2 | 59.0 | 41.5 | 59.3 | 74.0 | 49.4 | +7.9 | 63.1 | 80.3 | 54.4 | +5.0 |
68
+ | GUI-R1-7B | 62.5 | 65.1 | 46.3 | 63.3 | 76.9 | 53.2 | +6.9 | 66.5 | 82.6 | 57.5 | +4.3 |
69
+ | Infi-GUI-R1 | <u>77.0</u> | 57.0 | <u>59.0</u> | 77.7 | 69.5 | 67.6 | +8.6 | 78.5 | 72.8 | 70.7 | +3.1 |
70
+ | Qwen3-VL-235B | 67.3 | **78.3** | **61.2** | **82.9** | **79.9** | **71.7** | +10.5 | **88.2** | <u>83.6</u> | **76.5** | +4.8 |
71
+ | Magma-R1 | **78.2** | 58.2 | 57.6 | <u>80.0</u> | <u>77.1</u> | <u>69.1</u> | **+11.5** | <u>84.2</u> | **84.8** | <u>75.3</u> | **+6.2** |
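
The two "SR Impr." columns in Table 2 are plain differences in percentage points between adjacent benchmark versions. The short sketch below recomputes them from the SR values in the table; the dictionary layout and the function name `sr_improvements` are illustrative choices, not part of the released code.

```python
# SR (%) on the Hard subset, copied from Table 2, in the order:
# (AndroidControl, AndroidControl-Curated-Box, AndroidControl-Curated)
SR_HARD = {
    "GUI-R1-3B": (41.5, 49.4, 54.4),
    "GUI-R1-7B": (46.3, 53.2, 57.5),
    "Infi-GUI-R1": (59.0, 67.6, 70.7),
    "Qwen3-VL-235B": (61.2, 71.7, 76.5),
    "Magma-R1": (57.6, 69.1, 75.3),
}


def sr_improvements(original, box, curated):
    """Return (SR Impr. (G), SR Impr. (T)) in percentage points.

    G: gain from AndroidControl to AndroidControl-Curated-Box.
    T: gain from AndroidControl-Curated-Box to AndroidControl-Curated.
    """
    # Round to one decimal to match the precision reported in the table.
    return round(box - original, 1), round(curated - box, 1)


for model, srs in SR_HARD.items():
    gain_g, gain_t = sr_improvements(*srs)
    print(f"{model}: SR Impr. (G) = +{gain_g}, SR Impr. (T) = +{gain_t}")
```

For every row, the recomputed gains agree with the values reported in the table.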
72
+
73
+ ## 🚀 Setup & Installation
74
+
75
+ 1. **Clone the repository:**
76
+ ```bash
77
+ git clone https://github.com/batechworks/AndroidControl_Curated.git
78
+ cd AndroidControl_Curated
79
+ ```
80
+
81
+ 2. **Install dependencies:**
82
+ We recommend using a virtual environment (e.g., conda or venv).
83
+ ```bash
84
+ pip install -r requirement.py
85
+ ```
86
+
87
+ ## 🧪 Evaluation
88
+
89
+ To reproduce the results on `AndroidControl-Curated`:
90
+
91
+ 1. **Download the benchmark data:**
92
+ Download the processed test set from [Hugging Face](https://huggingface.co/datasets/batwBMW/AndroidControl_Curated) and place it in the `benchmark_resource/` directory. The directory should contain the following files:
93
+ - `android_control_high_bbox.json`
94
+ - `android_control_high_point.json`
95
+ - `android_control_low_bbox.json`
96
+ - `android_control_low_point.json`
97
+ - `android_control_high_task-improved.json`
98
+
99
+ 2. **Download the model:**
100
+ Download the `Magma-R1` model weights from [Hugging Face](https://huggingface.co/batwBMW/Magma-R1) and place them in your desired location.
101
+
102
+ 3. **Run the evaluation script:**
103
+ Execute the following command, making sure to update the paths to your model and the benchmark image directory.
104
+ ```bash
105
+ python eval/evaluate_actions_androidControl_vllm.py \
106
+ --model_path /path/to/your/Magma-R1-model \
107
+ --save_name Your_Results.xlsx \
108
+ --image_dir /path/to/your/benchmark_images_directory
109
+ ```
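
Before launching the evaluation, it can help to confirm that the five files listed in step 1 are actually in place. The following is a minimal sketch, assuming it is run from the repository root and that the data lives in `benchmark_resource/`; the helper name `missing_benchmark_files` is hypothetical and not part of the released code.

```python
from pathlib import Path

# File names taken from step 1 of the evaluation instructions.
EXPECTED_FILES = [
    "android_control_high_bbox.json",
    "android_control_high_point.json",
    "android_control_low_bbox.json",
    "android_control_low_point.json",
    "android_control_high_task-improved.json",
]


def missing_benchmark_files(resource_dir="benchmark_resource"):
    """Return the expected benchmark files that are absent from resource_dir."""
    root = Path(resource_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_benchmark_files()
    if missing:
        print("Missing benchmark files:", ", ".join(missing))
    else:
        print("All benchmark files found.")
```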
110
+
111
+ ## 🛠️ Methodology
112
+
113
+ Our methodology consists of two main parts:
114
+
115
+ ### Systematic Benchmark Purification: The AndroidControl-Curated Pipeline
116
+
117
+ 1. **Stage 1: From Coordinate Matching to Intent Alignment in Grounding Evaluation**
118
+ - Replace overly strict point-based matching with bounding-box-based intent alignment
119
+ - Evaluate whether predicted points fall within target UI element bounding boxes
120
+
121
+ 2. **Stage 2: Task-Level Correction via LLM-Human Collaboration**
122
+ - High-risk sample identification via execution consensus failure
123
+ - Automated causal attribution and correction with LLMs
124
+ - Rigorous human expert verification
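
The Stage 1 idea of bounding-box-based grounding evaluation can be sketched as follows. This is an illustrative example only: the `(x_min, y_min, x_max, y_max)` box layout, the inclusive borders, and the function names are assumptions, and the paper's E_bbox metric may differ in detail.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
# Assumed box layout: (x_min, y_min, x_max, y_max) in screen pixels.
BBox = Tuple[float, float, float, float]


def point_in_bbox(point: Point, bbox: BBox) -> bool:
    """True if the predicted point lies inside the target element's box."""
    x, y = point
    x_min, y_min, x_max, y_max = bbox
    # Borders are counted as hits (an assumption of this sketch).
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(points: Sequence[Point], bboxes: Sequence[BBox]) -> float:
    """Fraction of predictions that land inside their target bounding box."""
    if len(points) != len(bboxes):
        raise ValueError("points and bboxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_bbox(p, b) for p, b in zip(points, bboxes))
    return hits / len(points)


if __name__ == "__main__":
    preds = [(120.0, 300.0), (40.0, 40.0)]
    targets = [(100.0, 280.0, 200.0, 320.0), (100.0, 100.0, 200.0, 200.0)]
    print(grounding_accuracy(preds, targets))
```

Under this criterion, any tap that lands on the intended UI element counts as correct, regardless of how far it is from the annotated point.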
125
+
126
+ ### Training Paradigm of Magma-R1: Optimization via GRPO
127
+
128
+ - **Dense Rewards**: Gaussian kernel-based grounding reward for continuous feedback
129
+ - **Balanced Learning**: Action type proportional optimization to address class imbalance
130
+ - **Efficient Training**: Group Relative Policy Optimization (GRPO)
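
To make the dense reward and the group-relative step concrete, here is a minimal sketch. The exact reward used for Magma-R1 is not specified in this README, so the Gaussian is assumed to be centered on the target box with a width tied to the box size (`sigma_scale` is a hypothetical parameter). The advantage normalization follows the usual GRPO recipe of standardizing rewards within a group of sampled responses; population standard deviation is used here for simplicity.

```python
import math
from statistics import mean, pstdev


def gaussian_grounding_reward(point, bbox, sigma_scale=0.5):
    """Dense reward in (0, 1]: 1.0 at the box center, decaying smoothly.

    Assumption: sigma along each axis is sigma_scale times the box side.
    """
    x, y = point
    x_min, y_min, x_max, y_max = bbox
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    # Guard against degenerate (zero-width or zero-height) boxes.
    sx = max((x_max - x_min) * sigma_scale, 1e-6)
    sy = max((y_max - y_min) * sigma_scale, 1e-6)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    target = (100.0, 100.0, 200.0, 200.0)
    samples = [(150.0, 150.0), (190.0, 150.0), (300.0, 150.0)]
    rewards = [gaussian_grounding_reward(p, target) for p in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Unlike a binary hit-or-miss signal, this reward still distinguishes a near miss from a far miss, which is the point of continuous feedback.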
131
+
132
+ ## 📚 Citation Information
133
+
134
+ If you find this work useful, a citation to the following paper would be appreciated:
135
+
136
+ ```bibtex
137
+ @article{leung2025androidcontrolcurated,
138
+ title={AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification},
139
+ author={LEUNG Ho Fai (Kevin) and XI XiaoYan (Sibyl) and ZUO Fei (Eric)},
140
+ journal={arXiv preprint arXiv:2510.18488},
141
+ year={2025},
142
+ institution={BMW ArcherMind Information Technology Co. Ltd. (BA TechWorks)}
143
+ }
144
+ ```
145
+
146
+ ## 🙏 Acknowledgments
147
+
148
+ We thank the anonymous reviewers for their valuable feedback and suggestions. This work was made possible by the generous support of several organizations. We extend our sincere gratitude to ArcherMind for providing the high-performance computing resources essential for our experiments. We would also like to acknowledge the BMW Group for their significant administrative support. Furthermore, we are grateful to BA TechWorks for invaluable technical support and collaboration throughout this project.