UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning
Implementation of GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning (arXiv:2505.12493).
This repo provides a complete training pipeline for GUI-Shift, built on TRL's native `GRPOTrainer` with Qwen2.5-VL.
- `build_data.py`: Constructs the K-step GUI Transition dataset from AndroidControl
- `train_gui_shift.py`: GRPO training with rule-based rewards (format + action correctness)
- `run_train.sh`: Launch script with hyperparameters matching the paper

1. Build dataset: `python build_data.py`
2. Train: `bash run_train.sh`
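
To make the dataset step more concrete, here is a minimal sketch of how K-step transition samples could be assembled from one recorded episode. It assumes a sample pairs the screenshot at step t with the screenshot at step t + k and uses the action taken at step t as the label. The `Transition` class, the `build_k_step_transitions` helper, and the file names are hypothetical and are not taken from `build_data.py`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    before: str   # screenshot path at step t
    after: str    # screenshot path at step t + k
    action: dict  # action taken at step t (assumed to be the training label)


def build_k_step_transitions(screenshots: List[str], actions: List[dict], k: int = 1) -> List[Transition]:
    """Pair screenshot t with screenshot t + k for one episode (hypothetical helper)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # An episode with n actions is assumed to have n + 1 screenshots.
    if len(screenshots) != len(actions) + 1:
        raise ValueError("expected one more screenshot than actions")
    return [
        Transition(screenshots[t], screenshots[t + k], actions[t])
        for t in range(len(screenshots) - k)
    ]


if __name__ == "__main__":
    shots = ["s0.png", "s1.png", "s2.png", "s3.png"]
    acts = [{"action_type": "click"}, {"action_type": "scroll"}, {"action_type": "wait"}]
    for sample in build_k_step_transitions(shots, acts, k=2):
        print(sample)
```

No human annotation is needed for such samples, since both the screenshot pair and the label come directly from the recorded trajectory.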
| Parameter | Value |
|---|---|
| learning_rate | 1e-6 |
| num_generations | 8 |
| num_train_epochs | 4 |
| max_prompt_length | 1024 |
| max_completion_length | 256 |
| per_device_train_batch_size | 2 |
| gradient_accumulation_steps | 8 |
| beta (KL) | 0.04 |
| epsilon (clip) | 0.2 |
| temperature | 0.9 |
| bf16 | true |
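
The batch-related values in the table interact with `num_generations`, so a quick arithmetic check can be useful before launching a run. The sketch below keeps the table values in a plain dictionary and assumes that `per_device_train_batch_size` counts completions rather than prompts, which is my understanding of recent TRL releases and may differ in other versions. The helper names and the device count are hypothetical; the number of GPUs is not stated in this card.

```python
# Values copied from the hyperparameter table above.
hparams = {
    "learning_rate": 1e-6,
    "num_generations": 8,
    "num_train_epochs": 4,
    "max_prompt_length": 1024,
    "max_completion_length": 256,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "beta": 0.04,
    "epsilon": 0.2,
    "temperature": 0.9,
    "bf16": True,
}


def completions_per_optimizer_step(cfg: dict, num_devices: int = 1) -> int:
    """Completions consumed by one optimizer update (hypothetical helper)."""
    return (
        cfg["per_device_train_batch_size"]
        * cfg["gradient_accumulation_steps"]
        * num_devices
    )


def prompts_per_optimizer_step(cfg: dict, num_devices: int = 1) -> int:
    """Unique prompts per update, assuming each prompt yields num_generations completions."""
    total = completions_per_optimizer_step(cfg, num_devices)
    if total % cfg["num_generations"] != 0:
        raise ValueError("completions per step must be divisible by num_generations")
    return total // cfg["num_generations"]


if __name__ == "__main__":
    # num_devices=1 is an assumption made only for this illustration.
    print(completions_per_optimizer_step(hparams, num_devices=1))
    print(prompts_per_optimizer_step(hparams, num_devices=1))
```

Under these assumptions, a single device processes 16 completions per optimizer step, which corresponds to 2 unique prompts with 8 sampled completions each.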
The unified action space covers 8 types:
- `click`, `long_press` (with x, y coordinates)
- `scroll` (with direction)
- `open_app` (with app_name)
- `input_text` (with text content)
- `navigate_back`, `navigate_home`, `wait`

Actions are output inside `<answer>...</answer>` tags.
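
The rule-based rewards (format + action correctness) can be illustrated with a small sketch over this action space. It assumes the action is serialized as a JSON object inside the `<answer>` tags with an `action_type` key, and that click coordinates count as correct within a pixel tolerance. The JSON layout, the field names `x`, `y`, and `direction`, the `tol` value, and all function names are hypothetical; the actual logic in `train_gui_shift.py` may differ.

```python
import json
import math
import re

# Required fields per action type; the field names are assumptions.
REQUIRED_FIELDS = {
    "click": ("x", "y"),
    "long_press": ("x", "y"),
    "scroll": ("direction",),
    "open_app": ("app_name",),
    "input_text": ("text",),
    "navigate_back": (),
    "navigate_home": (),
    "wait": (),
}
ACTION_TYPES = set(REQUIRED_FIELDS)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_action(completion: str):
    """Return the action dict found inside <answer> tags, or None if malformed."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    try:
        action = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    kind = action.get("action_type")
    if not isinstance(kind, str) or kind not in REQUIRED_FIELDS:
        return None
    if any(field not in action for field in REQUIRED_FIELDS[kind]):
        return None
    return action


def format_reward(completion: str) -> float:
    """1.0 if the completion contains a well-formed action, else 0.0."""
    return 1.0 if parse_action(completion) is not None else 0.0


def action_reward(completion: str, target: dict, tol: float = 50.0) -> float:
    """1.0 if the predicted action matches the target; tol (pixels) is an assumption."""
    pred = parse_action(completion)
    if pred is None or pred["action_type"] != target["action_type"]:
        return 0.0
    kind = pred["action_type"]
    if kind in ("click", "long_press"):
        try:
            dist = math.hypot(float(pred["x"]) - target["x"], float(pred["y"]) - target["y"])
        except (TypeError, ValueError):
            return 0.0
        return 1.0 if dist <= tol else 0.0
    # Remaining types: every required field must match exactly.
    return 1.0 if all(pred[f] == target[f] for f in REQUIRED_FIELDS[kind]) else 0.0
```

These functions score one completion at a time; using them with a trainer would require wrapping them to accept a batch of completions and return one score per completion.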