GUI-Shift Implementation

Implementation of GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning (arXiv:2505.12493).

Overview

This repo provides a complete GUI-Shift training pipeline built on TRL's native GRPOTrainer with Qwen2.5-VL. It contains three files:

- build_data.py: Constructs the K-step GUI Transition dataset from AndroidControl
- train_gui_shift.py: GRPO training with rule-based rewards (format + action correctness)
- run_train.sh: Launch script with hyperparameters matching the paper

1. Build dataset: python build_data.py
   - Downloads AndroidControl (ckg/AndroidControlParsedWithImages-20k)
   - Extracts K-step state pairs (default k=1)
   - Saves 2000 transition samples as JSONL
2. Train: bash run_train.sh
   - Loads Qwen2.5-VL-7B-Instruct
   - Freezes vision encoder + projector
   - Trains with GRPO (format_reward + action_reward)
   - Pushes to Hugging Face Hub
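
The K-step state-pair extraction described above can be sketched as follows. This is a minimal illustration, not the contents of build_data.py: it assumes an episode arrives as a list of screenshots with one more entry than its list of actions, and that each pair is labeled with the first action taken from the earlier state. The function name and the field names (image_before, image_after, action) are hypothetical.

```python
from typing import Any, Dict, List


def build_k_step_pairs(
    screenshots: List[Any], actions: List[Any], k: int = 1
) -> List[Dict[str, Any]]:
    """Pair each state s_t with s_{t+k} and label the pair with action a_t.

    Assumes len(screenshots) == len(actions) + 1 (hypothetical episode layout).
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if len(screenshots) != len(actions) + 1:
        raise ValueError("expected one more screenshot than actions")
    samples = []
    for t in range(len(screenshots) - k):
        samples.append(
            {
                "image_before": screenshots[t],
                "image_after": screenshots[t + k],
                # First action on the path from s_t to s_{t+k} (assumed label).
                "action": actions[t],
            }
        )
    return samples


# Toy episode with placeholder file names and actions.
episode_screens = ["s0.png", "s1.png", "s2.png", "s3.png"]
episode_actions = [{"action": "click"}, {"action": "scroll"}, {"action": "wait"}]
pairs_k1 = build_k_step_pairs(episode_screens, episode_actions, k=1)
pairs_k2 = build_k_step_pairs(episode_screens, episode_actions, k=2)
```

With k=1 every consecutive pair of screenshots yields one sample; larger k yields fewer samples per episode, because the last k states have no successor k steps ahead.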

Parameter	Value
learning_rate	1e-6
num_generations	8
num_train_epochs	4
max_prompt_length	1024
max_completion_length	256
per_device_train_batch_size	2
gradient_accumulation_steps	8
beta (KL)	0.04
epsilon (clip)	0.2
temperature	0.9
bf16	true
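
For context on num_generations: GRPO samples a group of completions per prompt and scores each one relative to the rest of its group. The sketch below shows that normalization for one prompt with eight completions, assuming each completion's reward is the sum of the format and action rewards. The function name, the eps value, and the use of the population standard deviation are illustrative choices, not taken from train_gui_shift.py or from TRL's internals.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight hypothetical rewards (format + action, so each lies in {0, 1, 2}).
group_rewards = [2.0, 1.0, 1.0, 0.0, 2.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Completions that beat the group mean receive a positive advantage and those below it a negative one. A group whose completions all score the same yields zero advantage everywhere and so carries no learning signal.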

The unified action space covers 8 types: click, long_press, scroll, open_app, input_text, navigate_back, navigate_home, and wait.

Format reward: +1 if output contains <answer>...</answer> tags
Action reward: +1 if parsed action matches ground truth
- click/long_press: predicted point falls within the ground-truth bounding box
- scroll/open_app/input_text: exact parameter match
- navigate_back/navigate_home/wait: action type match only
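
The two reward rules above might look roughly like this in code. This is a sketch under stated assumptions, not the code in train_gui_shift.py: it assumes the text inside the <answer> tags is a JSON object, and the field names (action, point, bbox, direction, app_name, text) are hypothetical, as are the function names.

```python
import json
import re
from typing import Any, Dict, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Hypothetical parameter key compared for each exact-match action type.
PARAM_KEYS = {"scroll": "direction", "open_app": "app_name", "input_text": "text"}


def format_reward(completion: str) -> float:
    """1.0 if the completion contains an <answer>...</answer> span."""
    return 1.0 if ANSWER_RE.search(completion) else 0.0


def parse_action(completion: str) -> Optional[Dict[str, Any]]:
    """Return the JSON object inside the answer tags, or None if unparseable."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    try:
        action = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return action if isinstance(action, dict) else None


def action_reward(completion: str, truth: Dict[str, Any]) -> float:
    """1.0 if the parsed action matches the ground truth under the rules above."""
    pred = parse_action(completion)
    if pred is None or pred.get("action") != truth["action"]:
        return 0.0
    kind = truth["action"]
    if kind in ("click", "long_press"):
        point = pred.get("point")
        if not (
            isinstance(point, list)
            and len(point) == 2
            and all(isinstance(v, (int, float)) for v in point)
        ):
            return 0.0
        x, y = point
        x1, y1, x2, y2 = truth["bbox"]  # assumed [left, top, right, bottom]
        return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0
    if kind in PARAM_KEYS:
        key = PARAM_KEYS[kind]
        return 1.0 if pred.get(key) == truth[key] else 0.0
    return 1.0  # navigate_back, navigate_home, wait: type match only


sample = 'Tap the button. <answer>{"action": "click", "point": [50, 60]}</answer>'
truth_click = {"action": "click", "bbox": [40, 40, 100, 100]}
total = format_reward(sample) + action_reward(sample, truth_click)
```

Keeping the format check separate from the action check means a well-formed but wrong answer still earns the format reward.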
