GUI-Shift Implementation

Implementation of GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning (arXiv:2505.12493).

Overview

This repo provides a complete GUI-Shift training pipeline built on TRL's native GRPOTrainer and Qwen2.5-VL.

Key Components

  • build_data.py: Constructs K-step GUI Transition dataset from AndroidControl
  • train_gui_shift.py: GRPO training with rule-based rewards (format + action correctness)
  • run_train.sh: Launch script with hyperparameters matching the paper

Workflow

  1. Build dataset: python build_data.py

    • Downloads AndroidControl (ckg/AndroidControlParsedWithImages-20k)
    • Extracts K-step state pairs (default k=1)
    • Saves 2000 transition samples as JSONL
  2. Train: bash run_train.sh

    • Loads Qwen2.5-VL-7B-Instruct
    • Freezes vision encoder + projector
    • Trains with GRPO (format_reward + action_reward)
    • Pushes to Hugging Face Hub
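The dataset-building step can be sketched roughly as follows. This is a minimal illustration, not the contents of build_data.py: it assumes each episode holds N+1 screenshots and N actions, that a sample pairs the screen at step t with the screen at step t+k, and that the label is the action taken at step t. The field names (`screenshots`, `actions`, `image_before`, `image_after`) and both helper functions are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, List


def k_step_transitions(episode: Dict[str, Any], k: int = 1) -> Iterator[Dict[str, Any]]:
    """Yield K-step transition samples from one episode.

    Assumption: actions[t] leads from screenshots[t] to screenshots[t + 1],
    so an episode with N actions has N + 1 screenshots.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    screenshots: List[str] = episode["screenshots"]
    actions: List[Dict[str, Any]] = episode["actions"]
    for t in range(len(screenshots) - k):
        yield {
            "image_before": screenshots[t],
            "image_after": screenshots[t + k],
            "action": actions[t],  # the first action of the k-step window
            "k": k,
        }


def build_jsonl_lines(episodes: List[Dict[str, Any]], k: int = 1,
                      max_samples: int = 2000) -> List[str]:
    """Serialize up to max_samples transitions, one JSON object per line."""
    lines: List[str] = []
    for episode in episodes:
        for sample in k_step_transitions(episode, k):
            if len(lines) >= max_samples:
                return lines
            lines.append(json.dumps(sample))
    return lines


# Toy episode with placeholder file names and actions.
demo_episode = {
    "screenshots": ["s0.png", "s1.png", "s2.png", "s3.png"],
    "actions": [
        {"action_type": "click", "x": 10, "y": 20},
        {"action_type": "scroll", "direction": "down"},
        {"action_type": "wait"},
    ],
}
demo_lines = build_jsonl_lines([demo_episode], k=1)
```

With k=1 every consecutive screenshot pair becomes one sample; a larger k yields fewer samples per episode because the window must fit inside the episode.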

Hyperparameters (from paper Appendix A)

Parameter                     Value
learning_rate                 1e-6
num_generations               8
num_train_epochs              4
max_prompt_length             1024
max_completion_length         256
per_device_train_batch_size   2
gradient_accumulation_steps   8
beta (KL)                     0.04
epsilon (clip)                0.2
temperature                   0.9
bf16                          true
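For reference, the values above can be collected into keyword arguments for TRL's `GRPOConfig`. This is a sketch rather than the contents of train_gui_shift.py or run_train.sh: the names `GRPO_KWARGS` and `make_grpo_config` are hypothetical, and it assumes the installed TRL version accepts every argument name listed (argument names have changed between TRL releases).

```python
from typing import Any, Dict

# Hyperparameters copied from the table above.
GRPO_KWARGS: Dict[str, Any] = {
    "learning_rate": 1e-6,
    "num_generations": 8,
    "num_train_epochs": 4,
    "max_prompt_length": 1024,
    "max_completion_length": 256,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "beta": 0.04,      # KL penalty coefficient
    "epsilon": 0.2,    # clipping range
    "temperature": 0.9,
    "bf16": True,
}


def make_grpo_config(output_dir: str):
    """Build a GRPOConfig from the table values.

    TRL is imported lazily so the dictionary can be inspected without it.
    Assumption: the installed TRL release accepts all of these names.
    """
    from trl import GRPOConfig

    return GRPOConfig(output_dir=output_dir, **GRPO_KWARGS)
```

Keeping the values in one dictionary makes it easy to compare a run against the table before launching training.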

Action Space

The unified action space covers 8 types:

  • click, long_press (with x,y coordinates)
  • scroll (with direction)
  • open_app (with app_name)
  • input_text (with text content)
  • navigate_back, navigate_home, wait
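One way to represent this action space is a small schema that maps each action type to its required parameters. The sketch below assumes actions are plain dictionaries keyed by `action_type`; that key, the parameter keys `x`, `y`, `direction`, and `text`, and the helper `is_valid_action` are illustrative assumptions (only `app_name` is named above).

```python
from typing import Any, Dict, Tuple

# Required parameter keys for each of the 8 action types.
ACTION_SCHEMA: Dict[str, Tuple[str, ...]] = {
    "click": ("x", "y"),
    "long_press": ("x", "y"),
    "scroll": ("direction",),
    "open_app": ("app_name",),
    "input_text": ("text",),
    "navigate_back": (),
    "navigate_home": (),
    "wait": (),
}


def is_valid_action(action: Any) -> bool:
    """Return True if the action has a known type and all required keys."""
    if not isinstance(action, dict):
        return False
    required = ACTION_SCHEMA.get(action.get("action_type"))
    if required is None:  # unknown or missing action type
        return False
    return all(key in action for key in required)
```

For example, `{"action_type": "scroll", "direction": "up"}` passes the check, while a `click` without coordinates does not.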

Reward Design

  • Format reward: +1 if output contains <answer>...</answer> tags
  • Action reward: +1 if parsed action matches ground truth
    • Click/long_press: point falls within ground-truth bounding box
    • Scroll/open_app/input_text: exact parameter match
    • navigate_back/navigate_home/wait: action type match only
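The two rewards can be sketched as below. This is an illustration under stated assumptions, not the code in train_gui_shift.py: it assumes the model writes a JSON object inside the `<answer>` tags, that the ground truth carries a `bbox` of `(x_min, y_min, x_max, y_max)` for click and long_press, and that parameters use the keys `direction`, `app_name`, and `text`. It scores a single completion string; reward functions passed to TRL's GRPOTrainer operate on batches of completions, so a wrapper would be needed.

```python
import json
import re
from typing import Any, Dict, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
POINT_ACTIONS = {"click", "long_press"}
PARAM_KEYS = {"scroll": "direction", "open_app": "app_name", "input_text": "text"}
TYPE_ONLY_ACTIONS = {"navigate_back", "navigate_home", "wait"}


def format_reward(completion: str) -> float:
    """+1 if the completion contains <answer>...</answer> tags."""
    return 1.0 if ANSWER_RE.search(completion) else 0.0


def parse_action(completion: str) -> Optional[Dict[str, Any]]:
    """Extract the JSON action inside the answer tags (assumed output format)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    try:
        action = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return action if isinstance(action, dict) else None


def action_reward(completion: str, truth: Dict[str, Any]) -> float:
    """+1 if the parsed action matches the ground truth under the rules above."""
    pred = parse_action(completion)
    kind = truth["action_type"]
    if pred is None or pred.get("action_type") != kind:
        return 0.0
    if kind in POINT_ACTIONS:
        x_min, y_min, x_max, y_max = truth["bbox"]
        try:
            x, y = float(pred["x"]), float(pred["y"])
        except (KeyError, TypeError, ValueError):
            return 0.0
        # The predicted point must fall inside the ground-truth box.
        return 1.0 if x_min <= x <= x_max and y_min <= y <= y_max else 0.0
    if kind in PARAM_KEYS:
        key = PARAM_KEYS[kind]
        return 1.0 if key in pred and pred[key] == truth[key] else 0.0
    # Type-only actions match on action type alone.
    return 1.0 if kind in TYPE_ONLY_ACTIONS else 0.0
```

A completion such as `<answer>{"action_type": "click", "x": 50, "y": 60}</answer>` scored against a ground-truth box of `(40, 40, 100, 100)` earns both rewards, since the tags are present and the point lies inside the box.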
