---
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

# OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

## Overview

*(Figure: OS-Genesis overview)*

We introduce OS-Genesis, an interaction-driven pipeline that synthesizes high-quality and diverse GUI agent trajectory data without human supervision. By leveraging reverse task synthesis, OS-Genesis enables effective training of GUI agents to achieve superior performance on dynamic benchmarks such as AndroidWorld and WebArena.
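At a glance, the pipeline runs in reverse: an agent first explores a GUI environment without any assigned task, instructions are then synthesized backwards from the observed interactions, and a trajectory reward model filters the results. The snippet below is a minimal, self-contained sketch of that loop; every function in it (`explore`, `annotate_low_level`, `derive_high_level_task`, `reward`) is an illustrative stub, not part of the released pipeline.

```python
# Illustrative sketch of reverse task synthesis. Stub functions stand in for
# the GUI environment and the annotation/reward models; this is NOT the
# released OS-Genesis code.
import random
from dataclasses import dataclass

@dataclass
class Transition:
    state: str        # e.g. a screenshot path or accessibility tree
    action: str       # e.g. "CLICK (576, 27)"
    next_state: str

def explore(n_steps: int) -> list[Transition]:
    """Task-free exploration: interact with the GUI and record transitions."""
    return [Transition(f"s{i}", random.choice(["CLICK", "TYPE", "SCROLL"]), f"s{i + 1}")
            for i in range(n_steps)]

def annotate_low_level(t: Transition) -> str:
    """Reverse step: describe what the recorded action accomplished."""
    return f"{t.action} to go from {t.state} to {t.next_state}"

def derive_high_level_task(steps: list[str]) -> str:
    """Reverse step: roll low-level instructions up into one high-level task."""
    return f"Complete a {len(steps)}-step workflow: {steps[0]} ... {steps[-1]}"

def reward(task: str, steps: list[str]) -> float:
    """Trajectory reward model: score quality/coherence (stubbed as random)."""
    return random.random()

trajectories = []
for _ in range(10):
    transitions = explore(n_steps=5)
    low_level = [annotate_low_level(t) for t in transitions]
    task = derive_high_level_task(low_level)
    if reward(task, low_level) > 0.5:  # keep only high-scoring trajectories
        trajectories.append({"task": task, "steps": low_level})
print(f"kept {len(trajectories)} synthesized trajectories")
```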

## Quick Start

**OS-Genesis-7B-AC** is a mobile action model fine-tuned from [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

### OS-Genesis AC Family Models

The following table gives an overview of the OS-Genesis AC family models, which are evaluated on the AndroidControl benchmark.

| Model Name | Base Model | Training Data | HF Link |
| --- | --- | --- | --- |
| OS-Genesis-4B-AC | InternVL2-4B | OS-Genesis-ac-training-data | 🤗 link |
| OS-Genesis-7B-AC | Qwen2-VL-7B-Instruct | OS-Genesis-ac-training-data | 🤗 link |
| OS-Genesis-8B-AC | InternVL2-8B | OS-Genesis-ac-training-data | 🤗 link |

### Inference Example

First, ensure that the necessary dependencies are installed:

```bash
pip install transformers
pip install qwen-vl-utils
```

To evaluate on the AndroidControl benchmark, please refer to the evaluation code.
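For orientation only, AndroidControl-style evaluation typically compares the predicted action against the ground truth step by step. The snippet below is a hypothetical sketch of such a check; the action schema, field names, and the click tolerance are assumptions for illustration, and the official evaluation code defines the real metrics.

```python
# Hypothetical step-accuracy check for AndroidControl-style evaluation.
# The action dict format and tolerance below are illustrative assumptions.
def step_matches(pred: dict, gold: dict) -> bool:
    if pred.get("action_type") != gold.get("action_type"):
        return False
    if gold["action_type"] == "click":
        # Illustrative tolerance: count a click as correct if it lands close
        # to the gold coordinates (normalized to screen size).
        return (abs(pred["x"] - gold["x"]) <= 0.14 and
                abs(pred["y"] - gold["y"]) <= 0.14)
    if gold["action_type"] == "input_text":
        return pred.get("text", "").strip() == gold.get("text", "").strip()
    return True  # actions without arguments, e.g. navigate_back

preds = [{"action_type": "click", "x": 0.55, "y": 0.10}]
golds = [{"action_type": "click", "x": 0.56, "y": 0.08}]
accuracy = sum(step_matches(p, g) for p, g in zip(preds, golds)) / len(golds)
print(f"step accuracy: {accuracy:.2%}")
```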

Inference code example:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OS-Copilot/OS-Genesis-7B-AC", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Genesis-7B-AC")

# {high_level_instruction}, {action_history}, and {a11y_tree} are placeholders;
# fill them in (e.g. with str.format) before running inference.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
            },
            {"type": "text", "text": "You are a GUI task expert, I will provide you with a high-level instruction, an action history, a screenshot with its corresponding accessibility tree.\n High-level instruction: {high_level_instruction}\n Action history: {action_history}\n Accessibility tree: {a11y_tree}\n  Please generate the low-level thought and action for the next step."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output and strip the prompt tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(output_text)
# <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
```
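The decoded output interleaves the grounded element reference with Qwen2-VL grounding tokens (`<|box_start|>...<|box_end|>`). Below is a small, hypothetical helper for pulling the referenced element and its bounding box out of such a string; the regexes assume exactly the token format shown in the example output above.

```python
import re

def parse_grounded_action(output: str):
    """Extract the referenced element and bounding box from the model output.
    Assumes the token format shown in the example above."""
    ref = re.search(r"<\|object_ref_start\|>(.*?)<\|object_ref_end\|>", output)
    box = re.search(r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>", output)
    if not box:
        return None
    x1, y1, x2, y2 = map(int, box.groups())
    return {
        "element": ref.group(1) if ref else None,
        "box": (x1, y1, x2, y2),
        # Center point, e.g. where a tap would be issued on the screenshot.
        "center": ((x1 + x2) // 2, (y1 + y2) // 2),
    }

out = "<|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>"
print(parse_grounded_action(out))
# {'element': 'language switch', 'box': (576, 12, 592, 42), 'center': (584, 27)}
```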

## Citation

If you find this repository helpful, feel free to cite our paper:

```bibtex
@article{sun2024osgenesis,
  title={OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis},
  author={Sun, Qiushi and Cheng, Kanzhi and Ding, Zichen and Jin, Chuanyang and Wang, Yian and Xu, Fangzhi and Wu, Zhenyu and Jia, Chengyou and Chen, Liheng and Liu, Zhoumianze and others},
  journal={arXiv preprint arXiv:2412.19723},
  year={2024}
}
```