QiushiSun committed (verified) · Commit 88eeddf · Parent(s): ea35f88

Update README.md

Files changed (1): README.md (+155, -3)
README.md: the previous version contained only a `license: mit` YAML front matter block; the full model card below replaces it.

---
license: apache-2.0
library_name: transformers
base_model: OpenGVLab/InternVL2-4B
pipeline_tag: image-text-to-text
---

# OS-Atlas: A Foundation Action Model For Generalist GUI Agents

<div align="center">

[\[🏠Homepage\]](https://osatlas.github.io) [\[💻Code\]](https://github.com/OS-Copilot/OS-Atlas) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2410.23218) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045) [\[🤗Data\]](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data) [\[🤗ScreenSpot-v2\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2)

</div>

## Overview
![os-atlas](https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494)

OS-Atlas provides a series of models specifically designed for GUI agents.

For GUI grounding tasks, you can use:
- [OS-Atlas-Base-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-7B)
- [OS-Atlas-Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B)

For generating single-step actions in GUI agent tasks, you can use:
- [OS-Atlas-Pro-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-7B)
- [OS-Atlas-Pro-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-4B)

## Quick Start
OS-Atlas-Base-4B is a GUI grounding model fine-tuned from [InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B).

**Note:** Our models accept images of any size as input. The model outputs normalized coordinates in the 0-1000 range, either a center point or a bounding box defined by its top-left and bottom-right corners. For visualization, remember to convert these relative coordinates back to the original image dimensions, as in the sketch below.

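As a minimal sketch of that conversion (the helper name and the example numbers below are ours, purely for illustration):

```python
def to_pixels(norm_x, norm_y, image_width, image_height):
    """Map 0-1000 relative coordinates back to pixel coordinates."""
    return round(norm_x / 1000 * image_width), round(norm_y / 1000 * image_height)

# e.g. a predicted center point (432, 113) on a 1920x1080 screenshot
print(to_pixels(432, 113, 1920, 1080))  # -> (829, 122)
```

A bounding box is converted the same way, one corner at a time.
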
### Inference Example
First, install the `transformers` library:
```
pip install transformers
```
For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).

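As a convenience, and purely as our assumption (the linked guide is authoritative), the InternVL-family remote code commonly also relies on `einops` and `timm`:

```
pip install einops timm
```
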
Then download the [example image](https://github.com/OS-Copilot/OS-Atlas/blob/main/examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png) and save it to the current directory.

Inference code example:
```python
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you want to load the model across multiple GPUs, please refer to the InternVL2 documentation linked above.
path = 'OS-Copilot/OS-Atlas-Base-4B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = "In the screenshot of this web page, please give me the coordinates of the element I want to click on according to my instructions(with point).\n\"'Champions League' link\""
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
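
The `response` above is plain text containing the predicted coordinates. Below is a rough sketch, not part of the official repo, of one way to pull the numbers out and draw them on the original screenshot; the regex and the assumption that the output is a point (2 numbers) or a box (4 numbers) in 0-1000 space are ours and may need adjusting to the format you actually observe:

```python
import re
from PIL import Image, ImageDraw

def visualize_prediction(response, image_path, out_path='prediction.png'):
    """Parse the first numbers in the model output (assumed to be 0-1000
    relative coordinates) and draw them on the original image."""
    nums = [int(n) for n in re.findall(r'\d+', response)]
    img = Image.open(image_path).convert('RGB')
    w, h = img.size
    draw = ImageDraw.Draw(img)
    if len(nums) >= 4:  # bounding box: x1, y1, x2, y2
        x1, y1, x2, y2 = [v * s // 1000 for v, s in zip(nums[:4], (w, h, w, h))]
        draw.rectangle([x1, y1, x2, y2], outline='red', width=3)
    elif len(nums) >= 2:  # center point: x, y
        x, y = nums[0] * w // 1000, nums[1] * h // 1000
        draw.ellipse([x - 5, y - 5, x + 5, y + 5], outline='red', width=3)
    img.save(out_path)

visualize_prediction(response, './web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png')
```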

## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}
```