KevinQHLin committed on
Commit
b6fbea9
1 Parent(s): 39ae05d

Update README.md

Files changed (1)
  1. README.md +97 -2
README.md CHANGED
@@ -1,4 +1,4 @@
- [Github](https://github.com/showlab/ShowUI) | [Quick Start](#quickstart)
 
  ## ⭐ Quick Start {#quickstart}
 
@@ -34,7 +34,7 @@ max_pixels = 1344*28*28
  processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
  ```
 
- 2. Load screenshot and query
  ```python
  img_url = 'web_dbd7514b-9ca3-40cd-b09a-990f7b955da1.png'
  query = "Nahant"
@@ -82,3 +82,98 @@ draw_point(img_url, click_xy, 10)
  This will visualize the grounding results like (where the red points are [x,y])
 
  ![download](https://github.com/user-attachments/assets/8fe2783d-05b6-44e6-a26c-8718d02b56cb)

+ [Github](https://github.com/showlab/ShowUI) | [Quick Start: UI-Grounding](#uigrounding) | [Quick Start: UI-Navigation](#uinavigation)
 
  ## ⭐ Quick Start {#quickstart}
 

  processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
  ```
 
+ 2. **UI Grounding** {#uigrounding}
  ```python
  img_url = 'web_dbd7514b-9ca3-40cd-b09a-990f7b955da1.png'
  query = "Nahant"

  This will visualize the grounding results like (where the red points are [x,y])
 
  ![download](https://github.com/user-attachments/assets/8fe2783d-05b6-44e6-a26c-8718d02b56cb)
+
+ 3. **UI Navigation** {#uinavigation}
+ - Set up system prompt.
+ ```python
+ _NAV_SYSTEM = """You are an assistant trained to navigate the {_APP} screen.
+ Given a task instruction, a screen observation, and an action history sequence,
+ output the next action and wait for the next observation.
+ Here is the action space:
+ {_ACTION_SPACE}
+ """
+
+ _NAV_FORMAT = """
+ Format the action as a dictionary with the following keys:
+ {'action': 'ACTION_TYPE', 'value': 'element', 'position': [x,y]}
+
+ If value or position is not applicable, set it as `None`.
+ Position might be [[x1,y1], [x2,y2]] if the action requires a start and end position.
+ Position represents the relative coordinates on the screenshot and should be scaled to a range of 0-1.
+ """
+
+ action_map = {
+ 'web': """
+ 1. `CLICK`: Click on an element, value is not applicable and the position [x,y] is required.
+ 2. `INPUT`: Type a string into an element, value is a string to type and the position [x,y] is required.
+ 3. `SELECT`: Select a value for an element, value is not applicable and the position [x,y] is required.
+ 4. `HOVER`: Hover on an element, value is not applicable and the position [x,y] is required.
+ 5. `ANSWER`: Answer the question, value is the answer and the position is not applicable.
+ 6. `ENTER`: Enter operation, value and position are not applicable.
+ 7. `SCROLL`: Scroll the screen, value is the direction to scroll and the position is not applicable.
+ 8. `SELECT_TEXT`: Select some text content, value is not applicable and position [[x1,y1], [x2,y2]] is the start and end position of the select operation.
+ 9. `COPY`: Copy the text, value is the text to copy and the position is not applicable.
+ """,
+
+ 'phone': """
+ 1. `INPUT`: Type a string into an element, value is not applicable and the position [x,y] is required.
+ 2. `SWIPE`: Swipe the screen, value is not applicable and the position [[x1,y1], [x2,y2]] is the start and end position of the swipe operation.
+ 3. `TAP`: Tap on an element, value is not applicable and the position [x,y] is required.
+ 4. `ANSWER`: Answer the question, value is the status (e.g., 'task complete') and the position is not applicable.
+ 5. `ENTER`: Enter operation, value and position are not applicable.
+ """
+ }
+
+ _NAV_USER = """{system}
+ Task: {task}
+ Observation: <|image_1|>
+ Action History: {action_history}
+ What is the next action?
+ """
+ ```
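
The added block defines `_NAV_FORMAT` and `_NAV_USER` but the inference snippet in this commit only formats `_NAV_SYSTEM`. The following is a minimal sketch of one way the three templates could be combined; it is an assumption about intended use, not something the README confirms. `build_nav_prompt` is a hypothetical helper, and the templates here are shortened stand-ins that keep the same placeholders. Note that `_NAV_FORMAT` contains literal braces, so it is concatenated rather than passed through `str.format()`.

```python
# Shortened stand-ins for the templates in the diff (same placeholders, less text).
_NAV_SYSTEM = (
    "You are an assistant trained to navigate the {_APP} screen.\n"
    "Here is the action space:\n{_ACTION_SPACE}\n"
)
# Literal braces below would raise KeyError under str.format(), so never format this one.
_NAV_FORMAT = (
    "Format the action as a dictionary with the following keys:\n"
    "{'action': 'ACTION_TYPE', 'value': 'element', 'position': [x,y]}\n"
)
_NAV_USER = (
    "{system}\nTask: {task}\nObservation: <|image_1|>\n"
    "Action History: {action_history}\nWhat is the next action?\n"
)


def build_nav_prompt(split, action_space, task, action_history):
    """Hypothetical helper: fill the system template, append the format hint, fill the user template."""
    system = _NAV_SYSTEM.format(_APP=split, _ACTION_SPACE=action_space) + _NAV_FORMAT
    # Render the history as text; "None" for an empty history is an assumption.
    history = "; ".join(str(a) for a in action_history) if action_history else "None"
    # Braces inside substituted values are not re-parsed by str.format().
    return _NAV_USER.format(system=system, task=task, action_history=history)


prompt = build_nav_prompt(
    split="web",
    action_space="1. `CLICK`: Click on an element.",
    task="Search the weather for the New York city.",
    action_history=[{"action": "CLICK", "value": None, "position": [0.49, 0.42]}],
)
print(prompt)
```

Whether the model expects the action history in this exact textual form is not stated in the commit, so treat the rendering of `action_history` as a placeholder.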
+
+ ```python
+ img_url = 'chrome.png'
+ split='web'
+ system_prompt = _NAV_SYSTEM.format(_APP=split, _ACTION_SPACE=action_map[split])
+ query = "Search the weather for the New York city."
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": system_prompt},
+             {"type": "image", "image": img_url, "min_pixels": min_pixels, "max_pixels": max_pixels},
+             {"type": "text", "text": query}
+         ],
+     }
+ ]
+
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True,
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to("cuda")
+
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )[0]
+
+ print(output_text)
+ # {'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]},
+ # {'action': 'INPUT', 'value': 'weather for New York city', 'position': [0.49, 0.42]},
+ # {'action': 'ENTER', 'value': None, 'position': None}
+ ```
+
+ ![download](https://github.com/user-attachments/assets/624097ea-06f2-4c8f-83f6-b6b9ee439c0c)
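
The sample output above is a comma-separated run of Python-style dictionaries whose `position` values are relative coordinates in the 0-1 range. As a rough sketch of how a caller might consume it, the snippet below parses that text with `ast.literal_eval` and scales positions to pixels. `parse_actions` and `to_pixels` are hypothetical helper names, the 1000x500 screenshot size is made up for illustration, and the sketch assumes the model always emits well-formed literals, which the README does not guarantee.

```python
import ast


def parse_actions(output_text):
    """Parse one or more comma-separated action dicts into a list of dicts."""
    # Wrapping in brackets turns "{...},\n{...}" into a list literal;
    # literal_eval accepts only Python literals, so no code is executed.
    parsed = ast.literal_eval(f"[{output_text.strip().rstrip(',')}]")
    return [action for action in parsed if isinstance(action, dict)]


def to_pixels(position, width, height):
    """Scale a relative [x, y] (or [[x1, y1], [x2, y2]]) position to pixel coordinates."""
    if position is None:
        return None
    if isinstance(position[0], (list, tuple)):
        # Start/end pair, e.g. for SELECT_TEXT or SWIPE.
        return [to_pixels(point, width, height) for point in position]
    x, y = position
    return (round(x * width), round(y * height))


# Sample text copied from the commented output in the diff above.
sample_output = """{'action': 'CLICK', 'value': None, 'position': [0.49, 0.42]},
{'action': 'INPUT', 'value': 'weather for New York city', 'position': [0.49, 0.42]},
{'action': 'ENTER', 'value': None, 'position': None}"""

actions = parse_actions(sample_output)
for action in actions:
    # Hypothetical screenshot size; use the real image dimensions in practice.
    print(action["action"], action["value"], to_pixels(action["position"], 1000, 500))
```

`ast.literal_eval` raises `ValueError` or `SyntaxError` on malformed text, so real code would likely wrap the call in a `try`/`except` and decide how to handle a bad generation.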