sujitvasanth committed on
Commit
1eab6e5
·
verified ·
1 Parent(s): e56822e

Upload 6 files

Browse files

minimal exllamav2 quantised model

Files changed (6)
  1. README.md +495 -0
  2. config.json +79 -0
  3. inference_example.py +153 -0
  4. merges.txt +0 -0
  5. preprocessor_config.json +18 -0
  6. tokenizer.json +0 -0
README.md ADDED
@@ -0,0 +1,495 @@
1
+ ---
2
+ base_model:
3
+ - Qwen/Qwen2.5-VL-7B-Instruct
4
+ datasets:
5
+ - xlangai/AgentNet
6
+ - xlangai/aguvis-stage1
7
+ - smolagents/aguvis-stage-2
8
+ - osunlp/UGround-V1-Data
9
+ language:
10
+ - en
11
+ license: mit
12
+ metrics:
13
+ - accuracy
14
+ - code_eval
15
+ pipeline_tag: image-text-to-text
16
+ library_name: transformers
17
+ tags:
18
+ - VLM
19
+ - Computer-Use-Agent
20
+ - OS-Agent
21
+ - GUI
22
+ - Grounding
23
+ ---
24
+
25
+ <h1 style="
26
+ font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Helvetica,Arial,sans-serif;
27
+ font-size:48px;
28
+ font-weight:700;
29
+ line-height:1.25;
30
+ text-align:center;
31
+ margin:0 0 24px;">
32
+ OpenCUA: Open Foundations for Computer-Use Agents
33
+ </h1>
34
+
35
+ <div style="
36
+ display:flex;
37
+ justify-content:center;
38
+ gap:12px;
39
+ flex-wrap:wrap;
40
+ margin-bottom:28px;">
41
+
42
+ <a href="https://opencua.xlang.ai/" style="
43
+ display:inline-block;
44
+ padding:8px 24px;
45
+ background:#2b2b2b;
46
+ color:#ffffff;
47
+ border-radius:36px;
48
+ text-decoration:none;
49
+ font-weight:600;
50
+ font-size:16px;">
51
+ 🌐 Website
52
+ </a>
53
+
54
+ <a href="https://arxiv.org/abs/2508.09123" style="
55
+ display:inline-block;
56
+ padding:8px 24px;
57
+ background:#2b2b2b;
58
+ color:#ffffff;
59
+ border-radius:36px;
60
+ text-decoration:none;
61
+ font-weight:600;
62
+ font-size:16px;">
63
+ 📝 Paper
64
+ </a>
65
+
66
+ <a href="https://github.com/xlang-ai/OpenCUA" style="
67
+ display:inline-block;
68
+ padding:8px 24px;
69
+ background:#2b2b2b;
70
+ color:#ffffff;
71
+ border-radius:36px;
72
+ text-decoration:none;
73
+ font-weight:600;
74
+ font-size:16px;">
75
+ 💻 Code
76
+ </a>
77
+ </div>
78
+
79
+ <div style="max-width:900px;margin:0 auto;">
80
+
81
+ # Introduction
82
+ <div style="
83
+ max-width: 880px;         /* adjust the overall width as needed */
84
+ margin: 0 auto;           /* center the container */
85
+ text-align: justify;      /* key: justify both edges */
86
+ text-justify: inter-word; /* improves justification for English text */
87
+ line-height: 1.6;">
88
+
89
+ OpenCUA models (OpenCUA-7B and OpenCUA-32B) are end-to-end computer-use foundation models that can produce executable actions in computer environments. They are built on the weights of Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct.
90
+ They demonstrate superior performance across CUA benchmarks. In particular, <b>OpenCUA-32B</b> achieves an average success rate of **34.8%** on [OSWorld-Verified](https://os-world.github.io/),
91
+ establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Both models also show strong grounding performance: OpenCUA-32B achieves 59.6% on [OSWorld-G](https://osworld-grounding.github.io/) and 55.3% on [ScreenSpot-Pro](https://arxiv.org/abs/2504.07981).
92
+ </div>
93
+
94
+ ### Key Features
95
+
96
+ - **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
97
+ - **Multi-OS Support**: Trained on demonstrations across Ubuntu, Windows, and macOS
98
+ - **Visual Grounding**: Strong GUI element recognition and spatial reasoning capabilities
99
+ - **Multi-Image Context**: Processes a history of up to 3 screenshots for better context understanding
100
+ - **Reflective Reasoning**: Enhanced with reflective long Chain-of-Thought that identifies errors and provides corrective reasoning
101
+
102
+
103
+ # Performance
104
+
105
+ ### Online Agent Evaluation
106
+ OpenCUA models achieve strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
107
+ OpenCUA-32B achieves the best performance among all open-source models with an average success rate of 34.8%, outperforming prior baselines by large margins.
108
+ It also narrows the gap to the proprietary Claude models.
109
+ <div align="center">
110
+
111
+ | **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
112
+ |-------------------------------|:--------:|:--------:|:---------:|
113
+ | **Proprietary** | | | |
114
+ | OpenAI CUA | 26.0 | 31.3 | 31.4 |
115
+ | Seed 1.5-VL | 27.9 | — | 34.1 |
116
+ | Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
117
+ | Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
118
+ | **Open-Source** | | | |
119
+ | Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
120
+ | Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
121
+ | Kimi-VL-A3B | 9.7 | — | 10.3 |
122
+ | UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
123
+ | UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
124
+ | OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
125
+ | **OpenCUA-32B *(Ours)*** | **29.7** | **34.1** | **34.8** |
126
+ </div>
127
+
128
+ *OpenCUA scores are the mean of 3 independent runs.*
129
+
130
+ ### GUI Grounding Performance
131
+ <div align="center">
132
+
133
+ | **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** |
134
+ |-------|-----------|---------------|----------------|
135
+ | Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
136
+ | Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
137
+ | UI-TARS-72B | 57.1 | 90.3 | 38.1 |
138
+ | **OpenCUA-A3B** | 48.6 | 91.4 | 28.5 |
139
+ | **OpenCUA-Qwen2-7B** | 45.7 | 88.5 | 23.7 |
140
+ | **OpenCUA-7B** | 55.3 | 92.3 | 50.0 |
141
+ | **OpenCUA-32B** | **59.6** | **93.4** | **55.3** |
142
+ </div>
143
+
144
+
145
+ ### AgentNetBench (Offline Evaluation)
146
+ <div align="center">
147
+
148
+ | **Model** | **Coordinate Actions** | **Content Actions** | **Function Actions** | **Average** |
149
+ |-------|-------------------|-----------------|------------------|---------|
150
+ | Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
151
+ | Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
152
+ | Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
153
+ | OpenAI CUA | 71.7 | 57.3 | **80.0** | 73.1 |
154
+ | **OpenCUA-7B** | 79.0 | 62.0 | 44.3 | 75.2 |
155
+ | **OpenCUA-32B** | **81.9** | 66.1 | 55.7 | **79.1** |
156
+ </div>
157
+
158
+ # 🚀 Quick Start
159
+ <div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
160
+ <strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B):</strong>
161
+
162
+ To align with our training infrastructure, we have modified the model in two places:
163
+ <ul style="margin-top: 8px;">
164
+ <li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.</li>
165
+ <li>2. The model uses the same tokenizer and chat template as Kimi-VL.</li>
166
+ <li>Do not use the default transformers or vLLM classes to load the model. The tokenizer and chat template must also be aligned when training the models.</li>
167
+ </ul>
168
+ </div>
169
+
170
+
171
+ ## Installation & Download
172
+
173
+ First, install the required dependencies:
174
+
175
+ ```bash
176
+ conda create -n opencua python=3.10
177
+ conda activate opencua
178
+ pip install -r requirements.txt
179
+ ```
180
+
181
+ Download the model weights from Hugging Face:
182
+ ```python
183
+ from huggingface_hub import snapshot_download
184
+ snapshot_download(
185
+ repo_id="xlangai/OpenCUA-7B",
186
+ local_dir="OpenCUA-7B",
187
+ local_dir_use_symlinks=False
188
+ )
189
+ ```
190
+
191
+ ## 🎯 GUI Grounding
192
+
193
+ The following code demonstrates how to use OpenCUA models for GUI grounding tasks:
194
+
195
+ ```python
196
+ import base64
197
+ import torch
198
+ from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
199
+ from PIL import Image
200
+ import json
201
+
202
+ def encode_image(image_path: str) -> str:
203
+ """Encode image to base64 string for model input."""
204
+ with open(image_path, "rb") as f:
205
+ return base64.b64encode(f.read()).decode()
206
+
207
+ def load_opencua_model(model_path: str):
208
+ """Load OpenCUA model, tokenizer, and image processor."""
209
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
210
+ model = AutoModel.from_pretrained(
211
+ model_path,
212
+ torch_dtype="auto",
213
+ device_map="auto",
214
+ trust_remote_code=True
215
+ )
216
+ image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
217
+
218
+ return model, tokenizer, image_processor
219
+
220
+ def create_grounding_messages(image_path: str, instruction: str):
221
+ """Create chat messages for GUI grounding task."""
222
+ system_prompt = (
223
+ "You are a GUI agent. You are given a task and a screenshot of the screen. "
224
+ "You need to perform a series of pyautogui actions to complete the task."
225
+ )
226
+
227
+ messages = [
228
+ {"role": "system", "content": system_prompt},
229
+ {
230
+ "role": "user",
231
+ "content": [
232
+ {"type": "image", "image": f"data:image/png;base64,{encode_image(image_path)}"},
233
+ {"type": "text", "text": instruction},
234
+ ],
235
+ },
236
+ ]
237
+ return messages
238
+
239
+ def run_inference(model, tokenizer, image_processor, messages, image_path):
240
+ """Run inference on the model."""
241
+ # Prepare text input
242
+ input_ids = tokenizer.apply_chat_template(
243
+ messages, tokenize=True, add_generation_prompt=True
244
+ )
245
+ input_ids = torch.tensor([input_ids]).to(model.device)
246
+
247
+ # Prepare image input
248
+ image = Image.open(image_path).convert('RGB')
249
+ image_info = image_processor.preprocess(images=[image])
250
+ pixel_values = torch.tensor(image_info['pixel_values']).to(
251
+ dtype=torch.bfloat16, device=model.device
252
+ )
253
+ grid_thws = torch.tensor(image_info['image_grid_thw'])
254
+
255
+ # Generate response
256
+ with torch.no_grad():
257
+ generated_ids = model.generate(
258
+ input_ids,
259
+ pixel_values=pixel_values,
260
+ grid_thws=grid_thws,
261
+ max_new_tokens=512,
262
+ temperature=0
263
+ )
264
+
265
+ # Decode output
266
+ prompt_len = input_ids.shape[1]
267
+ generated_ids = generated_ids[:, prompt_len:]
268
+ output_text = tokenizer.batch_decode(
269
+ generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
270
+ )[0]
271
+
272
+ return output_text
273
+
274
+ # Example usage
275
+ model_path = "OpenCUA/OpenCUA-7B" # or other model variants
276
+ image_path = "screenshot.png"
277
+ instruction = "Click on the submit button"
278
+
279
+ # Load model
280
+ model, tokenizer, image_processor = load_opencua_model(model_path)
281
+
282
+ # Create messages and run inference
283
+ messages = create_grounding_messages(image_path, instruction)
284
+ result = run_inference(model, tokenizer, image_processor, messages, image_path)
285
+
286
+ print("Model output:", result)
287
+ ```
288
+
289
+ <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
290
+ <em>Expected result:</em>
291
+ <pre><code>pyautogui.click(x=1443, y=343)</code></pre>
293
+ </div>
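+
+ To act on this prediction programmatically, you can pull the coordinates out of the generated `pyautogui` call. Below is a minimal parsing sketch; the regex and helper name are illustrative and not part of the released code. For OpenCUA-7B/32B, remember that the extracted coordinates refer to the smart-resized image and may need to be mapped back to the original resolution (see the coordinate-system notes below).
+
+ ```python
+ import re
+
+ def parse_click_coordinates(action: str):
+     """Extract (x, y) from a generated call such as 'pyautogui.click(x=1443, y=343)'."""
+     match = re.search(r"pyautogui\.\w+\(.*?x=(\d+(?:\.\d+)?)\s*,\s*y=(\d+(?:\.\d+)?)", action)
+     if match is None:
+         return None
+     return float(match.group(1)), float(match.group(2))
+
+ # Example with the expected output shown above
+ print(parse_click_coordinates("pyautogui.click(x=1443, y=343)"))  # (1443.0, 343.0)
+ ```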
294
+
295
+ You can also run the five grounding examples in [OpenCUA/model/inference/huggingface_inference.py](https://github.com/xlang-ai/OpenCUA/blob/main/model/inference/huggingface_inference.py):
296
+ ```
297
+ cd ./model/inference/
298
+ python huggingface_inference.py
299
+ ```
300
+
301
+ ## 🖥️ Computer Use Agent
302
+ **[OpenCUAAgent](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/opencua_agent.py)** is developed in the [OSWorld](https://github.com/xlang-ai/OSWorld) environment and is based on OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long CoT as its inner monologue, and predicts the next action to be executed. By default, OpenCUAAgent uses 3 screenshots of history and the L2 CoT format.
303
+
304
+ Command for running OpenCUA-7B and OpenCUA-32B in OSWorld:
305
+ ```
306
+ python run_multienv_opencua.py \
307
+ --headless \
308
+ --observation_type screenshot \
309
+ --model OpenCUA-32B \
310
+ --result_dir ./results --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
311
+ --max_steps 100 \
312
+ --num_envs 30 \
313
+ --coordinate_type qwen25
314
+ ```
315
+ <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
316
+ <em>Currently we only support Hugging Face inference. vLLM support for OpenCUA models is being implemented. Please stay tuned.</em>
317
+ </div>
318
+
319
+ ---
320
+
321
+ # AgentNet Dataset - Large-Scale Computer-Use Dataset
322
+
323
+ <div align="center">
324
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67b327cdd4665a0448eef7d5/dw5k183ucDSB2SZuS5f2V.png" width="400" alt="AgentNet Dataset Domain Distribution">
325
+ </div>
326
+
327
+ AgentNet is the first large-scale desktop computer-use agent trajectory dataset, containing 22.6K human-annotated computer-use tasks across Windows, macOS, and Ubuntu systems.
328
+
329
+ 👉 **[AgentNet Huggingface Dataset](https://huggingface.co/datasets/xlangai/AgentNet)**
330
+
331
+ Download the dataset here:
332
+ ```
333
+ pip install -U huggingface_hub
334
+ huggingface-cli download xlangai/AgentNet --repo-type dataset --local-dir ./AgentNet
335
+ ```
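+
+ Equivalently, you can fetch the dataset from Python with `snapshot_download` (mirroring the model download above):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download the AgentNet dataset repository into ./AgentNet
+ snapshot_download(
+     repo_id="xlangai/AgentNet",
+     repo_type="dataset",
+     local_dir="./AgentNet",
+ )
+ ```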
336
+
337
+ Collecting computer-use agent training data requires 3 steps:
338
+ - Demonstrate human computer-use tasks via [AgentNetTool](https://agentnet-tool.xlang.ai/);
339
+ - Preprocess the demonstrations using [Action Reduction & State-Action Matching](./data/data-processor);
340
+ - For each step, [synthesize reflective long CoT](./data/cot-generator).
341
+
342
+
343
+ ## 1 AgentNetTool – Annotation & Verification Tool
344
+ <div align="center">
345
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67b327cdd4665a0448eef7d5/ETjCOoIRR7f1YZCJ2kfiW.png" width="700" alt="AgentNet Tool">
346
+ </div>
347
+
348
+
349
+ Our **AgentNetTool** is a cross-platform GUI recorder that runs unobtrusively on annotators’ machines. It captures synchronized **screen video**, **mouse/keyboard events**, and **accessibility trees**, then provides an in-browser UI for reviewing, trimming, and submitting demonstrations. AgentNet Tool is available on Windows, macOS and Ubuntu.
350
+
351
+ 👉 **[AgentNetTool Document](https://agentnet-tool.xlang.ai/)**
352
+
353
+
354
+
355
+ ## 2 DataProcessor – Action Reduction & State–Action Matching
356
+ Raw demonstrations can contain thousands of low-level events that are too dense for model training.
357
+ The **DataProcessor** module (`./data/data-process/`) performs two key steps:
358
+
359
+ 1. **Action Reduction** — merges granular signals into concise, semantically meaningful PyAutoGUI actions (e.g., collapsing mouse moves → click, coalescing scrolls, grouping key-press sequences into text or hotkeys).
360
+ 2. **State–Action Matching** — aligns every reduced action with the *last visually distinct frame* **before** the action begins, avoiding future-information leakage and yielding compact state–action pairs.
361
+
362
+ These processed trajectories underlie all downstream training and evaluation.
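+
+ As a rough illustration of the action-reduction step above (not the actual DataProcessor code), a run of mouse-move events that ends in a click can be collapsed into the click alone, since the click already carries the final cursor position:
+
+ ```python
+ def collapse_moves_into_clicks(events):
+     """Drop runs of 'move' events that end in a 'click' (illustrative sketch only)."""
+     reduced, i = [], 0
+     while i < len(events):
+         if events[i]["type"] == "move":
+             j = i
+             while j < len(events) and events[j]["type"] == "move":
+                 j += 1
+             if j < len(events) and events[j]["type"] == "click":
+                 i = j          # skip the moves; the click keeps the final coordinates
+                 continue
+         reduced.append(events[i])
+         i += 1
+     return reduced
+
+ events = [{"type": "move", "x": 10, "y": 10},
+           {"type": "move", "x": 1443, "y": 343},
+           {"type": "click", "x": 1443, "y": 343}]
+ print(collapse_moves_into_clicks(events))  # [{'type': 'click', 'x': 1443, 'y': 343}]
+ ```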
363
+
364
+ ---
365
+
366
+ ## 3 CoTGenerator – Synthesizing Reflective Long Chain-of-Thought Inner Monologue
367
+ To boost robustness and interpretability, we augment each trajectory with **reflective long Chain-of-Thought (CoT) reasoning**.
368
+ The **CoTGenerator** pipeline (`./data/cot-generator/`) synthesizes step-level reflections that:
369
+
370
+ * reflect on the previous action,
371
+ * explain *why* an action is chosen given the current observation and history,
372
+ * note potential alternative actions, and
373
+ * forecast the expected next state.
374
+
375
+ Empirically, models trained with these rich CoTs scale better with data and generalize across unseen applications.
376
+
377
+
378
+ # Evaluation
379
+
380
+ <div align="center">
381
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67b327cdd4665a0448eef7d5/emy1QCJwQj9KqHkVmtNH2.png" width="800" alt="AgentNetBench">
382
+ </div>
383
+
384
+
385
+ **AgentNetBench** (`./AgentNetBench/`) provides a realistic offline evaluator for OS agent trajectories. It compares model-predicted low-level actions (click, moveTo, write, press, scroll, terminate, etc.) against ground-truth human actions and reports detailed metrics.
386
+
387
+ 👉 See **[AgentNetBench/README.md](./evaluation/agentnetbench/README.md)** for usage instructions.
388
+
389
+ # TODO
390
+ ## vLLM Support
391
+ We are actively working with the vLLM team to add support for OpenCUA models.
392
+
393
+ **Workaround:** For now, please use the standard transformers library as shown in the examples above. We will update this section once vLLM support becomes available.
394
+
395
+ ## Training Code
396
+ OpenCUA models were developed on the training infrastructure of the Kimi Team. We are also developing a training pipeline based on open-source infrastructure.
397
+
398
+ # Acknowledgements
399
+ <p>
400
+ We thank Su Yu, Caiming Xiong, Binyuan Hui, and the anonymous reviewers for their insightful discussions and valuable feedback.
401
+ We are grateful to Moonshot AI for providing training infrastructure and annotated data.
402
+ We also sincerely appreciate Calvin, Ziwei Chen, Jin Zhang, Ze Li, Zhengtao Wang, Yanxu Chen, and Qizheng Gu from the Kimi Team for their strong infrastructure support and helpful guidance.
403
+ The development of our tool is based on the open-source projects <a href="https://github.com/TheDuckAI/DuckTrack" target="_blank">DuckTrack</a> and <a href="https://github.com/OpenAdaptAI/OpenAdapt" target="_blank">OpenAdapt</a>.
404
+ We are very grateful for their commitment to the open-source community. Finally, we extend our deepest thanks to all annotators for their tremendous effort and contributions to this project.
405
+ </p>
406
+
407
+ # License
408
+
409
+ This project is licensed under the MIT License - see the LICENSE file in the root folder for details.
410
+
411
+ ## Research Use and Disclaimer
412
+
413
+ OpenCUA models are intended for **research and educational purposes only**.
414
+
415
+ ### Prohibited Uses
416
+ - The model may **not** be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction
417
+ - Use for illegal, unethical, or harmful activities is strictly prohibited
418
+
419
+ ### Disclaimer
420
+ - The authors, contributors, and copyright holders are **not responsible** for any illegal, unethical, or harmful use of the Software, nor for any direct or indirect damages resulting from such use
421
+ - Use of the "OpenCUA" name, logo, or trademarks does **not** imply any endorsement or affiliation unless separate written permission is obtained
422
+ - Users are solely responsible for ensuring their use complies with applicable laws and regulations
423
+
424
+ ## Important Notes on Coordinate Systems
425
+ <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
426
+ <ul style="margin: 0;">
427
+ <li><strong><code>OpenCUA/OpenCUA-A3B</code></strong> – Relative coordinates <em>(not supported in this code)</em></li>
428
+ <li><strong><code>OpenCUA/OpenCUA-Qwen2-7B</code></strong> – Relative coordinates</li>
429
+ <li><strong><code>OpenCUA/OpenCUA-7B</code></strong> – Absolute coordinates</li>
430
+ <li><strong><code>OpenCUA/OpenCUA-32B</code></strong> – Absolute coordinates</li>
431
+ </ul>
432
+ </div>
433
+
434
+ **OpenCUA models use different coordinate systems depending on the base model:**
435
+
436
+ - **OpenCUA-Qwen2-7B**: Outputs **relative coordinates** (0.0 to 1.0 range)
437
+ ```python
438
+ # Example output: pyautogui.click(x=0.5, y=0.3)
439
+ # x=0.5 means 50% from left edge, y=0.3 means 30% from top edge
440
+
441
+ # Convert to absolute coordinates:
442
+ def qwen2_relative_to_absolute(rel_x, rel_y, original_width, original_height):
443
+ abs_x = int(rel_x * original_width)
444
+ abs_y = int(rel_y * original_height)
445
+ return abs_x, abs_y
446
+ ```
447
+
448
+ - **OpenCUA-7B and OpenCUA-32B** (Qwen2.5-based): Output **absolute coordinates** after smart resize
449
+ ```python
450
+ # Example output: pyautogui.click(x=960, y=324)
451
+ # These are coordinates on the smart-resized image, not the original image
452
+
453
+ # Convert to original image coordinates:
454
+ # Please refer to the smart_resize function in: https://github.com/huggingface/transformers/blob/67ddc82fbc7e52c6f42a395b4a6d278c55b77a39/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L55
455
+ def qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height):
456
+ # First, calculate the smart-resized dimensions
457
+ resized_height, resized_width = smart_resize(original_height, original_width, factor = 28, min_pixels = 3136, max_pixels = 12845056)
458
+
459
+ # Convert model output to relative coordinates on original image
460
+ rel_x = model_x / resized_width
461
+ rel_y = model_y / resized_height
462
+
463
+ # Then convert to absolute coordinates on original image
464
+ abs_x = int(rel_x * original_width)
465
+ abs_y = int(rel_y * original_height)
466
+ return abs_x, abs_y
467
+ ```
468
+
469
+ <div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
470
+ <strong>Understanding Smart Resize for Qwen2.5-based Models:</strong>
471
+ <p style="margin: 8px 0 0;">
472
+ The Qwen2.5-VL models use a “smart resize” preprocessing that maintains aspect ratio while fitting within pixel constraints.
473
+ For coordinate conversion, you need the smart resize function from the
474
+ <a href="https://github.com/QwenLM/Qwen2.5-VL/blob/d2240f11656bfe404b9ba56db4e51cd09f522ff1/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L60">
475
+ official Qwen2.5-VL implementation</a>.
476
+ </p>
477
+ </div>
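+
+ For convenience, the sketch below reproduces the smart-resize logic as we understand it from the Qwen2-VL image processor; treat the linked upstream implementation as authoritative:
+
+ ```python
+ import math
+
+ def smart_resize(height, width, factor=28, min_pixels=3136, max_pixels=12845056):
+     """Round each side to a multiple of `factor`, then rescale so the total pixel
+     count stays within [min_pixels, max_pixels] while roughly preserving aspect ratio."""
+     h_bar = max(factor, round(height / factor) * factor)
+     w_bar = max(factor, round(width / factor) * factor)
+     if h_bar * w_bar > max_pixels:
+         beta = math.sqrt((height * width) / max_pixels)
+         h_bar = math.floor(height / beta / factor) * factor
+         w_bar = math.floor(width / beta / factor) * factor
+     elif h_bar * w_bar < min_pixels:
+         beta = math.sqrt(min_pixels / (height * width))
+         h_bar = math.ceil(height * beta / factor) * factor
+         w_bar = math.ceil(width * beta / factor) * factor
+     return h_bar, w_bar
+
+ # Example: a 1920x1080 screenshot resizes to (1092, 1932) with these bounds
+ print(smart_resize(1080, 1920))
+ ```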
478
+
479
+ ## Citation
480
+
481
+ If you use OpenCUA models in your research, please cite our work:
482
+
483
+ ```bibtex
484
+ @misc{wang2025opencuaopenfoundationscomputeruse,
485
+ title={OpenCUA: Open Foundations for Computer-Use Agents},
486
+ author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuyan Wang and Jixuan Chen and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
487
+ year={2025},
488
+ eprint={2508.09123},
489
+ archivePrefix={arXiv},
490
+ primaryClass={cs.AI},
491
+ url={https://arxiv.org/abs/2508.09123},
492
+ }
493
+ ```
494
+
495
+ </div>
config.json ADDED
@@ -0,0 +1,79 @@
1
+ {
2
+ "architectures": [
3
+ "OpenCUAForConditionalGeneration"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_opencua.OpenCUAConfig",
7
+ "AutoModel": "modeling_opencua.OpenCUAForConditionalGeneration",
8
+ "AutoModelForCausalLM": "modeling_opencua.OpenCUAForConditionalGeneration"
9
+ },
10
+ "ignore_index": -100,
11
+ "media_placeholder_token_id": 151664,
12
+ "model_type": "opencua",
13
+ "pad_token_id": 0,
14
+ "text_config": {
15
+ "bos_token_id": 151643,
16
+ "eos_token_id": 151644,
17
+ "head_dim": 128,
18
+ "hidden_act": "silu",
19
+ "hidden_size": 3584,
20
+ "initializer_range": 0.02,
21
+ "intermediate_size": 18944,
22
+ "k_proj_bias": true,
23
+ "max_length": 20,
24
+ "min_length": 0,
25
+ "model_type": "qwen2",
26
+ "num_attention_heads": 28,
27
+ "num_beam_groups": 1,
28
+ "num_beams": 1,
29
+ "num_hidden_layers": 28,
30
+ "num_key_value_heads": 4,
31
+ "pad_token_id": 152063,
32
+ "pretraining_sequence_length": 128000,
33
+ "q_proj_bias": true,
34
+ "rms_norm_eps": 1e-05,
35
+ "rope_theta": 1000000.0,
36
+ "tie_word_embeddings": false,
37
+ "torch_dtype": "bfloat16",
38
+ "use_bfloat16": false,
39
+ "use_cache": true,
40
+ "v_proj_bias": true,
41
+ "vocab_size": 152064
42
+ },
43
+ "tie_word_embeddings": false,
44
+ "torch_dtype": "bfloat16",
45
+ "transformers_version": "4.48.3",
46
+ "vision_config": {
47
+ "depth": 32,
48
+ "fullatt_block_indexes": [
49
+ 7,
50
+ 15,
51
+ 23,
52
+ 31
53
+ ],
54
+ "hidden_act": "silu",
55
+ "hidden_size": 1280,
56
+ "num_heads": 16,
57
+ "in_chans": 3,
58
+ "intermediate_size": 3420,
59
+ "patch_size": 14,
60
+ "spatial_merge_size": 2,
61
+ "spatial_patch_size": 14,
62
+ "temporal_patch_size": 2,
63
+ "out_hidden_size": 3584,
64
+ "tokens_per_second": 2,
65
+ "window_size": 112
66
+ },
67
+ "vocab_size": 152064,
68
+ "quantization_config": {
69
+ "quant_method": "exl2",
70
+ "version": "0.3.2",
71
+ "bits": 4.5,
72
+ "head_bits": 6,
73
+ "calibration": {
74
+ "rows": 100,
75
+ "length": 2048,
76
+ "dataset": "wikitext_cal_data.parquet"
77
+ }
78
+ }
79
+ }
inference_example.py ADDED
@@ -0,0 +1,153 @@
1
+ import sys, torch
2
+
3
+ # ====================================================================
4
+ # --- MONKEY-PATCH FOR OPENCUA ARCHITECTURE ---
5
+ # This block must come BEFORE any other exllamav2 imports.
6
+ # It injects our custom model profile into the exllamav2 library at runtime.
7
+ # make sure you have a version of exllamav2 that
8
+ # has exllamav2/vlm/vision_tower.py
9
+
10
+ from exllamav2.architecture import (
11
+ ExLlamaV2ArchParams,
12
+ RopeStyle,
13
+ layer_keys_llama_norms,
14
+ layer_keys_llama_attn,
15
+ layer_keys_llama_mlp,
16
+ expect_keys_llama
17
+ )
18
+
19
+ print(" -- Applying OpenCUA architecture monkey-patch for inference...")
20
+
21
+ # Store a reference to the original __init__ method
22
+ original_init = ExLlamaV2ArchParams.__init__
23
+
24
+ # Define our new, patched __init__ method
25
+ def patched_init(self, arch_string, read_config):
26
+
27
+ # --- Our Custom Logic ---
28
+ if arch_string == "OpenCUAForConditionalGeneration":
29
+
30
+ # This is our entire custom profile, verified from our debugging
31
+ arch_recognized = True
32
+
33
+ # --- Language Model settings ---
34
+ self.lm_prefix = "language_model."
35
+ self.lm.layer_keys += \
36
+ layer_keys_llama_norms + \
37
+ layer_keys_llama_attn + \
38
+ layer_keys_llama_mlp
39
+ self.lm.expect_keys += \
40
+ expect_keys_llama
41
+ self.lm.attention_bias_qkv = True
42
+ self.lm.supports_tp = True
43
+
44
+ # --- Vision Tower settings ---
45
+ self.vt_prefix = "vision_tower."
46
+ read_config["vision_config"].update({"model_type": "qwen2.5"})
47
+ self.vt.keys.update({
48
+ "fused_qkv": ".attn.qkv",
49
+ "attn_o": ".attn.proj",
50
+ "mlp_gate": ".mlp.gate_proj",
51
+ "mlp_up": ".mlp.up_proj",
52
+ "mlp_down": ".mlp.down_proj",
53
+ "norm_1": ".norm1",
54
+ "norm_2": ".norm2",
55
+ "layers": "blocks",
56
+ "patch_conv": "patch_embed.proj",
57
+ })
58
+ self.vt.mlp_gate = True
59
+ self.vt.mlp_act_func = "silu"
60
+ self.vt.norm = "rmsnorm"
61
+ self.vt.mlp_bias = True
62
+ self.vt.attention_bias_qkv = True
63
+ self.vt.attention_bias_o = True
64
+ self.vt.vision_input_norm = False
65
+ self.vt.vision_conv3d = True
66
+ self.vt.rope_style = RopeStyle.NONE
67
+ self.vt.mlp_merger = True
68
+
69
+ # --- Multi-Modal Projector/Merger settings ---
70
+ self.mmp_prefix = "vision_tower.merger."
71
+ self.mmp.keys.update({
72
+ "mlp_gate": None,
73
+ "mlp_up": "mlp.0",
74
+ "mlp_down": "mlp.2",
75
+ "norm_2": "ln_q",
76
+ })
77
+ self.mmp.mlp_gate = False
78
+ self.mmp.mlp_act_func = "gelu"
79
+ self.mmp.mlp_bias = True
80
+ self.mmp.norm = "layernorm"
81
+
82
+ # --- Fallback to Original ---
83
+ else:
84
+ # If it's not our model, call the original __init__ method
85
+ original_init(self, arch_string, read_config)
86
+
87
+ # Overwrite the class's __init__ method with our patched version
88
+ ExLlamaV2ArchParams.__init__ = patched_init
89
+ print(" -- Patch applied successfully.")
90
+
91
+ # --- END OF MONKEY-PATCH ---
92
+ # ====================================================================
93
+
94
+
95
+ # NOW we can import the rest of the library
96
+ from exllamav2 import (
97
+ ExLlamaV2,
98
+ ExLlamaV2Config,
99
+ ExLlamaV2Cache,
100
+ ExLlamaV2Tokenizer,
101
+ ExLlamaV2VisionTower,
102
+ )
103
+ from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler
104
+ from PIL import Image
105
+ import requests
106
+ import traceback
107
+
108
+ MODEL_PATH = "/home/sujit/OpenCUA-7B-exl2"
109
+ IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"
110
+
111
+ try:
112
+ print(" -- Loading model...")
113
+ config = ExLlamaV2Config(MODEL_PATH) # <-- The patch is active when this line runs
114
+ model = ExLlamaV2(config)
115
+ cache = ExLlamaV2Cache(model, lazy=True)
116
+ model.load_autosplit(cache)
117
+ tokenizer = ExLlamaV2Tokenizer(config)
118
+
119
+ print(" -- Loading vision tower...")
120
+ vision_tower = ExLlamaV2VisionTower(config)
121
+ vision_tower.load()
122
+
123
+ generator = ExLlamaV2DynamicGenerator(model, cache, tokenizer)
124
+
125
+ print(f" -- Downloading test image from: {IMAGE_URL}")
126
+ image = Image.open(requests.get(IMAGE_URL, stream=True).raw).convert("RGB")
127
+ instruction = "Describe what you see in this image in detail."
128
+
129
+ print(" -- Processing image and building prompt...")
130
+ image_embeddings = vision_tower.get_image_embeddings(model, tokenizer, image)
131
+
132
+ prompt = f"<|user|>\n{image_embeddings.text_alias}\n{instruction}<|end|>\n<|assistant|>"
133
+
134
+ print(f"\n--- Prompt Sent to Model ---\n{prompt.replace(image_embeddings.text_alias, '<image>')}\n----------------------------")
135
+ print("\n--- Model Output ---")
136
+
137
+ gen_settings = ExLlamaV2Sampler.Settings.greedy()
138
+
139
+ output = generator.generate(
140
+ prompt=prompt,
141
+ max_new_tokens=200,
142
+ add_bos=True,
143
+ embeddings=[image_embeddings],
144
+ gen_settings=gen_settings,
145
+ decode_special_tokens=True,
146
+ )
147
+
148
+ print(output)
149
+ print("\n--- Test Complete ---")
150
+
151
+ except Exception as e:
152
+ print(f"\nAn error occurred: {e}")
153
+ traceback.print_exc()
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "min_pixels": 3136,
3
+ "max_pixels": 12845056,
4
+ "patch_size": 14,
5
+ "temporal_patch_size": 2,
6
+ "merge_size": 2,
7
+ "image_mean": [
8
+ 0.48145466,
9
+ 0.4578275,
10
+ 0.40821073
11
+ ],
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "image_processor_type": "Qwen2VLImageProcessor"
18
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff