---
license: apache-2.0
datasets:
- yiye2023/GUIChat
- yiye2023/GUIEnv
- yiye2023/GUIAct
language:
- en
tags:
- GUI
- Agent
- minicpm
---

# ๐Ÿ“ฑ๐Ÿ–ฅ๏ธ GUIDance: Vision Langauge Models as Your Screen Guide

Introducing GUIDance, a model trained on GUICourse! 🎉
By leveraging extensive OCR pretraining with grounding ability, we unlock the potential of parsing-free methods for GUI agents.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f706dfe94ed998c463ed66/5d4rJFWjKn-c-iOXJKYXF.png)

# News
- 2024-07-09: 🚀 We released MiniCPM-GUIDance on Hugging Face.
- 2024-03-09: 📦 We open-sourced GUICourse: [GUIAct](https://huggingface.co/datasets/yiye2023/GUIAct), [GUIChat](https://huggingface.co/datasets/yiye2023/GUIChat), and [GUIEnv](https://huggingface.co/datasets/yiye2023/GUIEnv).

# ToDo
- [ ] Update detailed task type prompt
- [ ] Batch inference

# Example
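Install the following dependencies with pip (for example, save the list as `requirements.txt` and run `pip install -r requirements.txt`):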
```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99

flash_attn==2.4.2
```
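To confirm that the environment matches the pins above, a small check along these lines may help. It is a minimal sketch: the `PINNED` dictionary and the `check_pins` helper are hypothetical names introduced here, not part of this repo, and the comparison ignores local build suffixes such as `+cu121`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the dependency list above.
PINNED = {
    "Pillow": "10.1.0",
    "timm": "0.9.10",
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "transformers": "4.40.0",
    "sentencepiece": "0.1.99",
    "flash_attn": "2.4.2",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Drop a local build suffix such as "+cu121" before comparing.
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version.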
First, clone this Hugging Face repo with git, or download it with `huggingface-cli`:
```
git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-guidance
```
or
```
huggingface-cli download RhapsodyAI/minicpm-guidance

```
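The same download can also be done from Python with `snapshot_download` from the `huggingface_hub` package. This is a sketch of an alternative, assuming `huggingface_hub` is installed; the `download_model` wrapper is a hypothetical helper, not something this repo provides.

```python
def download_model(repo_id="RhapsodyAI/minicpm-guidance", local_dir=None):
    """Download every file in the repo and return the local directory path."""
    # Imported lazily so this module loads even when huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    # With local_dir=None the files go to the default Hugging Face cache.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    print(download_model())
```

The returned directory can be used as `MODEL_PATH` in the inference example.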
```python
from transformers import AutoProcessor, AutoTokenizer, AutoModel
from PIL import Image
import torch

MODEL_PATH = '/path/to/minicpm-guidance'

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)

# model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, attn_implementation="eager", torch_dtype=torch.bfloat16)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.bfloat16)
model.cuda().eval()

# Only batch size 1 is currently supported
example_messages = [
    [
        {
            "role": "user",
            "content": Image.open("/path/to/test.png").convert('RGB')
        },
        {
            "role": "user",
            "content": "What this is?"
        }
    ]
]

inputs = processor(example_messages, padding_side="right")

# Move every tensor in the processor output to the GPU,
# including tensors nested inside lists
for key in inputs:
    if isinstance(inputs[key], list):
        for i in range(len(inputs[key])):
            if isinstance(inputs[key][i], torch.Tensor):
                inputs[key][i] = inputs[key][i].cuda()
    if isinstance(inputs[key], torch.Tensor):
        inputs[key] = inputs[key].cuda()

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=64, do_sample=False, num_beams=3)
    # Decode every returned sequence into text
    text = tokenizer.batch_decode(outputs.cpu().tolist())

    for i in text:
        print('-'*20)
        print(i)
```
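The loop that moves the processor output to the GPU can be factored into a small helper. The sketch below assumes the processor returns a dict-like object whose values are tensors, lists of tensors, or other data, as the example suggests. `move_to_device` is a hypothetical name, and it moves anything that exposes a `.to(device)` method instead of checking for `torch.Tensor` explicitly.

```python
def move_to_device(batch, device):
    """Return a dict like `batch` with tensor-like values moved to `device`.

    A value counts as tensor-like if it has a `.to` method (as torch.Tensor does).
    Lists are handled one level deep; everything else is passed through unchanged.
    """
    moved = {}
    for key in batch:
        value = batch[key]
        if isinstance(value, list):
            moved[key] = [v.to(device) if hasattr(v, "to") else v for v in value]
        elif hasattr(value, "to"):
            moved[key] = value.to(device)
        else:
            moved[key] = value
    return moved
```

With this helper, the output of `processor(...)` could be passed through `move_to_device(..., "cuda")` before calling `model.generate`.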

# Citation
If you find our work useful, please consider citing us:
```
@misc{chen2024guicourse,
  title={GUICourse: From General Vision Language Models to Versatile GUI Agents},
  author={Wentong Chen and Junbo Cui and Jinyi Hu and Yujia Qin and Junjie Fang and Yue Zhao and Chongyi Wang and Jun Liu and Guirong Chen and Yupeng Huo and Yuan Yao and Yankai Lin and Zhiyuan Liu and Maosong Sun},
  year={2024},
  journal={arXiv preprint arXiv:2406.11317},
}
```