---
license: apache-2.0
datasets:
- yiye2023/GUIChat
- yiye2023/GUIEnv
- yiye2023/GUIAct
language:
- en
tags:
- GUI
- Agent
- minicpm
pipeline_tag: visual-question-answering
---

# πŸ“±πŸ–₯️ GUIDance: Vision Langauge Models as Your Screen Guide

Introducing MiniCPM-GUIDance (referred to as MiniCPM-GUI), a model trained on [GUICourse](https://arxiv.org/pdf/2406.11317)! 🎉

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f706dfe94ed998c463ed66/5d4rJFWjKn-c-iOXJKYXF.png)

# News
- 2024-07-09: 🚀 We released MiniCPM-GUIDance on [Hugging Face](https://huggingface.co/RhapsodyAI/minicpm-guidance).
- 2024-06-07: 📚 We released the datasets, loading code, and evaluation code on [GitHub](https://github.com/yiye3/GUICourse).
- 2024-03-09: 📦 We open-sourced the GUICourse datasets: [GUIAct](https://huggingface.co/datasets/yiye2023/GUIAct), [GUIChat](https://huggingface.co/datasets/yiye2023/GUIChat), and [GUIEnv](https://huggingface.co/datasets/yiye2023/GUIEnv).

# ToDo
- [ ] Batch inference

# CookBook
 - Prompt for Actions (without or with action history and page information; a prompt-assembly sketch follows these templates)
```
Your Task
{Task}
Generate next actions to do this task.
```
```
Actions History
{hover, select_text, click, scroll}
Information
{Information about the web}
Your Task
{TASK}
Generate next actions to do this task.
```
 - Prompt for Chat with or without Grounding
```
{Query}

OR

{Query} Grounding all objects in the image.
```
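
For concreteness, here is a minimal sketch of filling the action templates above. The helper name `build_action_prompt` and the plain newline-joined layout are assumptions for illustration; only the section labels and the closing instruction come from the templates.
```python
# Hypothetical helper for assembling the action prompt; name and layout are
# illustrative assumptions, not part of the model card.
def build_action_prompt(task, actions_history=None, web_info=None):
    parts = []
    if actions_history:
        parts += ["Actions History", ", ".join(actions_history)]
    if web_info:
        parts += ["Information", web_info]
    parts += ["Your Task", task, "Generate next actions to do this task."]
    return "\n".join(parts)

# Minimal form (no history, no page information):
print(build_action_prompt("Open the model card for minicpm-guidance"))

# With history and page information:
print(build_action_prompt(
    "Open the model card for minicpm-guidance",
    actions_history=["click", "scroll"],
    web_info="A Hugging Face search results page is shown.",
))
```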

# Example
Install all dependencies with pip:
```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99

flash_attn==2.4.2
```
First, clone this Hugging Face repo with git, or download it with `huggingface-cli`:
```
git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-guidance
```
or
```
huggingface-cli download RhapsodyAI/minicpm-guidance
```
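If you prefer to stay in Python, the same download can be done with `huggingface_hub` (installed as a dependency of `transformers`); `snapshot_download` returns the local path that can be used as `MODEL_PATH` in the example below.
```python
from huggingface_hub import snapshot_download

# Downloads the repo into the local Hugging Face cache and returns its path.
MODEL_PATH = snapshot_download("RhapsodyAI/minicpm-guidance")
print(MODEL_PATH)
```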
Example case image: ![case](https://cdn-uploads.huggingface.co/production/uploads/63f706dfe94ed998c463ed66/KJFeGDBj3SOgQqGAU7lU5.png)
```python
from transformers import AutoProcessor, AutoTokenizer, AutoModel
from PIL import Image
import torch

MODEL_PATH = '/path/to/minicpm-guidance'

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Alternative: use eager attention (e.g. if flash_attn is not installed)
# model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, attn_implementation="eager", torch_dtype=torch.bfloat16)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.bfloat16)
model.cuda().eval()

# Currently only batch size 1 is supported
example_messages = [
    [
        {
            "role": "user",
            "content": Image.open("./case.png").convert('RGB')
        },
        {
            "role": "user",
            "content": "How could I use this model from this web? Grounding all objects in the image."
        }
    ]
]

inputs = processor(example_messages, padding_side="right")

# Move every tensor in the processor output to the GPU
for key in inputs:
    if isinstance(inputs[key], list):
        for i in range(len(inputs[key])):
            if isinstance(inputs[key][i], torch.Tensor):
                inputs[key][i] = inputs[key][i].cuda()
    if isinstance(inputs[key], torch.Tensor):
        inputs[key] = inputs[key].cuda()

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, num_beams=3)
    # Decode the generated sequences back to text
    text = tokenizer.batch_decode(outputs.cpu().tolist())
    
    for i in text:
        print('-'*20)
        print(i)

'''
To use the model from this webpage, you would typically follow these steps:
1. **Access the Model**: Navigate to the section of the webpage where the model is described. In this case, it's under the heading "Use this model"<box> 864 238 964 256</box> .
2. **Download the Model**: There should be a link or button that allows you to download the model. Look for a button or link that says "Download" or something similar.
3. **Install the Model**: Once you've downloaded the model, you'll need to install it on your system. This typically involves extracting the downloaded file and placing it in a directory where the model can be found.
4. **Use the Model**: After installation, you can use the model in your application or project. This might involve importing the model into your programming environment and using it to perform specific tasks.
The exact steps would depend on the specifics of the model and the environment in which you're using it, but these are the general steps you would follow to use the model from this webpage.</s>
'''
```
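When grounding is requested, boxes show up in the output as `<box>x1 y1 x2 y2</box>`, as in the sample above. Below is a small sketch that extracts them with a regular expression; whether the coordinates are raw pixels or normalized to a fixed grid is an assumption to verify against your own outputs.
```python
import re

# Matches "<box> x1 y1 x2 y2 </box>", tolerating extra whitespace.
BOX_PATTERN = re.compile(r"<box>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*</box>")

def extract_boxes(generated_text):
    """Return all grounded boxes as (x1, y1, x2, y2) integer tuples."""
    return [tuple(map(int, m.groups())) for m in BOX_PATTERN.finditer(generated_text)]

sample = 'it\'s under the heading "Use this model"<box> 864 238 964 256</box> .'
print(extract_boxes(sample))  # [(864, 238, 964, 256)]
```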
# Contact

[Junbo Cui](mailto:cuijb2000@gmail.com)

# Citation
If you find our work useful, please consider citing us:
```
@misc{chen2024guicourse,
  title={GUICourse: From General Vision Language Models to Versatile GUI Agents},
  author={Wentong Chen and Junbo Cui and Jinyi Hu and Yujia Qin and Junjie Fang and Yue Zhao and Chongyi Wang and Jun Liu and Guirong Chen and Yupeng Huo and Yuan Yao and Yankai Lin and Zhiyuan Liu and Maosong Sun},
  year={2024},
  journal={arXiv preprint arXiv:2406.11317},
}
```