license: apache-2.0
datasets:
  - yiye2023/GUIChat
  - yiye2023/GUIEnv
  - yiye2023/GUIAct
language:
  - en
tags:
  - GUI
  - Agent
  - minicpm
pipeline_tag: visual-question-answering

📱🖥️ GUIDance: Vision Language Models as Your Screen Guide

Introducing MiniCPM-GUIDance (referred to as MiniCPM-GUI), a model trained on GUICourse! 🎉


News

  • 2024-07-09: 🚀 We released MiniCPM-GUIDance on Hugging Face.
  • 2024-06-07: 📚 We released the datasets, loading code, and evaluation code on GitHub.
  • 2024-03-09: 📦 We open-sourced GUICourse, GUIAct, GUIChat, and GUIEnv.

ToDo

[ ] Batch inference

CookBook

  • Prompt for Actions
Your Task
{Task}
Generate next actions to do this task.
Actions History
{hover, select_text, click, scroll}
Information
{Information about the web}
Your Task
{TASK}
Generate next actions to do this task.
  • Prompt for Chat w or w/o Grounding
{Query}

OR

{Query} Grounding all objects in the image.
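
The templates above can be filled programmatically. The following is a minimal sketch, not the authors' code: the helper names `build_action_prompt` and `build_chat_prompt` are hypothetical, and treating the "Actions History" and "Information" sections as optional, with the history joined by commas, is an assumption read off the template layout.

```python
from typing import Optional, Sequence

GROUNDING_SUFFIX = " Grounding all objects in the image."


def build_action_prompt(
    task: str,
    history: Optional[Sequence[str]] = None,
    information: Optional[str] = None,
) -> str:
    """Fill the action prompt; history and information are assumed optional."""
    parts = []
    if history:
        # Assumption: previous actions are listed comma-separated.
        parts += ["Actions History", ", ".join(history)]
    if information:
        parts += ["Information", information]
    parts += ["Your Task", task, "Generate next actions to do this task."]
    return "\n".join(parts)


def build_chat_prompt(query: str, grounding: bool = False) -> str:
    """Return the chat prompt, optionally asking for grounded boxes."""
    return query + (GROUNDING_SUFFIX if grounding else "")


print(build_action_prompt("Open the settings page", history=["click", "scroll"]))
print(build_chat_prompt("What does this button do?", grounding=True))
```

The returned string would be passed as the text content of a user message, alongside the screenshot.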

Example

Install all dependencies with pip:

Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99

flash_attn==2.4.2
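
To confirm that an environment matches these pins before loading the model, a small check along the following lines may help. This is a sketch using only the standard library's `importlib.metadata`. The `check_pins` helper is hypothetical, and stripping a local version suffix such as `+cu121` before comparing is an assumption about how some builds report their version.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINNED = {
    "Pillow": "10.1.0",
    "timm": "0.9.10",
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "transformers": "4.40.0",
    "sentencepiece": "0.1.99",
    "flash_attn": "2.4.2",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Drop a local build suffix such as "+cu121" before comparing.
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (wanted, found) in check_pins(PINNED).items():
    print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version.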

First, clone this Hugging Face repo with git or download it with huggingface-cli.

git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-guidance

or

huggingface-cli download RhapsodyAI/minicpm-guidance

Example case image: case

from transformers import AutoProcessor, AutoTokenizer, AutoModel
from PIL import Image
import torch

MODEL_PATH = '/path/to/minicpm-guidance'

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)

# model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, attn_implementation="eager", torch_dtype=torch.bfloat16)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.bfloat16)
model.cuda().eval()

# Only batch size 1 is currently supported
example_messages = [
    [
        {
            "role": "user",
            "content": Image.open("./case.png").convert('RGB')
        },
        {
            "role": "user",
            "content": "How could I use this model from this web? Grounding all objects in the image."
        }
    ]
]

# Preprocess the conversation into model inputs
inputs = processor(example_messages, padding_side="right")

# Move every tensor (including tensors nested in lists) to the GPU
for key in inputs:
    if isinstance(inputs[key], list):
        for i in range(len(inputs[key])):
            if isinstance(inputs[key][i], torch.Tensor):
                inputs[key][i] = inputs[key][i].cuda()
    if isinstance(inputs[key], torch.Tensor):
        inputs[key] = inputs[key].cuda()

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, num_beams=3)
    # Decode every generated sequence in the batch
    texts = tokenizer.batch_decode(outputs.cpu().tolist())

    for text in texts:
        print('-' * 20)
        print(text)

'''
To use the model from this webpage, you would typically follow these steps:
1. **Access the Model**: Navigate to the section of the webpage where the model is described. In this case, it's under the heading "Use this model"<box> 864 238 964 256</box> .
2. **Download the Model**: There should be a link or button that allows you to download the model. Look for a button or link that says "Download" or something similar.
3. **Install the Model**: Once you've downloaded the model, you'll need to install it on your system. This typically involves extracting the downloaded file and placing it in a directory where the model can be found.
4. **Use the Model**: After installation, you can use the model in your application or project. This might involve importing the model into your programming environment and using it to perform specific tasks.
The exact steps would depend on the specifics of the model and the environment in which you're using it, but these are the general steps you would follow to use the model from this webpage.</s>
'''
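
The grounded answer embeds boxes as `<box> ... </box>` spans, as in the sample output above. A minimal sketch for pulling them out of the decoded text is shown below. The `extract_boxes` helper is hypothetical, and the pattern assumes each box is exactly four whitespace-separated integers. The README does not state the coordinate convention, so the numbers are returned as-is.

```python
import re

# Assumption: every box is four whitespace-separated integers inside <box>...</box>.
BOX_PATTERN = re.compile(r"<box>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*</box>")


def extract_boxes(text):
    """Return a list of 4-tuples of ints, one per <box> span found in text."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


def strip_boxes(text):
    """Return the text with all <box> spans removed, for plain display."""
    return BOX_PATTERN.sub("", text)


sample = 'it\'s under the heading "Use this model"<box> 864 238 964 256</box> .'
print(extract_boxes(sample))  # [(864, 238, 964, 256)]
```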

Contact

Wentong Chen, Renmin University of China

Junbo Cui, Jinyi Hu, Tsinghua University

Citation

If you find our work useful, please consider citing us:

@misc{,
  title={GUICourse: From General Vision Language Models to Versatile GUI Agents},
  author={Wentong Chen and Junbo Cui and Jinyi Hu and Yujia Qin and Junjie Fang and Yue Zhao and Chongyi Wang and Jun Liu and Guirong Chen and Yupeng Huo and Yuan Yao and Yankai Lin and Zhiyuan Liu and Maosong Sun},
  year={2024},
  journal={arXiv preprint arXiv:2406.11317},
}