---
license: apache-2.0
datasets:
- yiye2023/GUIChat
- yiye2023/GUIEnv
- yiye2023/GUIAct
language:
- en
tags:
- GUI
- Agent
- minicpm
pipeline_tag: visual-question-answering
---

# 📱🖥️ GUIDance: Vision Language Models as Your Screen Guide

Introducing MiniCPM-GUIDance (referred to as MiniCPM-GUI), a model trained on [GUICourse](https://arxiv.org/pdf/2406.11317)! 🎉

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f706dfe94ed998c463ed66/5d4rJFWjKn-c-iOXJKYXF.png)

# News

- 2024-07-09: 🚀 We released MiniCPM-GUIDance on [huggingface](https://huggingface.co/RhapsodyAI/minicpm-guidance).
- 2024-06-07: 📚 We released the datasets, loading code, and evaluation code on [github](https://github.com/yiye3/GUICourse).
- 2024-03-09: 📦 We open-sourced the GUICourse datasets: [GUIAct](https://huggingface.co/datasets/yiye2023/GUIAct), [GUIChat](https://huggingface.co/datasets/yiye2023/GUIChat), and [GUIEnv](https://huggingface.co/datasets/yiye2023/GUIEnv).

# ToDo

- [ ] Batch inference

# CookBook

- Prompt for Actions

```
Your Task
{Task}
Generate next actions to do this task.
```

```
Actions History
{hover, select_text, click, scroll}
Information
{Information about the web}
Your Task
{TASK}
Generate next actions to do this task.
```

- Prompt for Chat with or without Grounding

```
{Query}

OR

{Query} Grounding all objects in the image.
```

# Example

Install the dependencies with pip:

```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
flash_attn==2.4.2
```

First, we suggest cloning this Hugging Face repo with git or downloading it with `huggingface-cli`.

```
git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-guidance
```

or

```
huggingface-cli download RhapsodyAI/minicpm-guidance
```

Example case image:

![case](https://cdn-uploads.huggingface.co/production/uploads/63f706dfe94ed998c463ed66/KJFeGDBj3SOgQqGAU7lU5.png)

```python
from transformers import AutoProcessor, AutoTokenizer, AutoModel
from PIL import Image
import torch

MODEL_PATH = '/path/to/minicpm-guidance'

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)

# model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, attn_implementation="eager", torch_dtype=torch.bfloat16)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.bfloat16)
model.cuda().eval()

# Only batch size 1 is currently supported
example_messages = [
    [
        {
            "role": "user",
            "content": Image.open("./case.png").convert('RGB')
        },
        {
            "role": "user",
            "content": "How could I use this model from this web? Grounding all objects in the image."
        }
    ]
]

inputs = processor(example_messages, padding_side="right")

# Move every tensor (including tensors nested in lists) to the GPU
for key in inputs:
    if isinstance(inputs[key], list):
        for i in range(len(inputs[key])):
            if isinstance(inputs[key][i], torch.Tensor):
                inputs[key][i] = inputs[key][i].cuda()
    elif isinstance(inputs[key], torch.Tensor):
        inputs[key] = inputs[key].cuda()

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, num_beams=3)
    text = tokenizer.batch_decode(outputs.cpu().tolist())

for i in text:
    print('-' * 20)
    print(i)

'''
To use the model from this webpage, you would typically follow these steps:
1. **Access the Model**: Navigate to the section of the webpage where the model is described. In this case, it's under the heading "Use this model" 864 238 964 256 .
2. **Download the Model**: There should be a link or button that allows you to download the model. Look for a button or link that says "Download" or something similar.
3. **Install the Model**: Once you've downloaded the model, you'll need to install it on your system. This typically involves extracting the downloaded file and placing it in a directory where the model can be found.
4. **Use the Model**: After installation, you can use the model in your application or project. This might involve importing the model into your programming environment and using it to perform specific tasks.
The exact steps would depend on the specifics of the model and the environment in which you're using it, but these are the general steps you would follow to use the model from this webpage.
'''
```

# Contact

[Junbo Cui](mailto:cuijb2000@gmail.com)

# Citation

If you find our work useful, please consider citing us:

```
@misc{,
  title={GUICourse: From General Vision Language Models to Versatile GUI Agents},
  author={Wentong Chen and Junbo Cui and Jinyi Hu and Yujia Qin and Junjie Fang and Yue Zhao and Chongyi Wang and Jun Liu and Guirong Chen and Yupeng Huo and Yuan Yao and Yankai Lin and Zhiyuan Liu and Maosong Sun},
  year={2024},
  journal={arXiv preprint arXiv:2406.11317},
}
```
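
# Appendix: Assembling CookBook Prompts

The prompt templates in the CookBook section can be filled in programmatically. The sketch below is one possible way to do that and is not part of the released code: the function names `build_action_prompt` and `build_chat_prompt` are hypothetical, and joining the fields with newlines and the action history with commas is an assumption, since the model card does not specify the exact whitespace the model expects.

```python
from typing import Optional, Sequence

# Suffix taken verbatim from the CookBook chat template.
GROUNDING_SUFFIX = "Grounding all objects in the image."


def build_action_prompt(
    task: str,
    history: Optional[Sequence[str]] = None,
    information: Optional[str] = None,
) -> str:
    """Fill the 'Prompt for Actions' template (hypothetical helper).

    Assumption: fields are separated by newlines and the action
    history is written as a comma-separated list.
    """
    parts = []
    if history:
        parts += ["Actions History", ", ".join(history)]
    if information:
        parts += ["Information", information]
    parts += ["Your Task", task, "Generate next actions to do this task."]
    return "\n".join(parts)


def build_chat_prompt(query: str, grounding: bool = False) -> str:
    """Fill the chat template, optionally asking for grounding."""
    return f"{query} {GROUNDING_SUFFIX}" if grounding else query


if __name__ == "__main__":
    print(build_action_prompt("Open the model page", history=["click", "scroll"]))
    print(build_chat_prompt("How could I use this model from this web?", grounding=True))
```

The returned string would be passed as the `"content"` of the text message in `example_messages`, next to the image message.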