license: mit
base_model: microsoft/Florence-2-base



TinyClick: Single-Turn Agent for Empowering GUI Automation

Code for running the model from the paper: TinyClick: Single-Turn Agent for Empowering GUI Automation

About The Project

We present a single-turn agent for graphical user interface (GUI) interaction tasks, built on the vision-language model Florence-2-Base. The agent's main goal is to click on the desired UI element given a screenshot and a user command. It demonstrates strong performance on Screenspot and OmniAct while maintaining a compact size of 0.27B parameters and minimal latency.

Usage

To set up the environment for running the code, please refer to the GitHub repository. All necessary libraries and dependencies are listed in the requirements.txt file.

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(
    "Samsung/TinyClick", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Samsung/TinyClick",
    trust_remote_code=True,
).to(device)

url = "https://huggingface.co/Samsung/TinyClick/resolve/main/sample.png"
img = Image.open(requests.get(url, stream=True).raw)

command = "click on accept and continue button"
image_size = img.size

input_text = ("What to do to execute the command? " + command.strip()).lower()

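# Preprocess the screenshot and prompt, then move the tensors to the model's device.
inputs = processor(
    images=img,
    text=input_text,
    return_tensors="pt",
    do_resize=True,
).to(device)

# Generate the action string; special tokens are kept so postprocessing can parse the raw output.
outputs = model.generate(**inputs)
generated_texts = processor.batch_decode(outputs, skip_special_tokens=False)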

The postprocessing function is available in our GitHub repository: https://github.com/SamsungLabs/TinyClick

from tinyclick_utils import postprocess

result = postprocess(generated_texts[0], image_size)
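
To give a rough idea of what such a postprocessing step involves, here is a minimal sketch. It is not the repository's `postprocess` function. It assumes the generated text contains Florence-2 style location tokens (`<loc_0>` to `<loc_999>`) in x, y order, and that each token index maps linearly onto the original image size. The helper name `parse_click_point`, the constant `NUM_BINS`, and the example string are all hypothetical.

```python
import re

# Assumption: coordinates appear as Florence-2 style tokens, e.g. <loc_500><loc_250>.
LOC_PAIR = re.compile(r"<loc_(\d+)><loc_(\d+)>")
NUM_BINS = 1000  # assumed number of quantization bins per axis


def parse_click_point(generated_text, image_size):
    """Return the first (x, y) pixel point found in generated_text, or None.

    Hypothetical helper for illustration only.
    image_size is (width, height), as returned by PIL's Image.size.
    """
    match = LOC_PAIR.search(generated_text)
    if match is None:
        return None
    width, height = image_size
    x_bin, y_bin = int(match.group(1)), int(match.group(2))
    # Map each bin index back to pixel coordinates on the original image.
    x = round(x_bin / NUM_BINS * width)
    y = round(y_bin / NUM_BINS * height)
    return x, y


example = "</s><s>click<loc_500><loc_250></s>"  # made-up output string
print(parse_click_point(example, (1200, 800)))
```

For real use, prefer the `postprocess` function from the repository, since the actual output format may differ from the assumptions made here.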

Citation

@misc{pawlowski2024tinyclicksingleturnagentempowering,
    title={TinyClick: Single-Turn Agent for Empowering GUI Automation}, 
    author={Pawel Pawlowski and Krystian Zawistowski and Wojciech Lapacz and Marcin Skorupa and Adam Wiacek and Sebastien Postansque and Jakub Hoscilowicz},
    year={2024},
    eprint={2410.11871},
    archivePrefix={arXiv},
    primaryClass={cs.HC},
    url={https://arxiv.org/abs/2410.11871}, 
}

License

This project is distributed under the MIT license. See LICENSE for more information.
