---
license: gpl-3.0
tags:
- ui-automation
- automation
- agents
- llm-agents
- vision
---

# Model card for PTA-Text - A *Text Only* Click Model

# Table of Contents

0. [TL;DR](#tldr)
1. [Installation](#installation)
2. [Running the model](#running-the-model)
3. [Contribution](#contribution)
4. [Citation](#citation)

# TL;DR

## Details for PTA-Text:

-> __Input__: An image with a header containing the desired UI click command.

-> __Output__: An [x, y] click position in relative coordinates, each in the 0-1 range.

__PTA-Text__ is an image encoder based on Matcha, which is an extension of Pix2Struct.

# Installation

```bash
pip install askui-ml-helper
```

Download the ".pt" checkpoint from the files in this model card, or download it from your terminal:

```bash
curl -L "https://huggingface.co/AskUI/pta-text-0.1/resolve/main/pta-text-v0.1.pt?download=true" -o pta-text-v0.1.pt
```

## Running the model

### Get the annotated image

You can run the model in full precision on CPU:

```python
import requests
from PIL import Image
from askui_ml_helper.utils.pta_text import PtaTextInference

# Load the checkpoint downloaded in the installation step
pta_text_inference = PtaTextInference("pta-text-v0.1.pt")

# Fetch a sample screenshot and convert it to RGB
url = "https://docs.askui.com/assets/images/how_askui_works_architecture-363bc8be35bd228e884c83d15acd19f7.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = 'click on the text "Operating System"'

# Draw a circle at the predicted click position and display the result
render_image = pta_text_inference.process_image_and_draw_circle(image, prompt, radius=15)
render_image.show()
# Shows the image with a red dot where the click operation is predicted
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f993a63777efc07d7f1e2ce/ZNwjdENJqn-1VpXDcm_Wg.png)

### Get the coordinates

```python
import requests
from PIL import Image
from askui_ml_helper.utils.pta_text import PtaTextInference

# Load the checkpoint downloaded in the installation step
pta_text_inference = PtaTextInference("pta-text-v0.1.pt")

# Fetch a sample screenshot and convert it to RGB
url = "https://docs.askui.com/assets/images/how_askui_works_architecture-363bc8be35bd228e884c83d15acd19f7.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = 'click on the text "Operating System"'

# Predict the click position as relative [x, y] coordinates in the 0-1 range
coordinates = pta_text_inference.process_image(image, prompt)
print(coordinates)
# Example output: [0.3981265723705292, 0.13768285512924194]
```
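
Because the model returns relative coordinates, a caller usually has to scale them to the size of the screenshot before issuing an actual click. The following is a minimal sketch of that conversion, not part of `askui_ml_helper`: the helper name `to_pixel_coordinates` and the 1920x1080 screenshot size are hypothetical, and it assumes that x is relative to the image width and y to the image height.

```python
def to_pixel_coordinates(relative_xy, image_width, image_height):
    """Scale a relative [x, y] pair (each in the 0-1 range) to integer pixel coordinates.

    Hypothetical helper: assumes x is relative to the width and y to the height.
    """
    rel_x, rel_y = relative_xy
    if not (0.0 <= rel_x <= 1.0 and 0.0 <= rel_y <= 1.0):
        raise ValueError(f"expected relative coordinates in [0, 1], got {relative_xy}")
    # Clamp to the last valid pixel index so that 1.0 does not fall outside the image.
    pixel_x = min(int(rel_x * image_width), image_width - 1)
    pixel_y = min(int(rel_y * image_height), image_height - 1)
    return pixel_x, pixel_y


# Example with the coordinates shown above and an assumed 1920x1080 screenshot.
pixel_xy = to_pixel_coordinates([0.3981265723705292, 0.13768285512924194], 1920, 1080)
print(pixel_xy)
```

With a PIL image, the width and height can be taken from `image.size`, which returns a `(width, height)` tuple.
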

# Contribution

An open-source initiative by AskUI. This model was contributed to the Hugging Face ecosystem by [Murali Manohar @ AskUI](https://huggingface.co/gitlost-murali).

# Citation

TODO