---
library_name: transformers
pipeline_tag: image-text-to-text
---

Ferret-UI is the first UI-centric multimodal large language model (MLLM) designed for referring, grounding, and reasoning tasks.
It comes in two variants, built on Gemma-2B and Llama-3-8B, and can carry out complex UI tasks.
This is the **Gemma-2B** version of Ferret-UI. It is based on [this paper](https://arxiv.org/pdf/2404.05719) by Apple.


## How to Use 🤗📱

First, download `builder.py`, `conversation.py`, `inference.py`, `model_UI.py`, and `mm_utils.py` to your working directory:

```bash
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/conversation.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/builder.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/inference.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/model_UI.py
wget https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main/mm_utils.py
```
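If `wget` is not available, the same files can be fetched from Python. The following is a minimal sketch using only the standard library; the base URL and file names are copied from the commands above, while `helper_url` and `download_helpers` are hypothetical helper names introduced here for illustration.

```python
import urllib.request
from pathlib import Path

# Base URL and file names are taken from the wget commands above.
BASE_URL = "https://huggingface.co/jadechoghari/Ferret-UI-Gemma2b/raw/main"
HELPER_FILES = [
    "conversation.py",
    "builder.py",
    "inference.py",
    "model_UI.py",
    "mm_utils.py",
]


def helper_url(filename: str) -> str:
    """Build the raw-file URL for one helper script (hypothetical helper)."""
    return f"{BASE_URL}/{filename}"


def download_helpers(target_dir: str = ".") -> list:
    """Download every helper script into target_dir, skipping files already present."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in HELPER_FILES:
        destination = target / name
        if not destination.exists():
            urllib.request.urlretrieve(helper_url(name), str(destination))
        saved.append(destination)
    return saved
```

Calling `download_helpers()` from the directory where you plan to run inference places the scripts next to your own code, so that `from inference import inference_and_run` can find them.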

### Usage:
```python
from inference import inference_and_run
image_path = "appstore_reminders.png"
prompt = "Describe the image in details"

# Call the function without a box
inference_text = inference_and_run(image_path, prompt, conv_mode="ferret_gemma_instruct", model_path="jadechoghari/Ferret-UI-Gemma2b")

# Output processed text
print("Inference Text:", inference_text)
```

```python
# Task with a bounding box
from inference import inference_and_run

image_path = "appstore_reminders.png"
prompt = "What's inside the selected region?"
box = [189, 906, 404, 970]

inference_text = inference_and_run(
    image_path=image_path, 
    prompt=prompt, 
    conv_mode="ferret_gemma_instruct", 
    model_path="jadechoghari/Ferret-UI-Gemma2b", 
    box=box
)
# You can also pass process_image=True to get the processed image back as well:
# processed_image, inference_text = inference_and_run(..., process_image=True)

print("Inference Text:", inference_text)
```

```python
# Grounding prompts
GROUNDING_TEMPLATES = [
    '\nProvide the bounding boxes of the mentioned objects.',
    '\nInclude the coordinates for each mentioned object.',
    '\nLocate the objects with their coordinates.',
    '\nAnswer in [x1, y1, x2, y2] format.',
    '\nMention the objects and their locations using the format [x1, y1, x2, y2].',
    '\nDraw boxes around the mentioned objects.',
    '\nUse boxes to show where each thing is.',
    '\nTell me where the objects are with coordinates.',
    '\nList where each object is with boxes.',
    '\nShow me the regions with boxes.',
]
```
```
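Each template starts with a newline, which suggests it is meant to be appended to the end of a prompt. The sketch below shows one way this might be done, and how `[x1, y1, x2, y2]` boxes could be pulled out of the returned text. It assumes the model writes boxes as four comma-separated integers in square brackets, which this card does not confirm; `add_grounding_suffix`, `extract_boxes`, and `GROUNDING_SUBSET` are hypothetical names introduced here.

```python
import random
import re

# Two entries copied from GROUNDING_TEMPLATES above, to keep this sketch self-contained.
GROUNDING_SUBSET = [
    '\nProvide the bounding boxes of the mentioned objects.',
    '\nAnswer in [x1, y1, x2, y2] format.',
]

# Assumption: boxes appear in the output as "[x1, y1, x2, y2]" with integer values.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def add_grounding_suffix(prompt, template=None):
    """Append a grounding template to the prompt (a random one if none is given)."""
    if template is None:
        template = random.choice(GROUNDING_SUBSET)
    return prompt + template


def extract_boxes(text):
    """Return every [x1, y1, x2, y2] group found in text as a list of four ints."""
    return [[int(value) for value in match] for match in BOX_PATTERN.findall(text)]


grounded_prompt = add_grounding_suffix(
    "Where is the search button?", GROUNDING_SUBSET[0]
)
# grounded_prompt can then be passed as `prompt` to inference_and_run(...),
# and extract_boxes(...) applied to the text it returns.
```

If the model formats coordinates differently (for example as floats or normalized values), `BOX_PATTERN` would need to be adjusted accordingly.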