---
title: SEER
emoji: πŸ‘
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.6.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
---

# SEER: Enhancing Multimodal Action Grounding with Semantic UI Parsing

## Overview

**SEER** (Semantic Extraction for Enhanced Reasoning) enhances GPT-4V's capability to interact with graphical user interfaces (GUIs) by extracting semantic meanings from UI elements and identifying interactable regions. This enables region-grounded actions, making multimodal environments more intuitive and interactive.

SEER combines vision-based UI parsing with finetuned detection and captioning models to turn GUI screenshots into structured, actionable descriptions of the interface.

---

## Features

- **Semantic Parsing**: Extracts UI elements' semantic meanings, categorizing buttons, icons, and text regions.  
- **Interactive Region Detection**: Accurately identifies clickable or interactable regions.  
- **Local Semantics Integration**: Enriches region data with descriptive functionality to improve context comprehension.  
- **Region-Grounded Actions**: Enables action generation contextualized to GUI regions.  
- **Gradio Demo**: Provides an interactive interface for testing SEER's capabilities.  

---

## Methodology Highlights

SEER is designed to decompose complex tasks into structured steps, leveraging multiple components to alleviate computational burdens on GPT-4V and enhance decision-making accuracy.

### 1. **Screen Parsing**

SEER integrates outputs from three key components to produce a structured, DOM-like representation of the UI, overlaid with bounding boxes for interactable elements:
- **Finetuned Interactable Icon Detection Model**
- **Finetuned Icon Description Model**
- **OCR Module**

This parsing process simplifies GPT-4V's tasks by focusing on semantic and functional information extraction.
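
As a rough illustration of what a DOM-like representation could look like, the sketch below combines the three component outputs into one flat list of elements. The `UIElement` class, its field names, and `build_screen_representation` are hypothetical names chosen for this example, not SEER's actual data model.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class UIElement:
    """One parsed screen element (hypothetical schema, for illustration only)."""
    element_id: int
    kind: str            # "icon" or "text"
    bbox: tuple          # (x1, y1, x2, y2) in pixels
    content: str         # icon description or OCR text
    interactable: bool


def build_screen_representation(icon_boxes, icon_descriptions, ocr_results):
    """Combine detector, captioner, and OCR outputs into one element list."""
    elements = []
    # Icons: one box from the detector, one description from the captioner.
    for bbox, description in zip(icon_boxes, icon_descriptions):
        elements.append(UIElement(len(elements), "icon", tuple(bbox), description, True))
    # Text: each OCR result is assumed to be a (bbox, text) pair.
    for bbox, text in ocr_results:
        elements.append(UIElement(len(elements), "text", tuple(bbox), text, False))
    return elements


def to_json(elements):
    """Serialize the element list so it can be embedded in a prompt or logged."""
    return json.dumps([asdict(e) for e in elements], indent=2)


if __name__ == "__main__":
    parsed = build_screen_representation(
        icon_boxes=[(0, 0, 10, 10)],
        icon_descriptions=["search"],
        ocr_results=[((20, 0, 60, 10), "Sign in")],
    )
    print(to_json(parsed))
```

Treating OCR text as non-interactable here is a simplification made for the example.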

### 2. **Interactable Region Detection**

Identifying interactable regions is a foundational step:
- A custom dataset of **67k UI screenshots** was curated, with bounding boxes derived from DOM trees of public web pages.  
- Bounding boxes from the interactable region detection and OCR modules are merged, and boxes that overlap another box by more than 90% are removed as duplicates.  
- Each region is assigned a unique ID to facilitate precise action mapping.  
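
The merge-and-number step above can be sketched as follows. This is a minimal illustration under stated assumptions: overlap is measured as intersection area divided by the smaller box's area, detector boxes take priority over OCR boxes, and the function names are hypothetical. The README does not specify how SEER computes overlap.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap_ratio(a, b):
    """Intersection area divided by the smaller box's area (an assumed metric)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min(box_area(a), box_area(b))
    return (inter_w * inter_h) / smaller if smaller > 0 else 0.0


def merge_boxes(detector_boxes, ocr_boxes, threshold=0.9):
    """Merge both box lists, drop near-duplicates, and assign unique IDs.

    Boxes are visited in order (detector first, an assumption), and a box is
    dropped when it overlaps an already kept box by more than `threshold`.
    """
    kept = []
    for box in list(detector_boxes) + list(ocr_boxes):
        if all(overlap_ratio(box, other) <= threshold for other in kept):
            kept.append(box)
    # The region ID is simply the index in the deduplicated list.
    return {region_id: box for region_id, box in enumerate(kept)}


if __name__ == "__main__":
    regions = merge_boxes(
        detector_boxes=[(0, 0, 100, 50)],
        ocr_boxes=[(2, 2, 98, 48), (200, 0, 260, 20)],
    )
    print(regions)
```

Dividing by the smaller area rather than the union (IoU) means a text box nested inside a button box counts as a duplicate, which suits the case where OCR finds the label of an already detected button.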

### 3. **Local Semantics of Functionality**

To improve understanding of UI elements:
- A dataset of **7k icon-description pairs** was curated using GPT-4o and used to finetune a BLIP-v2 model.  
- This finetuned model generates accurate descriptions of icon functionality.  
- The descriptions and detected texts are integrated into prompts alongside the UI screenshot.  

By incorporating local semantics, SEER addresses a limitation of GPT-4V: the model struggles to identify the semantic information of UI elements and predict the next action at the same time. Supplying the semantics in the prompt lets the model concentrate on action prediction.
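
To show how descriptions and detected text might be folded into a prompt, here is a small sketch of the text portion only. The prompt wording and the `build_prompt` helper are assumptions made for illustration, and attaching the screenshot itself is left out because it depends on the model API in use.

```python
def build_prompt(task, regions):
    """Build the text part of a prompt from parsed regions.

    `regions` maps a region ID to a (kind, content) pair, where content is
    either an icon description or OCR text. The layout is hypothetical.
    """
    lines = [f"Task: {task}", "Screen elements (ID: type, content):"]
    for region_id, (kind, content) in sorted(regions.items()):
        lines.append(f"  [{region_id}] {kind}: {content}")
    lines.append("Reply with the ID of the element to act on and the action to take.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "Open settings",
        {0: ("icon", "gear icon, opens settings"), 1: ("text", "Sign in")},
    ))
```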

---

## Installation

1. **Create and activate the Python environment:**
   ```bash
   conda create -n "seer" python=3.12
   conda activate seer
   pip install -r requirements.txt
   ```

2. **Download pre-trained model checkpoints:**  
   Access models from [HuggingFace](https://huggingface.co/strikerhell/SEER-model) and place them in the appropriate directories:  
   - `weights/icon_detect/`  
   - `weights/icon_caption_florence/`  
   - `weights/icon_caption_blip2/`

3. **Convert safetensors files to PyTorch format:**  
   ```bash
   python weights/convert_safetensor_to_pt.py
   ```
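
Before launching the demo, it can help to confirm that the checkpoint directories listed in step 2 exist and are not empty. The helper below is a hypothetical convenience script, not part of the repository. It checks only for the presence of files, not for specific file names.

```python
from pathlib import Path

# Directories named in the installation steps above.
EXPECTED_DIRS = [
    "weights/icon_detect",
    "weights/icon_caption_florence",
    "weights/icon_caption_blip2",
]


def missing_weight_dirs(root="."):
    """Return the expected weight directories that are absent or empty."""
    missing = []
    for relative in EXPECTED_DIRS:
        directory = Path(root) / relative
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(relative)
    return missing


if __name__ == "__main__":
    problems = missing_weight_dirs()
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("All weight directories are populated.")
```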

---

## Usage

### Run Gradio Demo

Test SEER's capabilities with the Gradio-powered demo:
```bash
python gradio_demo.py
```

### Example Notebook

Explore SEER with sample use cases provided in `demo.ipynb`.

---

## Project Details

### Pipeline Components

- **UI Parsing Model**: Decomposes screenshots into actionable semantic data.
- **Interactive Region Detection**: Recognizes clickable and functional UI areas.
- **Action Grounding Framework**: Maps detected actions to the corresponding GUI elements.
- **Local Semantics Integration**: Enriches UI parsing with descriptive labels for improved task performance.
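
As one way to picture the action grounding step, the sketch below turns a model reply that names a region ID into screen coordinates by taking the center of that region's bounding box. The action dictionary format and the `ground_action` name are assumptions for this example and may differ from what SEER does internally.

```python
def ground_action(action, regions):
    """Resolve an action that references a region ID to pixel coordinates.

    `action` is assumed to look like {"type": "click", "region_id": 1};
    `regions` maps region IDs to (x1, y1, x2, y2) bounding boxes.
    """
    region_id = action["region_id"]
    if region_id not in regions:
        raise KeyError(f"unknown region ID: {region_id}")
    x1, y1, x2, y2 = regions[region_id]
    # Target the center of the box so the action lands inside the element.
    return {"type": action["type"], "x": (x1 + x2) / 2, "y": (y1 + y2) / 2}


if __name__ == "__main__":
    regions = {0: (0, 0, 100, 50), 1: (200, 0, 260, 20)}
    print(ground_action({"type": "click", "region_id": 1}, regions))
```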

---

## Contributing

Feel free to fork the repository, make changes, and submit pull requests. Contributions are welcome!

## License

This project is licensed under the CC BY-NC-ND 4.0 License. See the [LICENSE](LICENSE) file for details.

## Author

**Pirate-Emperor**

[![Twitter](https://skillicons.dev/icons?i=twitter)](https://twitter.com/PirateKingRahul)
[![Discord](https://skillicons.dev/icons?i=discord)](https://discord.com/users/1200728704981143634)
[![LinkedIn](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/in/piratekingrahul)

[![Reddit](https://img.shields.io/badge/Reddit-FF5700?style=for-the-badge&logo=reddit&logoColor=white)](https://www.reddit.com/u/PirateKingRahul)
[![Medium](https://img.shields.io/badge/Medium-42404E?style=for-the-badge&logo=medium&logoColor=white)](https://medium.com/@piratekingrahul)

- GitHub: [Pirate-Emperor](https://github.com/Pirate-Emperor)
- Reddit: [PirateKingRahul](https://www.reddit.com/u/PirateKingRahul/)
- Twitter: [PirateKingRahul](https://twitter.com/PirateKingRahul)
- Discord: [PirateKingRahul](https://discord.com/users/1200728704981143634)
- LinkedIn: [PirateKingRahul](https://www.linkedin.com/in/piratekingrahul)
- Skype: [Join Skype](https://join.skype.com/invite/yfjOJG3wv9Ki)
- Medium: [PirateKingRahul](https://medium.com/@piratekingrahul)

Thank you for visiting the SEER project!

---