Spaces:
Build error
Build error
title: SEER | |
emoji: π | |
colorFrom: green | |
colorTo: yellow | |
sdk: gradio | |
sdk_version: 5.6.0 | |
app_file: app.py | |
pinned: false | |
license: cc-by-nc-nd-4.0 | |
# SEER: Enhancing Multimodal Action Grounding with Semantic UI Parsing | |
## Overview | |
**SEER** (Semantic Extraction for Enhanced Reasoning) enhances GPT-4V's capability to interact with graphical user interfaces (GUIs) by extracting semantic meanings from UI elements and identifying interactable regions. This enables region-grounded actions, making multimodal environments more intuitive and interactive. | |
SEER pushes the boundaries of multimodal AI by combining vision-based UI parsing and advanced machine learning techniques to create structured, actionable insights from GUI screenshots. | |
--- | |
## Features | |
- **Semantic Parsing**: Extracts UI elements' semantic meanings, categorizing buttons, icons, and text regions. | |
- **Interactive Region Detection**: Accurately identifies clickable or interactable regions. | |
- **Local Semantics Integration**: Enriches region data with descriptive functionality to improve context comprehension. | |
- **Region-Grounded Actions**: Enables action generation contextualized to GUI regions. | |
- **Gradio Demo**: Provides an interactive interface for testing SEER's capabilities. | |
--- | |
## Methodology Highlights | |
SEER is designed to decompose complex tasks into structured steps, leveraging multiple components to alleviate computational burdens on GPT-4V and enhance decision-making accuracy. | |
### 1. **Screen Parsing** | |
SEER integrates outputs from three key components to produce a structured, DOM-like representation of the UI, overlayed with bounding boxes for interactable elements: | |
- **Finetuned Interactable Icon Detection Model** | |
- **Finetuned Icon Description Model** | |
- **OCR Module** | |
This parsing process simplifies GPT-4V's tasks by focusing on semantic and functional information extraction. | |
### 2. **Interactable Region Detection** | |
Identifying interactable regions is a foundational step: | |
- A custom dataset of **67k UI screenshots** was curated, with bounding boxes derived from DOM trees of public web pages. | |
- Bounding boxes from interactable region detection and OCR modules are merged while minimizing overlaps (overlap threshold > 90%). | |
- Each region is assigned a unique ID to facilitate precise action mapping. | |
### 3. **Local Semantics of Functionality** | |
To improve understanding of UI elements: | |
- A dataset of **7k icon-description pairs** was curated using GPT-4o and used to finetune a BLIP-v2 model. | |
- This finetuned model generates accurate descriptions of icon functionality. | |
- The descriptions and detected texts are integrated into prompts alongside the UI screenshot. | |
By incorporating local semantics, SEER addresses limitations in GPT-4V's ability to simultaneously identify semantic information and predict actions. This integration significantly enhances its performance in multimodal tasks. | |
--- | |
## Installation | |
1. **Create and activate the Python environment:** | |
```bash | |
conda create -n "seer" python=3.12 | |
conda activate seer | |
pip install -r requirements.txt | |
``` | |
2. **Download pre-trained model checkpoints:** | |
Access models from [HuggingFace](https://huggingface.co/strikerhell/SEER-model) and place them in the appropriate directories: | |
- `weights/icon_detect/` | |
- `weights/icon_caption_florence/` | |
- `weights/icon_caption_blip2/` | |
3. **Convert safetensor files to PyTorch format:** | |
```bash | |
python weights/convert_safetensor_to_pt.py | |
``` | |
--- | |
## Usage | |
### Run Gradio Demo | |
Test SEER's capabilities with the Gradio-powered demo: | |
```bash | |
python gradio_demo.py | |
``` | |
### Example Notebook | |
Explore SEER with sample use cases provided in `demo.ipynb`. | |
--- | |
## Project Details | |
### Pipeline Components | |
- **UI Parsing Model**: Decomposes screenshots into actionable semantic data. | |
- **Interactive Region Detection**: Recognizes clickable and functional UI areas. | |
- **Action Grounding Framework**: Maps detected actions to the corresponding GUI elements. | |
- **Local Semantics Integration**: Enriches UI parsing with descriptive labels for improved task performance. | |
--- | |
## Contributing | |
Feel free to fork the repository, make changes, and submit pull requests. Contributions are welcome! | |
## License | |
This project is licensed under the CC 4.0 License. See the [LICENSE](LICENSE) file for details. | |
## Author | |
**Pirate-Emperor** | |
[](https://twitter.com/PirateKingRahul) | |
[](https://discord.com/users/1200728704981143634) | |
[](https://www.linkedin.com/in/piratekingrahul) | |
[](https://www.reddit.com/u/PirateKingRahul) | |
[](https://medium.com/@piratekingrahul) | |
- GitHub: [Pirate-Emperor](https://github.com/Pirate-Emperor) | |
- Reddit: [PirateKingRahul](https://www.reddit.com/u/PirateKingRahul/) | |
- Twitter: [PirateKingRahul](https://twitter.com/PirateKingRahul) | |
- Discord: [PirateKingRahul](https://discord.com/users/1200728704981143634) | |
- LinkedIn: [PirateKingRahul](https://www.linkedin.com/in/piratekingrahul) | |
- Skype: [Join Skype](https://join.skype.com/invite/yfjOJG3wv9Ki) | |
- Medium: [PirateKingRahul](https://medium.com/@piratekingrahul) | |
Thank you for visiting the SEER project! | |
--- | |