|
--- |
|
license: odc-by |
|
--- |
|
#### Model for the paper: [Harnessing Webpage UIs for Text-Rich Visual Understanding]() |
|
|
|
🌐 [Homepage](https://neulab.github.io/MultiUI/) | 🐍 [GitHub](https://github.com/neulab/multiui) | 📖 [arXiv]() |
|
|
|
## Introduction |
|
We introduce **MultiUI**, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on **MultiUI** not only excel in web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset Mind2Web, but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. |
|
|
|
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65403d8781a8731a1c09a584/vk7yT4Y7ydBOHM6BojmlI.mp4"></video> |
|
|
|
## Model Performance |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
## Contact |
|
* Junpeng Liu: jpliu@link.cuhk.edu.hk |
|
* Xiang Yue: xyue2@andrew.cmu.edu |
|
|
|
## Citation |
|
If you find this work helpful, please cite our paper: |