---
license: odc-by
datasets:
- neulab/MultiUI
language:
- en
base_model:
- Qwen/Qwen2-7B-Instruct
tags:
- GUI
- Agent
- Web
- OCR
- Doc
- VQA
---
Model for the paper: Harnessing Webpage UIs for Text-Rich Visual Understanding
🌐 Homepage | 🐍 GitHub | 📖 arXiv
Introduction
We introduce MultiUI, a dataset of 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel at web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset Mind2Web, but also generalize surprisingly well to non-web UI tasks and even to non-UI domains such as document understanding, OCR, and chart interpretation.
Training & Evaluation
Model training is based on the LLaVA-NeXT codebase.
For deployment, refer to the SGLang deployment section in the LLaVA-NeXT repo.
For benchmark evaluation, we use the lmms-eval package. See our MultiUI repo for instructions on evaluating the benchmarks reported in the paper.
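As a rough illustration, below is a minimal sketch of querying the model after it has been served through SGLang's OpenAI-compatible endpoint, following the deployment route mentioned above. The port, model name, image path, and prompt are placeholders, not values from this repo.

```python
# Minimal sketch: query a locally served MultiUI model through SGLang's
# OpenAI-compatible chat endpoint. Assumes the server was launched
# beforehand (see the SGLang deployment section in the LLaVA-NeXT repo);
# the port, model name, and screenshot path below are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Encode a local webpage screenshot as a data URL so it can be sent inline.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="default",  # placeholder; use the name the server reports
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
                {
                    "type": "text",
                    "text": "Describe the main sections of this webpage.",
                },
            ],
        }
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```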
Model Performance
Contact
- Junpeng Liu: jpliu@link.cuhk.edu.hk
- Xiang Yue: xyue2@andrew.cmu.edu
Citation
If you find this work helpful, please cite our paper:
@misc{liu2024harnessingwebpageuistextrich,
  title={Harnessing Webpage UIs for Text-Rich Visual Understanding},
  author={Junpeng Liu and Tianyue Ou and Yifan Song and Yuxiao Qu and Wai Lam and Chenyan Xiong and Wenhu Chen and Graham Neubig and Xiang Yue},
  year={2024},
  eprint={2410.13824},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.13824},
}