Update README.md
README.md
@@ -1,6 +1,71 @@
---
license: apache-2.0
language:
- en
---

# CogAgent

### Reminder: This repository hosts the [SAT (SwissArmyTransformer)](https://github.com/THUDM/SwissArmyTransformer/) version of CogAgent.

### For the Hugging Face version of CogAgent, see [https://huggingface.co/THUDM/cogagent-chat-hf](https://huggingface.co/THUDM/cogagent-chat-hf).

## Introduction

**CogAgent** is an open-source visual language model that builds on and improves **CogVLM**. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters.

📖 Paper: https://arxiv.org/abs/2312.08914

🚀 GitHub: For more information, see [our GitHub repository](https://github.com/THUDM/CogVLM/).

CogAgent demonstrates **strong performance** in image understanding and as a GUI agent:

1. CogAgent-18B **achieves state-of-the-art generalist performance on 9 cross-modal benchmarks**: VQAv2, MM-Vet, POPE, ST-VQA, OK-VQA, TextVQA, ChartQA, InfoVQA, and DocVQA.

2. CogAgent-18B significantly **surpasses existing models on GUI operation datasets**, including AITW and Mind2Web.

In addition to all the **features** already present in **CogVLM** (visual multi-round dialogue, visual grounding), **CogAgent**:

1. Supports higher-resolution visual input and dialogue question-answering, with ultra-high-resolution image inputs of **1120x1120**.

2. Possesses the capabilities of a visual agent: it can return a plan, the next action, and specific operations with coordinates for any given task on any GUI screenshot.

3. Has enhanced GUI-related question-answering capabilities, allowing it to handle questions about any GUI screenshot, such as web pages, PC apps, and mobile applications.

4. Has enhanced capabilities in OCR-related tasks through improved pre-training and fine-tuning.
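
Item 2 above mentions operations with coordinates. This README does not specify how those coordinates are written, so the following is only a minimal sketch under stated assumptions: it assumes a box of the form `[[x0,y0,x1,y1]]` with each value on a 0 to 999 grid relative to the image size, and the names `box_to_pixels` and `box_center` are hypothetical helpers, not part of CogAgent. It shows how such a box could be mapped back to pixel positions on a screenshot.

```python
import re
from typing import Tuple

# Assumed format (not specified in this README): "[[x0,y0,x1,y1]]",
# each value an integer in 0..999 relative to the image width/height.
BOX_PATTERN = re.compile(r"\[\[(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})\]\]")


def box_to_pixels(text: str, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert the first normalized box found in `text` to pixel coordinates."""
    match = BOX_PATTERN.search(text)
    if match is None:
        raise ValueError("no box of the assumed form [[x0,y0,x1,y1]] found")
    x0, y0, x1, y1 = (int(value) for value in match.groups())
    # Scale from the assumed 0..999 grid to the screenshot size.
    return (
        x0 * width // 1000,
        y0 * height // 1000,
        x1 * width // 1000,
        y1 * height // 1000,
    )


def box_center(box: Tuple[int, int, int, int]) -> Tuple[int, int]:
    """Return the center point of a pixel box, e.g. as a click target."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) // 2, (y0 + y1) // 2)
```

For a 1120x1120 screenshot, `box_to_pixels("tap [[100,200,300,400]]", 1120, 1120)` yields a pixel box whose center can be passed to whatever automation layer performs the click. The action text in that call is made up for illustration.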

<div align="center">
<img src="https://raw.githubusercontent.com/THUDM/CogVLM/master/assets/cogagent_function.jpg" alt="img" style="zoom: 50%;" />
</div>

## Quick Start

For inference and fine-tuning of the SAT version of the model, follow the instructions in the [cli-SAT section of our GitHub README](https://github.com/THUDM/CogVLM?tab=readme-ov-file#situation-21-cli-sat-version).

A single command is enough to run inference:

```bash
python cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16 --stream_chat
```
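
When the inference command above needs to be launched from a script, it can be assembled as an argument list. This is a minimal sketch rather than anything shipped with the repository: the helper name `build_inference_command` is hypothetical, and it uses only the script name and flags shown in the command above. It assumes it is run from a checkout that contains `cli_demo_sat.py`.

```python
import shlex
from typing import List


def build_inference_command(python_executable: str = "python") -> List[str]:
    """Return the argument list for the SAT inference command shown above.

    Hypothetical helper: only the script name and flags from this README are used.
    """
    return [
        python_executable,
        "cli_demo_sat.py",
        "--from_pretrained", "cogagent-chat",
        "--version", "chat",
        "--bf16",
        "--stream_chat",
    ]


if __name__ == "__main__":
    command = build_inference_command()
    # Print the command instead of running it. To launch it, pass the list to
    # subprocess.run(command, check=True) from the directory with cli_demo_sat.py.
    print(shlex.join(command))
```

Keeping the arguments as a list avoids shell quoting issues if the command is later passed to `subprocess.run`.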

## License

The code in this repository is open source under the [Apache-2.0 license](./LICENSE), while the use of CogAgent and CogVLM model weights must comply with the [Model License](./MODEL_LICENSE).

## Citation & Acknowledgements

If you find our work helpful, please consider citing the following paper:

```
@misc{hong2023cogagent,
      title={CogAgent: A Visual Language Model for GUI Agents},
      author={Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2312.08914},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

The instruction fine-tuning phase of CogVLM uses some English image-text data from the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLaVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR) and [Shikra](https://github.com/shikras/shikra) projects, as well as many classic cross-modal datasets. We sincerely thank them for their contributions.