---
license: apache-2.0
language:
- en
---
# CogAgent

## Introduction

**CogAgent** is an open-source visual language model improved based on **CogVLM**.

**CogAgent-18B** has 11 billion visual parameters and 7 billion language parameters, and achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks, including:

- VQAv2
- OK-VQA
- TextVQA
- ST-VQA
- ChartQA
- infoVQA
- DocVQA
- MM-Vet
- POPE

**CogAgent-18B** significantly surpasses existing models on GUI operation datasets such as AITW and Mind2Web.

In addition to all the features already present in **CogVLM** (visual multi-round dialogue, visual grounding), **CogAgent**:

1. Supports higher-resolution visual input and dialogue question answering, with ultra-high-resolution image inputs of **1120x1120**.

2. Possesses the capabilities of a visual agent, returning a plan, the next action, and specific operations with coordinates for any given task on any GUI screenshot (see the coordinate-parsing sketch after the figure below).

3. Provides enhanced GUI-related question-answering capabilities, handling questions about any GUI screenshot, such as web pages, PC apps, and mobile applications.

4. Delivers enhanced capabilities in OCR-related tasks through improved pre-training and fine-tuning.

<div align="center">
    <img src="https://raw.githubusercontent.com/THUDM/CogVLM/master/assets/cogagent_function.jpg" alt="img" style="zoom: 50%;" />
</div>

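As noted in point 2 above, agent-style answers include operations with coordinates, returned as plain text in the model's response. The snippet below is a minimal, hypothetical parsing sketch, not part of the official API: it assumes CogVLM-style bounding boxes written as `[[x0,y0,x1,y1]]` with coordinates normalized to 0-999, and rescales them to pixel positions on a 1120x1120 input. Verify both the regular expression and the normalization against real CogAgent outputs before relying on it.

```python
import re

# Assumption: grounded coordinates appear as [[x0,y0,x1,y1]] boxes normalized to 0-999
# (CogVLM-style grounding output). Verify against actual CogAgent responses.
BOX_PATTERN = re.compile(r"\[\[(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})\]\]")

def extract_boxes(response: str, width: int = 1120, height: int = 1120):
    """Return pixel-space (x0, y0, x1, y1) boxes parsed from a model response."""
    boxes = []
    for x0, y0, x1, y1 in BOX_PATTERN.findall(response):
        boxes.append((
            int(x0) * width // 1000,
            int(y0) * height // 1000,
            int(x1) * width // 1000,
            int(y1) * height // 1000,
        ))
    return boxes

# Hypothetical response text, for illustration only:
print(extract_boxes("Grounded Operation: tap the search box at [[070,100,930,160]]"))
```
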
## Quick Start

Use the following Python code (`cli_demo_hf.py`) to get started quickly:

```python
import argparse

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--quant", choices=[4], type=int, default=None, help='quantization bits')
parser.add_argument("--from_pretrained", type=str, default="THUDM/cogagent-chat-hf", help='pretrained ckpt')
parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--bf16", action="store_true")

args = parser.parse_args()
MODEL_PATH = args.from_pretrained
TOKENIZER_PATH = args.local_tokenizer
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = LlamaTokenizer.from_pretrained(TOKENIZER_PATH)
if args.bf16:
    torch_type = torch.bfloat16
else:
    # float16 is the default; --fp16 is accepted for symmetry with --bf16.
    torch_type = torch.float16

print("========Use torch type as:{} with device:{}========\n\n".format(torch_type, DEVICE))

if args.quant:
    # 4-bit quantized loading: the quantized weights are placed on the GPU during
    # loading, so the model is not moved with .to(DEVICE) here.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch_type,
        low_cpu_mem_usage=True,
        load_in_4bit=True,
        trust_remote_code=True
    ).eval()
else:
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch_type,
        low_cpu_mem_usage=True,
        trust_remote_code=True
    ).to(DEVICE).eval()

while True:
    image_path = input("image path >>>>> ")
    if image_path == "stop":
        break

    image = Image.open(image_path).convert('RGB')
    history = []
    while True:
        query = input("Human:")
        if query == "clear":
            break
        input_by_model = model.build_conversation_input_ids(tokenizer, query=query, history=history, images=[image])
        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(torch_type)]],
        }
        if 'cross_images' in input_by_model and input_by_model['cross_images']:
            # The high-resolution (1120x1120) cross-attention branch uses a separate image tensor.
            inputs['cross_images'] = [[input_by_model['cross_images'][0].to(DEVICE).to(torch_type)]]

        # Add any other transformers generation params here.
        # Note: with do_sample=False decoding is greedy, so temperature has no effect.
        gen_kwargs = {"max_length": 2048,
                      "temperature": 0.9,
                      "do_sample": False}
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0])
            response = response.split("</s>")[0]
            print("\nCog:", response)
        history.append((query, response))
```

Then run:

```bash
python cli_demo_hf.py --bf16
```
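
If GPU memory is limited, the same script can be launched with the 4-bit option defined in its argument parser, e.g. `python cli_demo_hf.py --quant 4 --bf16`; this takes the `load_in_4bit=True` branch above and may trade some answer quality for a smaller memory footprint.
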
For more information, such as the Web Demo and fine-tuning, please refer to [Our GitHub](https://github.com/THUDM/CogVLM/).

## License

The code in this repository is open source under the [Apache-2.0 license](./LICENSE), while the use of the CogAgent and CogVLM model weights must comply with the [Model License](./MODEL_LICENSE).

## Citation & Acknowledgements

If you find our work helpful, please consider citing the following papers:

```
@misc{hong2023cogagent,
      title={CogAgent: A Visual Language Model for GUI Agents},
      author={Wenyi Hong and Weihan Wang and Qingsong Lv and Jiazheng Xu and Wenmeng Yu and Junhui Ji and Yan Wang and Zihan Wang and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2312.08914},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{wang2023cogvlm,
      title={CogVLM: Visual Expert for Pretrained Language Models},
      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2023},
      eprint={2311.03079},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

In the instruction fine-tuning phase of CogVLM, some English image-text data from the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLAVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR) and [Shikra](https://github.com/shikras/shikra) projects were used, as well as many classic cross-modal work datasets. We sincerely thank them for their contributions.