Rodeszones
committed
Commit 14b2c3a
1 Parent(s): e79cbd4
Update README.md
README.md CHANGED
Previous version:
# CogVLM
**CogVLM** is a powerful **open-source visual language model** (**VLM**). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, and others, **surpassing or matching PaLI-X 55B**. CogVLM can also [chat with you](http://36.103.203.44:7861/) about images.
<div align="center">
<img src="https://github.com/THUDM/CogVLM/raw/main/assets/metrics-min.png" alt="img" style="zoom: 50%;" />
</div>
# Quickstart
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).

query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
image = Image.open(
inputs = model.build_conversation_input_ids(tokenizer, query=query, images=[image])
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
```
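The query above asks the model to answer with grounding boxes in the `[[x0,y0,x1,y1]]` format. As a rough illustration of how such an answer can be consumed, here is a small helper that is not part of the repository; the sample answer string is invented, and the exact output format (including whether coordinates are normalized, e.g. to a 0-1000 grid, or given in pixels) depends on the checkpoint, so inspect a real response before relying on it.

```python
import re

def extract_boxes(text):
    """Pull every [[x0,y0,x1,y1]] group out of a decoded answer string."""
    return [tuple(int(v) for v in m)
            for m in re.findall(r"\[\[(\d+),(\d+),(\d+),(\d+)\]\]", text)]

# Invented sample answer, for illustration only.
answer = "A dog [[102,355,541,892]] sits next to a red ball [[600,700,750,860]]."
print(extract_boxes(answer))  # [(102, 355, 541, 892), (600, 700, 750, 860)]
```

If the coordinates turn out to be normalized, scale them by the image width and height before drawing them on the image.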
# Method

The CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a **visual expert module**. See the [Paper](https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf) for more details.
<div align="center">
<img src="https://github.com/THUDM/CogVLM/raw/main/assets/method-min.png" style="zoom:50%;" />
</div>
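To make that data flow concrete, below is a toy sketch of how the four components fit together. It is an illustration only: the module sizes and internals are invented placeholders rather than the real CogVLM-17B layers, and only the overall flow (ViT features, an MLP adapter into the LLM embedding space, and a parallel visual expert applied to the image tokens) follows the description above.

```python
import torch
import torch.nn as nn

class ToyCogVLM(nn.Module):
    """Illustrative stand-in for the four components; not the real CogVLM-17B."""
    def __init__(self, vit_dim=32, llm_dim=64, vocab_size=100):
        super().__init__()
        self.vit_encoder = nn.Linear(vit_dim, vit_dim)       # stands in for the ViT encoder
        self.mlp_adapter = nn.Sequential(                    # projects vision features to LLM width
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.text_embed = nn.Embedding(vocab_size, llm_dim)  # part of the pretrained LLM
        self.llm_block = nn.Linear(llm_dim, llm_dim)         # original LLM weights, used for text tokens
        self.visual_expert = nn.Linear(llm_dim, llm_dim)     # parallel weights applied to image tokens
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patches, input_ids):
        img_tokens = self.mlp_adapter(self.vit_encoder(patches))  # [B, P, llm_dim]
        txt_tokens = self.text_embed(input_ids)                   # [B, T, llm_dim]
        # Image positions go through the visual expert, text positions keep the LLM weights,
        # and the concatenated sequence is decoded by the language-model head.
        hidden = torch.cat([self.visual_expert(img_tokens), self.llm_block(txt_tokens)], dim=1)
        return self.lm_head(hidden)

logits = ToyCogVLM()(torch.randn(1, 9, 32), torch.randint(0, 100, (1, 5)))
print(logits.shape)  # torch.Size([1, 14, 100])
```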
# License

The code in this repository is open-sourced under the [Apache-2.0 license](https://github.com/THUDM/CogVLM/raw/main/LICENSE), while use of the CogVLM model weights must comply with the [Model License](https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE).
Updated version:
# CogVLM
# Quickstart
```python
import torch
import requests  # only needed if you fetch the image over HTTP instead of a local path
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

model_path = 'Model/folder/path/here'

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to('cuda').eval()  # the inputs below are on CUDA, so the model must be as well

# chat example
query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
image = Image.open("your/image/path/here").convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]  # drop the prompt tokens from the output
    print(tokenizer.decode(outputs[0]))
```
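Because `build_conversation_input_ids` accepts a `history` argument, earlier turns can be fed back in for a follow-up question. The sketch below continues from the snippet above (it reuses `model`, `tokenizer`, `image`, `query`, and `outputs`) and assumes `history` is a list of `(query, response)` pairs; that format is an assumption, so check the checkpoint's remote code if it errors.

```python
import torch

# Assumption: history is a list of (query, response) pairs from earlier turns.
first_answer = tokenizer.decode(outputs[0])  # the response generated above

follow_up = 'What color is the largest object you mentioned?'
inputs = model.build_conversation_input_ids(
    tokenizer,
    query=follow_up,
    history=[(query, first_answer)],
    images=[image],
)
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
with torch.no_grad():
    out = model.generate(**inputs, max_length=2048, do_sample=False)
    print(tokenizer.decode(out[0, inputs['input_ids'].shape[1]:]))
```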
# License

The code in this repository is open-sourced under the [Apache-2.0 license](https://github.com/THUDM/CogVLM/raw/main/LICENSE), while use of the CogVLM model weights must comply with the [Model License](https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE).