Rodeszones committed on
Commit 14b2c3a
1 Parent(s): e79cbd4

Update README.md

Files changed (1)
  1. README.md +15 -21
README.md CHANGED
@@ -6,29 +6,30 @@ pipeline_tag: visual-question-answering
 
 # CogVLM
 
-**CogVLM** 是一个强大的开源视觉语言模型(VLM)。CogVLM-17B 拥有 100 亿视觉参数和 70 亿语言参数,在 10 个经典跨模态基准测试上取得了 SOTA 性能,包括 NoCaps、Flickr30K captioning、RefCOCO、RefCOCO+、RefCOCOg、Visual7W、GQA、ScienceQA、VizWiz VQA 和 TDIUC,而在 VQAv2、OKVQA、TextVQA、COCO captioning 等方面则排名第二,超越或与 PaLI-X 55B 持平。您可以通过线上 [demo](http://36.103.203.44:7861/) 体验 CogVLM 多模态对话。
-
-**CogVLM** is a powerful **open-source visual language model** (**VLM**). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. It achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30K captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., **surpassing or matching PaLI-X 55B**. CogVLM can also [chat with you](http://36.103.203.44:7861/) about images.
-
-<div align="center">
-    <img src="https://github.com/THUDM/CogVLM/raw/main/assets/metrics-min.png" alt="img" style="zoom: 50%;" />
-</div>
-# 快速开始(Quickstart)
+# Quickstart
 
 ```python
 import torch
+import requests
 from PIL import Image
 from transformers import AutoModelForCausalLM, LlamaTokenizer
+
+model_path = 'Model/folder/path/here'
+
+
 tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
 model = AutoModelForCausalLM.from_pretrained(
-    'THUDM/cogvlm-grounding-generalist-hf',
+    model_path,
     torch_dtype=torch.bfloat16,
     low_cpu_mem_usage=True,
     trust_remote_code=True
-).to('cuda').eval()
+).eval()
+
+
+# chat example
 query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
-image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/4.jpg?raw=true', stream=True).raw).convert('RGB')
-inputs = model.build_conversation_input_ids(tokenizer, query=query, images=[image])
+image = Image.open("your/image/path/here").convert('RGB')
+inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
 inputs = {
     'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
     'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
@@ -36,21 +37,14 @@ inputs = {
     'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
 }
 gen_kwargs = {"max_length": 2048, "do_sample": False}
+
 with torch.no_grad():
     outputs = model.generate(**inputs, **gen_kwargs)
     outputs = outputs[:, inputs['input_ids'].shape[1]:]
     print(tokenizer.decode(outputs[0]))
+
 ```
 
-# 方法(Method)
-
-CogVLM 模型包括四个基本组件:视觉变换器(ViT)编码器、MLP适配器、预训练的大型语言模型(GPT)和一个**视觉专家模块**。更多细节请参见 [Paper](https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf)。
-
-The CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a **visual expert module**. See the [Paper](https://github.com/THUDM/CogVLM/blob/main/assets/cogvlm-paper.pdf) for more details.
-
-<div align="center">
-    <img src="https://github.com/THUDM/CogVLM/raw/main/assets/method-min.png" style="zoom:50%;" />
-</div>
 # 许可(License)
 
 The code in this repository is open-sourced under the [Apache-2.0 license](https://github.com/THUDM/CogVLM/raw/main/LICENSE), while use of the CogVLM model weights must comply with the [Model License](https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE).
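The updated quickstart passes `history=[]` to `build_conversation_input_ids` and labels the call "chat mode", but never shows a follow-up turn. Below is a minimal, hypothetical sketch of how a second turn could reuse that argument, continuing from the quickstart's `model`, `tokenizer`, `image`, and decoded `outputs`; it assumes `history` accepts a list of `(query, response)` pairs, which this README does not confirm, so check the model's `trust_remote_code` implementation before relying on it.

```python
# Hypothetical second chat turn, continuing from the quickstart above.
# Assumption: `history` is a list of (query, response) pairs; this README only
# ever shows history=[], so verify the format against the model's remote code.
first_query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
first_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

followup = model.build_conversation_input_ids(
    tokenizer,
    query='What color is the largest object you mentioned?',
    history=[(first_query, first_response)],  # carry the previous turn as context
    images=[image],                           # same image as in the first turn
)
# `followup` would then be batched, moved to the right device, and passed to
# model.generate() exactly as in the quickstart block.
```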
 
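The Method section removed by this commit describes CogVLM as four components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model, and a visual expert module. Purely as an illustration of how those pieces connect at inference time, here is a minimal sketch; the class name, module layout, and dimensions are assumptions made for this example, not the actual CogVLM implementation (see the linked paper for the real design).

```python
import torch
import torch.nn as nn

class CogVLMSketch(nn.Module):
    """Illustrative wiring of the four components named in the Method section."""

    def __init__(self, vit: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1792, hidden_dim: int = 4096):
        super().__init__()
        self.vit = vit                                # vision transformer (ViT) encoder
        self.mlp_adapter = nn.Sequential(             # MLP adapter: vision width -> LLM width
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.language_model = language_model          # pretrained LLM; the visual expert
                                                      # module sits inside its layers

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        image_feats = self.vit(images)                # patch features from the ViT
        image_tokens = self.mlp_adapter(image_feats)  # project into the LLM embedding space
        # Image tokens are placed alongside the text embeddings before the LLM runs;
        # the visual expert then handles the image positions inside each layer.
        combined = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=combined)
```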