File size: 3,555 Bytes
e94d005 3000fff e79cbd4 79b7662 da1dd6f 9eabd7b 67159bb 7b16d06 5b9c502 7b16d06 e3c1819 7b16d06 0943e9a e79cbd4 14b2c3a ffdf229 14b2c3a e79cbd4 14b2c3a e79cbd4 14b2c3a e79cbd4 14b2c3a e79cbd4 14b2c3a e79cbd4 05ae8b6 29af4dd 8a7cd3d 14b2c3a e79cbd4 d1cc20a e79cbd4 d1cc20a e79cbd4 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 |
---
license: apache-2.0
library_name: transformers
pipeline_tag: visual-question-answering
---
# CogVLM
CogVLM Grounding generalist model quantized with bitsandbytes 4 bit precision
**CogVLM** is a powerful **open-source visual language model** (**VLM**). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and rank the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., **surpassing or matching PaLI-X 55B**. CogVLM can also [chat with you](http://36.103.203.44:7861/) about images.
<div align="center">
<img src="https://github.com/THUDM/CogVLM/raw/main/assets/metrics-min.png" alt="img" style="zoom: 50%;" />
</div>
# My env pip list
```base
pip install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 transformers==4.38.1 accelerate==0.27.2 sentencepiece==0.1.99 einops==0.7.0 xformers==0.0.24 protobuf==3.20.3 triton==2.1.0 bitsandbytes==0.43.0.dev0
```
For triton and bitsandbytes on windows use this files:
```base
pip install bitsandbytes-0.43.0.dev0-cp310-cp310-win_amd64.whl
pip install triton-2.1.0-cp310-cp310-win_amd64.whl
```
# Quickstart
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
model_path = "'local/model/folder/path/here' or 'Rodeszones/CogVLM-grounding-generalist-hf-quant4'"
tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True
).eval()
# chat example
query = 'Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?'
image = Image.open("your/image/path/here").convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image]) # chat mode
inputs = {
'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
print(tokenizer.decode(outputs[0]))
# example output
# a room with a ladder [[378,107,636,998]] and a blue and white towel [[073,000,346,905]].</s>
# NOTE: The model's squares have dimensions of 1000 by 1000, which is important to consider.
```
# (License)
The code in this repository is open source under the [Apache-2.0 license](https://github.com/THUDM/CogVLM/raw/main/LICENSE), while the use of the CogVLM model weights must comply with the [Model License](https://github.com/THUDM/CogVLM/raw/main/MODEL_LICENSE).
# (Citation)
```
@article{wang2023cogvlm,
title={CogVLM: Visual Expert for Pretrained Language Models},
author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
year={2023},
eprint={2311.03079},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
``` |