Note to others trying to run this

#2
by lzxcgnkhnrlnto - opened

The HF version still requires at least 40GB of VRAM, and my attempts so far to split it across two 3090s have failed.
There's also no requirements file, leaving you guessing which versions of pytorch, einops, transformers, and sentencepiece to use.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Yes, you are right. The Hugging Face version does not support model parallelism, and we suggest using the official SAT version: https://github.com/THUDM/CogVLM

If you have the time, consider checking this issue, as it is the primary one keeping dual-GPU users from running CogVLM on WSL2: https://github.com/THUDM/CogVLM/issues/56

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

It seems like a problem with WSL2 and torch multi-GPU support... I'm not sure, sorry.
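Not a fix, but a minimal sanity check like the following (a sketch, assuming a CUDA build of torch) can at least confirm whether PyTorch sees both GPUs under WSL2 before digging further:

import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
# NCCL is the backend torch uses for multi-GPU communication and has historically been fragile under WSL2
print("NCCL available:", torch.distributed.is_nccl_available())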

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org
edited Nov 21, 2023

If you have two 24GB devices, you can use accelerate to dispatch the model as demonstrated below. Note that the load_checkpoint_and_dispatch function does not seem to support remote Hugging Face model paths like 'THUDM/cogvlm-chat-hf'; the local path to the model checkpoint is needed. I have personally tested this code on my own device and observed that peak GPU usage reached approximately 22GB.

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')

# build the model structure without allocating real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        'THUDM/cogvlm-chat-hf',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )

# split the model across both GPUs (plus CPU overflow), keeping each decoder layer on a single device
device_map = infer_auto_device_map(
    model,
    max_memory={0: '20GiB', 1: '20GiB', 'cpu': '16GiB'},
    no_split_module_classes=['CogVLMDecoderLayer'],
)
model = load_checkpoint_and_dispatch(
    model,
    'local/path/to/hf/version/chat/model',   # typically '~/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/balabala'
    device_map=device_map,
)
model = model.eval()

# optionally check which device each weight landed on
for n, p in model.named_parameters():
    print(f"{n}: {p.device}")

# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]   # strip the prompt tokens from the generated sequence
    print(tokenizer.decode(outputs[0]))
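
If you would rather not hunt for the snapshot folder inside the Hugging Face cache by hand, huggingface_hub (which transformers already depends on) can resolve the local path for you. A small sketch reusing the model and device_map from the code above:

from huggingface_hub import snapshot_download

# downloads the checkpoint if it is not cached yet and returns the local snapshot directory,
# which can then be passed to load_checkpoint_and_dispatch instead of a hand-written path
local_ckpt_path = snapshot_download('THUDM/cogvlm-chat-hf')
model = load_checkpoint_and_dispatch(model, local_ckpt_path, device_map=device_map)
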
Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Also, thanks for the reminder. The requirements have been added to the README.

This works in WSL2 with two GPUs, thank you!
CogVLM is the best captioner out there, and finally getting it to run is a great relief.
(And I see you've already added this as an example, great work ^^)

chenkq changed discussion status to closed

Has anyone tried to deploy CogVLM (4-bit quantization) on multiple GPUs with accelerate?

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org
edited Dec 11, 2023

@2thousand see if this can help

Thanks, I just figured it out. We can directly add device_map="auto" in AutoModelForCausalLM.from_pretrained():

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
# load_in_4bit quantizes the weights with bitsandbytes; device_map="auto" lets
# accelerate spread the layers across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
        'THUDM/cogvlm-chat-hf',
        load_in_4bit=True,
        trust_remote_code=True,
        device_map="auto"
    ).eval()
query = 'Describe this image in detail.'
image = Image.open('image-path').convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.float16)]],   # note float16 here, not bfloat16 as in the full-precision example
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
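
If the "auto" split packs too much onto the first GPU, from_pretrained also accepts a max_memory cap per device alongside device_map="auto". A minimal sketch of that variant (the 20GiB/16GiB figures are placeholders, not tested values):

model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    load_in_4bit=True,
    trust_remote_code=True,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "16GiB"},  # per-device budget used by the auto split
).eval()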

Can someone create a web demo version of this? I tried adapting the CogVLM web demo using the accelerate code above to allow multi-GPU support in WSL2, but couldn't get it to work.
Has anyone gotten a Gradio UI version of CogVLM working in WSL2?
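For what it's worth, a bare-bones Gradio wrapper around the snippets above might look like the following. This is only a sketch, not a tested demo: it assumes model and tokenizer have already been loaded as shown earlier (either the accelerate-dispatched bfloat16 model or the 4-bit one), and that gradio is installed.

import gradio as gr
import torch

def caption(image, query):
    # reuse the model/tokenizer loaded above; build the prompt exactly as in the chat example
    image = image.convert('RGB')
    inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])
    inputs = {
        'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
        'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
        'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
        'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],  # use torch.float16 instead if you loaded the 4-bit model
    }
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=2048, do_sample=False)
        outputs = outputs[:, inputs['input_ids'].shape[1]:]
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(
    fn=caption,
    inputs=[gr.Image(type="pil"), gr.Textbox(value="Describe this image")],
    outputs=gr.Textbox(),
).launch()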
