How to run fp16?

#5
by grim3000 - opened

Hi, I'm trying to run the model using 4x T4 GPUs (16 GB each), but I'm encountering the error: bf16 is only supported on A100+ GPUs

While I wait for a quota increase to access A100s or 2x A10s, I'm curious how this model can be run with fp16 instead. I've seen some mentions of this in other discussions and on the GitHub repo, but no clear examples.

Separately, are there any plans to update this repo so that the model can be easily deployed on HF Inference Endpoints? At the moment, it seems to require setting up a custom handler, among other things.
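
For context, a custom handler for Inference Endpoints is a handler.py in the repo exposing an EndpointHandler class with __init__ and __call__. A minimal sketch for this model might look like the following (the request fields "inputs" and "image_url" are my own assumptions, not an official schema, and this is untested):

# handler.py — minimal sketch of a custom handler for HF Inference Endpoints
import torch
import requests
from io import BytesIO
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer


class EndpointHandler:
    def __init__(self, path=""):
        # load tokenizer and fp16 model once at endpoint startup
        self.tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
        self.model = AutoModelForCausalLM.from_pretrained(
            path or 'THUDM/cogvlm-chat-hf',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
        ).to('cuda').eval()

    def __call__(self, data):
        # request fields are assumptions for this sketch
        query = data.get("inputs", "Describe this image")
        image_url = data["image_url"]
        image = Image.open(BytesIO(requests.get(image_url).content)).convert('RGB')

        inputs = self.model.build_conversation_input_ids(
            self.tokenizer, query=query, history=[], images=[image])
        inputs = {
            'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
            'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
            'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
            'images': [[inputs['images'][0].to('cuda').to(torch.float16)]],
        }
        with torch.no_grad():
            outputs = self.model.generate(**inputs, max_length=2048, do_sample=False)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
        return [{"generated_text": self.tokenizer.decode(outputs[0])}]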

Any help is appreciated!

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Just change all torch.bfloat16 to torch.float16 in the example:

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# CogVLM uses the Vicuna v1.5 tokenizer
tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')

# load the weights in float16 instead of bfloat16
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to('cuda').eval()

# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    # cast the image tensor to float16 as well
    'images': [[inputs['images'][0].to('cuda').to(torch.float16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # keep only the newly generated tokens, dropping the prompt
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
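
Note that the fp16 weights of CogVLM-17B come to roughly 34 GB, which will not fit on a single 16 GB T4. One option worth trying on the 4x T4 setup is to let accelerate shard the model across the cards with device_map='auto' instead of .to('cuda'). This is only a sketch; whether CogVLM's custom modeling code splits cleanly across devices has not been verified here.

import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')

# device_map='auto' asks accelerate to spread the fp16 layers over all visible
# GPUs (here the four T4s) instead of loading everything onto one device.
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto',
).eval()

# build_conversation_input_ids is used exactly as in the example above; the
# input tensors should then be moved to the device of the first shard,
# typically 'cuda:0', before calling model.generate.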
chenkq changed discussion status to closed
