GPU is not fully utilized

#13
by cailuyu - opened

Only about 20% GPU utilization (CUDA 11.4) with the following sample code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-34B", device_map="auto", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B", trust_remote_code=True)
inputs = tokenizer("There's a place where time stands still. A place of breath taking wonder, but also", return_tensors="pt")
max_length = 256

outputs = model.generate(
    inputs.input_ids.cuda(),
    max_length=max_length,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

01-ai org

Sorry, but we can't do much about that...
If you want higher GPU utilization, you could try running inference with a bigger batch size.
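
For example, here is a minimal sketch of batched generation with transformers; the prompts, padding setup, and max_new_tokens value are illustrative assumptions, not something tuned for Yi-34B specifically:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-34B", device_map="auto", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B", trust_remote_code=True)

# Pad on the left so each prompt ends immediately before its generated tokens.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Illustrative prompts; a real workload would batch as many requests as fit in memory.
prompts = [
    "There's a place where time stands still. A place of breath taking wonder, but also",
    "The history of deep learning begins with",
    "Write a short poem about the ocean:",
    "The three most important ideas in physics are",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)

Larger batches keep the GPU busier per forward pass, at the cost of more memory.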

01-ai org

This is just sample code for inference. Consider switching to another inference engine (like vLLM) if you need to run the model efficiently~
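
For reference, a rough vLLM sketch; the sampling parameters and tensor_parallel_size are assumptions, so adjust them to your hardware:

from vllm import LLM, SamplingParams

# tensor_parallel_size is an assumption; set it to the number of GPUs you want to shard the 34B weights across.
llm = LLM(model="01-ai/Yi-34B", trust_remote_code=True, tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = ["There's a place where time stands still. A place of breath taking wonder, but also"]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)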

@reedcli @lxglbk @cailuyu

I can confirm exllamav2 reaches around 90% GPU utilization with flash-attention 2, CUDA 11.8, and torch 2.0.1 on an A5000.

HF transformers is slow for almost any model because its generation path isn't specialized for CUDA.

Are you guys partnered with exllamav2?

@Yhyu13 no lol, exllamav2 was created by turboderp, with other contributors helping as well. Exllamav2 is specifically designed for the fastest single-batch inference, and it's in fact probably the fastest option.

vLLM is better for batching, but most people aren't going to send in tens or a hundred input prompts at once.
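
For anyone curious, a rough exllamav2 sketch along the lines of its example scripts; model_directory is a hypothetical path to a local (typically EXL2-quantized) copy of Yi-34B, and the class names and settings follow the exllamav2 examples at the time, so they may differ between versions:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Hypothetical path to a local (e.g. EXL2-quantized) Yi-34B checkpoint.
model_directory = "/models/Yi-34B-exl2"

config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.95

prompt = "There's a place where time stands still. A place of breath taking wonder, but also"
output = generator.generate_simple(prompt, settings, 256)
print(output)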

FancyZhao changed discussion status to closed
