GPU is not fully utilized

#13
by cailuyu - opened

Only about 20% GPU utilization (CUDA 11.4) with the following sample code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-34B", device_map="auto", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B", trust_remote_code=True)
inputs = tokenizer("There's a place where time stands still. A place of breath taking wonder, but also", return_tensors="pt")
max_length = 256

outputs = model.generate(
    inputs.input_ids.cuda(),
    max_length=max_length,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

01-ai org

Sorry, but we can't do much about that...
If you want higher GPU utilization, you could try running inference with a bigger batch size.
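
For example, here is a minimal sketch of batched generation with transformers; the prompts, padding setup, and max_new_tokens value are illustrative assumptions, not something tuned for Yi-34B specifically:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-34B", device_map="auto", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B", trust_remote_code=True)

# Pad on the left so each prompt ends immediately before its generated tokens.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Illustrative prompts; a real workload would batch as many requests as fit in memory.
prompts = [
    "There's a place where time stands still. A place of breath taking wonder, but also",
    "The history of deep learning begins with",
    "Write a short poem about the ocean:",
    "The three most important ideas in physics are",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)

Larger batches keep the GPU busier per forward pass, at the cost of more memory.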

01-ai org

This is just sample code for inference. Consider switching to another inference engine (like vLLM) if you need to run the model efficiently~
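
For reference, a rough vLLM sketch; the sampling parameters and tensor_parallel_size are assumptions, so adjust them to your hardware:

from vllm import LLM, SamplingParams

# tensor_parallel_size is an assumption; set it to the number of GPUs you want to shard the 34B weights across.
llm = LLM(model="01-ai/Yi-34B", trust_remote_code=True, tensor_parallel_size=2)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = ["There's a place where time stands still. A place of breath taking wonder, but also"]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)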

@reedcli @lxglbk @cailuyu

I can confirm exllamav2 reaches around 90% GPU utilization with flash-attention 2, CUDA 11.8, and torch 2.0.1 on an A5000.

HF transformers is slow for almost any model because its generation path isn't specialized for CUDA.

Are you guys partnered with exllamav2?

@Yhyu13 no lol, exllamav2 was created by turboderp, with other contributors helping as well. Exllamav2 is specifically designed for the fastest single-batch inference, and it's in fact probably the fastest option.

vLLM is better for batching, but most people aren't going to send in tens or a hundred input prompts at once.
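
For anyone curious, a rough exllamav2 sketch along the lines of its example scripts; model_directory is a hypothetical path to a local (typically EXL2-quantized) copy of Yi-34B, and the class names and settings follow the exllamav2 examples at the time, so they may differ between versions:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Hypothetical path to a local (e.g. EXL2-quantized) Yi-34B checkpoint.
model_directory = "/models/Yi-34B-exl2"

config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.95

prompt = "There's a place where time stands still. A place of breath taking wonder, but also"
output = generator.generate_simple(prompt, settings, 256)
print(output)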

FancyZhao changed discussion status to closed
