The inference speed is too slow on 4*A10G GPUs

#47
by PierrePeng - opened

As the title says, the inference speed is about 1 word per second. There is enough GPU RAM, and all the weights are loaded on the GPUs without offloading, but GPU utilization only reaches about 40%.

I want to know whether this inference speed is normal for my hardware.
If not, what is the reason and how can I improve the speed?
If it is, do you have any recommendations for hardware?
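
For context, here is a minimal sketch of the kind of plain `transformers` setup assumed in this thread (the model id below is a placeholder, not the actual model). With `device_map="auto"`, layers are split across the GPUs as a naive pipeline, so only one GPU computes at a time, which would be consistent with the low overall utilization:

```python
# Minimal sketch, assuming a standard transformers + accelerate setup.
# The model id is a placeholder, not the model discussed in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder; substitute the actual model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp32 weights are roughly 2x slower and use twice the memory
    device_map="auto",          # splits layers across GPUs; only one GPU is busy at a time
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```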

PierrePeng changed discussion status to closed

Hi. I'm facing the same issue. Do you have any tips that can speed up the model inference? Thanks!

Following the tutorial at the link below should answer your question.

https://github.com/huggingface/text-generation-inference
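
Once a text-generation-inference server is running (for example, launched with the Docker command from its README), you can query it with the `text-generation` Python client. This is only a rough sketch; the endpoint, shard count, and prompt below are assumptions, not values from this thread:

```python
# Rough sketch of querying a text-generation-inference (TGI) server.
# Assumes the server was started separately, e.g. via the Docker command in the TGI README
# with --num-shard 4 to use tensor parallelism across the 4 A10Gs.
# pip install text-generation
from text_generation import Client

client = Client("http://127.0.0.1:8080")  # assumed local endpoint

# Non-streaming generation
response = client.generate("What is Deep Learning?", max_new_tokens=64)
print(response.generated_text)

# Streaming generation, token by token
for token in client.generate_stream("What is Deep Learning?", max_new_tokens=64):
    if not token.token.special:
        print(token.token.text, end="", flush=True)
```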

Thank you! Let me give it a try.

I am facing the same problem: GPU usage only reaches about 50%.
I use two RTX 3090s to load a 13B LLM, with the model evenly distributed across the GPUs.