The inference speed is too slow on 4*A10G GPUs

#47
by PierrePeng - opened

As the title says, the inference speed is about 1 word per second. There is enough GPU RAM, and all the weights are loaded on the GPUs without offloading, but GPU utilization only reaches about 40%.

I want to know whether this inference speed is normal for my hardware.
If not, what is the reason and how can I improve the speed?
If it is, do you have any recommendations for hardware?
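
For context, here is a minimal sketch of the kind of plain `transformers` setup assumed in this thread (the model id below is a placeholder, not the actual model). With `device_map="auto"`, layers are split across the GPUs as a naive pipeline, so only one GPU computes at a time, which would be consistent with the low overall utilization:

```python
# Minimal sketch, assuming a standard transformers + accelerate setup.
# The model id is a placeholder, not the model discussed in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder; substitute the actual model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp32 weights are roughly 2x slower and use twice the memory
    device_map="auto",          # splits layers across GPUs; only one GPU is busy at a time
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```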

PierrePeng changed discussion status to closed

Hi. I'm facing the same issue. Do you have any tips that can speed up the model inference? Thanks!

Following the tutorial at the link below should answer your question.

https://github.com/huggingface/text-generation-inference
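
Once a text-generation-inference server is running (for example, launched with the Docker command from its README), you can query it with the `text-generation` Python client. This is only a rough sketch; the endpoint, shard count, and prompt below are assumptions, not values from this thread:

```python
# Rough sketch of querying a text-generation-inference (TGI) server.
# Assumes the server was started separately, e.g. via the Docker command in the TGI README
# with --num-shard 4 to use tensor parallelism across the 4 A10Gs.
# pip install text-generation
from text_generation import Client

client = Client("http://127.0.0.1:8080")  # assumed local endpoint

# Non-streaming generation
response = client.generate("What is Deep Learning?", max_new_tokens=64)
print(response.generated_text)

# Streaming generation, token by token
for token in client.generate_stream("What is Deep Learning?", max_new_tokens=64):
    if not token.token.special:
        print(token.token.text, end="", flush=True)
```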

Thank you! Let me give it a try.

I am facing the same problem: GPU usage only reaches about 50%.
I use two RTX 3090s to load a 13B LLM, with the model evenly distributed across the GPUs.