Incredibly slow inference speed

#2 · opened by famunir

I am trying to test the Qwen/Qwen1.5-72B-Chat-AWQ model for inference, but it is incredibly slow. Other quantized variants, such as the 8-bit model, are very slow as well. I am using the same setup that I used for running the unquantized Qwen/Qwen1.5-72B-Chat model, whose speed is quite acceptable. My hardware is 2 NVIDIA A100 80GB GPUs. Is there a particular reason for this behavior?
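For context, the question does not include code, so the following is only a minimal sketch of the kind of loading setup described, assuming `transformers` with `device_map` handling the multi-GPU sharding:

```python
# Hypothetical sketch of the setup described above: loading the AWQ checkpoint
# with transformers and sharding it across the two A100s via device_map.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-72B-Chat-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # AWQ weights load in float16
    device_map="auto",    # split layers across both GPUs
)

# Build a chat-formatted prompt and generate a short reply.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```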

Try bitsandbytes 4-bit instead; I get comparatively satisfying speeds with that. I also think that GGUF is faster.
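A minimal sketch of what that suggestion might look like with `transformers` (assuming `bitsandbytes` is installed; the specific quantization parameters below are illustrative, not prescribed by the reply):

```python
# Sketch: load the full-precision checkpoint with bitsandbytes 4-bit
# quantization instead of using the pre-quantized AWQ model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen1.5-72B-Chat"  # the unquantized checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard across available GPUs
)
```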
