Incredibly slow inference speed

#2 · opened by famunir

I am trying to test the Qwen/Qwen1.5-72B-Chat-AWQ model for inference, but it is incredibly slow. Other quantized variants, such as the 8-bit model, are very slow as well. I am using the same setup that I used for running the unquantized Qwen/Qwen1.5-72B-Chat model, whose speed is quite acceptable. My hardware is 2 NVIDIA A100 80GB GPUs. Is there a particular reason for this behavior?
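For context, the question does not include code, so the following is only a minimal sketch of the kind of loading setup described, assuming `transformers` with `device_map` handling the multi-GPU sharding:

```python
# Hypothetical sketch of the setup described above: loading the AWQ checkpoint
# with transformers and sharding it across the two A100s via device_map.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-72B-Chat-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # AWQ weights load in float16
    device_map="auto",    # split layers across both GPUs
)

# Build a chat-formatted prompt and generate a short reply.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```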

Try bitsandbytes 4-bit instead; I get comparatively satisfying speeds with that. I also think that GGUF is faster.
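A minimal sketch of what that suggestion might look like with `transformers` (assuming `bitsandbytes` is installed; the specific quantization parameters below are illustrative, not prescribed by the reply):

```python
# Sketch: load the full-precision checkpoint with bitsandbytes 4-bit
# quantization instead of using the pre-quantized AWQ model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen1.5-72B-Chat"  # the unquantized checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard across available GPUs
)
```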
