Performance and latency vs. GPTQ

#3
by krumeto - opened

I know AWQ is expected to be faster than GPTQ at similar quality, but reading through the TGI issues, people report similar latency. At the same time, there is only one AWQ model on the LLM Leaderboard (TheBloke/Llama-2-7b-Chat-AWQ), and its score is (way) lower than that of TheBloke/Llama-2-7B-GPTQ (I know the base models are different, but it was the closest comparison I could find).

@TheBloke do you happen to have any experience with the quality and/or latency of AWQ vs. GPTQ? Any insights would be helpful.

Thank you in advance!

AWQ has higher latency, but it's really good with vLLM or TGI and batching. You can get extremely fast speeds when generating multiple responses.

However, ExLlamaV2 with GPTQ is still considerably faster than vLLM for a single response. If you want multiple responses, use vLLM + AWQ.
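If you want to check the single-response vs. batched difference on your own hardware, here is a rough sketch of a timing helper. It is only an illustration, not a rigorous benchmark: `time_generation`, `compare_single_vs_batch`, and `make_vllm_generate` are hypothetical names made up for this example, and the vLLM hookup assumes vLLM is installed, a GPU is available, and the AWQ checkpoint mentioned above loads with `quantization="awq"`.

```python
import time
from typing import Callable, Dict, List, Sequence

# A "generate" callable takes a list of prompts and returns one text per prompt.
GenerateFn = Callable[[Sequence[str]], List[str]]


def time_generation(generate: GenerateFn, prompts: Sequence[str]) -> Dict[str, float]:
    """Time a single generate() call over all given prompts."""
    start = time.perf_counter()
    outputs = generate(prompts)
    elapsed = time.perf_counter() - start
    return {
        "num_requests": len(prompts),
        "num_outputs": len(outputs),
        "seconds": elapsed,
        # Guard against a zero-length interval on very fast dummy backends.
        "requests_per_second": len(prompts) / elapsed if elapsed > 0 else float("inf"),
    }


def compare_single_vs_batch(generate: GenerateFn, prompts: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Time one prompt alone (latency) and then all prompts together (throughput)."""
    single = time_generation(generate, prompts[:1])
    batch = time_generation(generate, prompts)
    return {"single": single, "batch": batch}


def make_vllm_generate(model_id: str, max_tokens: int = 128) -> GenerateFn:
    """Hypothetical adapter: wrap a vLLM engine as a GenerateFn (assumes vLLM + GPU)."""
    from vllm import LLM, SamplingParams  # imported lazily so the helpers above work without vLLM

    llm = LLM(model=model_id, quantization="awq")
    params = SamplingParams(max_tokens=max_tokens)

    def generate(prompts: Sequence[str]) -> List[str]:
        results = llm.generate(list(prompts), params)
        return [r.outputs[0].text for r in results]

    return generate
```

To try it with vLLM, something like `compare_single_vs_batch(make_vllm_generate("TheBloke/Llama-2-7b-Chat-AWQ"), prompts)` should work under the assumptions above. Any other backend can be plugged in by wrapping it in a function with the same shape. Run a warm-up call first, since the first generation usually includes one-time setup cost.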
