TheBloke/Llama-2-70B-Chat-AWQ


Performance and latency vs. GPTQ

#3

by krumeto - opened Nov 14, 2023

krumeto

Nov 14, 2023

I know AWQ is expected to be faster than GPTQ at similar quality, but reading through the TGI issues, people report similar latency. Also, there is only one AWQ model on the LLM Leaderboard (TheBloke/Llama-2-7b-Chat-AWQ), and its score is much lower than that of TheBloke/Llama-2-7B-GPTQ (I know the base models are different, but it was the closest comparison I could find).

@TheBloke do you happen to have any experience with the quality and/or latency of AWQ vs. GPTQ? Any insights would be helpful.

Thank you in advance!

YaTharThShaRma999

Nov 14, 2023

AWQ has higher latency, but it works really well with vLLM or TGI and batching. You can get extremely fast speeds and multiple responses.

However, ExLlamaV2 with GPTQ is still considerably faster than vLLM for a single response. If you want multiple responses, use vLLM + AWQ.
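To check the single-response vs. batched trade-off on your own setup, a small timing harness along these lines might help. This is only a sketch: `measure` and `fake_backend` are hypothetical names, and `fake_backend` is a stand-in that sleeps for a fixed time per call, not a real AWQ or GPTQ model. Swap in a function that calls your actual server or library; the numbers it prints here say nothing about real quantized models.

```python
import time
from typing import Callable, List, Sequence, Tuple


def measure(
    generate_batch: Callable[[List[str]], List[str]],
    prompts: Sequence[str],
    batch_size: int,
) -> Tuple[float, float]:
    """Return (mean seconds per batch call, prompts completed per second)."""
    if not prompts or batch_size < 1:
        raise ValueError("need at least one prompt and batch_size >= 1")

    call_times = []
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        chunk = list(prompts[i : i + batch_size])
        t0 = time.perf_counter()
        outputs = generate_batch(chunk)
        call_times.append(time.perf_counter() - t0)
        if len(outputs) != len(chunk):
            raise RuntimeError("backend returned a different number of outputs")
    total = time.perf_counter() - start

    mean_latency = sum(call_times) / len(call_times)
    throughput = len(prompts) / total
    return mean_latency, throughput


def fake_backend(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in: fixed cost per call, regardless of batch size."""
    time.sleep(0.01)
    return ["response to: " + p for p in prompts]


if __name__ == "__main__":
    prompts = ["Hello"] * 8
    for bs in (1, 8):
        latency, throughput = measure(fake_backend, prompts, batch_size=bs)
        print(f"batch_size={bs}: {latency:.4f} s/call, {throughput:.1f} prompts/s")
```

Running it with `batch_size=1` approximates the single-response case, while larger batch sizes show whether throughput improves when requests are grouped.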
