Does it increase inference speed on the same GPU?

#1
by aibarito-ua - opened

Hello! I really like your models. Thank you for making them.
Can using your GPTQ version increase inference speed compared to the original model, when both are run on the same GPU?

Usually GPTQ is slower than the original model at inference.

See https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/191#issuecomment-1512571376

It somewhat depends on what GPTQ library is used. I don't recommend using GPTQ-for-LLaMa any more.

ExLlama will be significantly faster than the unquantised model, e.g. 50 tokens/s vs 10-15 tokens/s for a 30B model on a 4090 GPU.
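If you want to check numbers like these on your own hardware, here's a rough sketch of measuring tokens/s with plain transformers. The model name is a placeholder and fp16 on a single GPU is assumed:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder: substitute your model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")

# Time only the generate call; synchronise so GPU work is included.
torch.cuda.synchronize()
t0 = time.time()
out = model.generate(**inputs, max_new_tokens=200)
torch.cuda.synchronize()
elapsed = time.time() - t0

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```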

AutoGPTQ is slower than ExLlama, but still faster than the unquantised model.
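If you want to try AutoGPTQ, a minimal sketch of loading a GPTQ model looks something like this; the repo name is a placeholder and the exact arguments can vary between AutoGPTQ versions:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Placeholder repo name: substitute the GPTQ model you downloaded.
model_dir = "TheBloke/some-model-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```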

However, in production / large-scale settings with multiple requests arriving in parallel, unquantised models can be faster, because you can use inference servers that support batching, like vLLM.
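For illustration, a minimal vLLM sketch with batched prompts might look like this (the model name is a placeholder; vLLM handles the batching internally):

```python
from vllm import LLM, SamplingParams

# Assumption: any unquantised HF model; the name below is a placeholder.
llm = LLM(model="huggyllama/llama-7b")
params = SamplingParams(temperature=0.8, max_tokens=50)

# vLLM batches these prompts internally (continuous batching),
# which is where its throughput advantage comes from.
prompts = [
    "The capital of France is",
    "Quantisation reduces model size by",
    "In an inference server, batching means",
]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```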

But for home / single-user use, you should expect GPTQ to be faster than unquantised, and often significantly faster.

Use ExLlama if you can.
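A rough sketch of basic generation with ExLlama, following the pattern of the example scripts in the ExLlama repo (https://github.com/turboderp/exllama). Note that ExLlama isn't a pip package, so these imports assume you're running from a checkout of that repo, and all paths are placeholders:

```python
import glob
import os

# These imports are relative to a checkout of the ExLlama repo.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/models/some-model-GPTQ"  # placeholder path

config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=50))
```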

aibarito-ua changed discussion status to closed
