Does it increase inference speed on the same GPU?

#1
by aibarito-ua - opened

Hello! I really like your models. Thank you for making them.
Can using your GPTQ version increase inference speed compared to the original model, when both are run on the same GPU?

Usually GPTQ is slower than the original model at inference.

See https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/191#issuecomment-1512571376

It somewhat depends on what GPTQ library is used. I don't recommend using GPTQ-for-LLaMa any more.

ExLlama will be significantly faster than the unquantised model, e.g. 50 tokens/s vs 10-15 tokens/s for a 30B model on a 4090 GPU.
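If you want to check numbers like these on your own hardware, here's a rough sketch of measuring tokens/s with plain transformers. The model name is a placeholder and fp16 on a single GPU is assumed:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder: substitute your model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")

# Time only the generate call; synchronise so GPU work is included.
torch.cuda.synchronize()
t0 = time.time()
out = model.generate(**inputs, max_new_tokens=200)
torch.cuda.synchronize()
elapsed = time.time() - t0

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```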

AutoGPTQ is slower than ExLlama, but still faster than the unquantised model.
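If you want to try AutoGPTQ, a minimal sketch of loading a GPTQ model looks something like this; the repo name is a placeholder and the exact arguments can vary between AutoGPTQ versions:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Placeholder repo name: substitute the GPTQ model you downloaded.
model_dir = "TheBloke/some-model-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```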

However, in production / large-scale settings with multiple requests arriving in parallel, unquantised models can be faster, because you can use inference servers that support batching, like vLLM.
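For illustration, a minimal vLLM sketch with batched prompts might look like this (the model name is a placeholder; vLLM handles the batching internally):

```python
from vllm import LLM, SamplingParams

# Assumption: any unquantised HF model; the name below is a placeholder.
llm = LLM(model="huggyllama/llama-7b")
params = SamplingParams(temperature=0.8, max_tokens=50)

# vLLM batches these prompts internally (continuous batching),
# which is where its throughput advantage comes from.
prompts = [
    "The capital of France is",
    "Quantisation reduces model size by",
    "In an inference server, batching means",
]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```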

But for home / single-user use, you should expect GPTQ to be faster than unquantised, and often significantly faster.

Use ExLlama if you can.
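A rough sketch of basic generation with ExLlama, following the pattern of the example scripts in the ExLlama repo (https://github.com/turboderp/exllama). Note that ExLlama isn't a pip package, so these imports assume you're running from a checkout of that repo, and all paths are placeholders:

```python
import glob
import os

# These imports are relative to a checkout of the ExLlama repo.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/models/some-model-GPTQ"  # placeholder path

config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=50))
```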

aibarito-ua changed discussion status to closed
