for faster GPU inference

by harithushan - opened

Can anyone provide code for faster GPU inference with a GPTQ model? For me it takes around 2 minutes to get a response.

I have the same problem :c

Using exllama or exllama v2 should greatly help, as it's the fastest single-user inference repository so far, I believe. Also, llama.cpp might help if your GPU is really old or you want to split the model across the CPU as well.
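For reference, here is a minimal sketch of enabling the ExLlama kernels through `transformers` + `auto-gptq` (the `GPTQConfig(use_exllama=...)` route). The model id below is just an example placeholder, not one mentioned in this thread; substitute your own GPTQ repo. This assumes a recent `transformers` with `auto-gptq` installed and a CUDA GPU available:

```python
# Hedged sketch: GPTQ inference with ExLlama kernels via transformers.
# The model id is a placeholder example -- replace it with your own GPTQ model.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # example repo, substitute yours

# use_exllama=True selects the faster ExLlama CUDA kernels for 4-bit weights
quant_config = GPTQConfig(bits=4, use_exllama=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                 # place the weights on the GPU
    quantization_config=quant_config,
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

If generation is still slow after this, check with `model.hf_device_map` that no layers were offloaded to CPU, since CPU offload is usually what turns a few seconds of generation into minutes.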
