4 bit vs 8 bit

#2
by Doomed1986 - opened

I read on the LocalLLaMA subreddit that there's no benefit to 8-bit over 4-bit because quantization is so effective. Maybe I'm confusing things, though, and the real claim is that 4-bit at a higher parameter count beats 8-bit at a lower one.
I have 24 GB of VRAM and the 8-bit branch here is around 20 GB, so I'm leaning in that direction, unless the highest-quality 4-bit quant really is equal in output quality..?

@Doomed1986 There is a quality difference between 8-bit and 4-bit, but the difference between a 4-bit 13B model and an 8-bit 7B model is much larger.

Simply put, it's better to run a bigger, more heavily quantized model (down to a limit of around Q3, or the SOTA Q2 quants) than a smaller, less quantized or even full-precision one.

Here are perplexity charts that show this
(lower is better; perplexity is basically a synthetic benchmark of how well the model predicts text)
[image: perplexity chart comparing model sizes and quantization levels]
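If it helps to see what perplexity actually measures: it's just the exponential of the average negative log-likelihood per token, so a lower value means the model is less "surprised" by the evaluation text. A tiny Python sketch (the log-prob values below are made up purely for illustration):

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp of the average negative log-likelihood per token.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Made-up log-probabilities two models might assign to the same tokens.
logprobs_model_a = [-1.8, -2.0, -1.7, -1.9]
logprobs_model_b = [-2.1, -2.3, -2.0, -2.2]

print(perplexity(logprobs_model_a))  # ~6.4 -> lower perplexity, better
print(perplexity(logprobs_model_b))  # ~8.6 -> higher perplexity, worse
```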

Thanks for your reply. So there is a quality improvement with 8-bit, but a bigger model is preferred over a lighter quant. Could I trouble you for an ideal loader and settings for 8-bit? It's not ExLlama, and I'm having trouble getting any generation faster than 150-220 seconds with AutoGPTQ or Transformers on an RTX 4090. (Not this model, the 13.5 GB Psyfighter, but if I can get good settings I'd use other 8-bit models like this one.)
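For reference, this is roughly the kind of loading code I'm using with the Transformers GPTQ integration (just a sketch; the repo name is a placeholder, and it assumes transformers with accelerate and auto-gptq installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Some-Model-GPTQ"  # placeholder GPTQ repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda:0",        # keep all layers on the 4090
    torch_dtype=torch.float16,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```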

@Doomed1986 No, I think I phrased it a bit wrong.
Basically, try to fit the biggest model you can at any quantization of Q3 or above (like 4-bit or 3-bit), or llama.cpp's SOTA Q2 quant.

I would recommend llama.cpp for inference, as it's only very slightly slower than ExLlamaV2 but has higher-quality quants.
It can also offload layers to the CPU, and you can use much larger contexts as well.

With llama.cpp and the SOTA Q2 quant, you can probably run a 70B model at a pretty high context.

You could also run Mixtral, but it might be hard to find a good fine-tuned version for RP; Mixtral is usually best for more common tasks.
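If it helps, here's a rough sketch of what that looks like with the llama-cpp-python bindings (the GGUF filename, layer setting, and context size are just placeholder values, assuming the package is installed with GPU support):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-70b.Q2_K.gguf",  # placeholder path to a downloaded GGUF quant
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU; set a smaller number to split between GPU and CPU
    n_ctx=8192,       # context window; keep within what the model/quant supports
)

out = llm("Write a short greeting.", max_tokens=64, temperature=0.7)
print(out["choices"][0]["text"])
```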

Thanks. Yeah, I understand you, I'm just having trouble comprehending the chart, probably because I'm dumb! (And the bits aren't labelled.) But basically the takeaway is "get the bigger model"!
I'm looking into the 2-bit SOTA quants now. Found this: https://docs.google.com/spreadsheets/d/18woLrIBdVGUr9CuFDbK9pl_6QzEBl09hfnoe4Qkg7Hg/edit#gid=0.
I don't want to offload if I can help it, because won't that be really slow? (Ryzen 9 5900X.) As for context, I stick to the model default, as I heard RoPE scaling impacts inference quality.
