Quantization for fewer than 8 bits?
#25 opened by ibalampanis
How did you manage to quantize the model to 6 bits? I am referring to models named Q6_K and similar.
llama.cpp offers only 8, 16, and 32 bits. Am I mistaken?
Thank you!
@ibalampanis No, llama.cpp supports 1-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit, and 32-bit quantization.
Most people use 4-bit, since quality doesn't degrade noticeably and the speed is great. Below that, quality can actually start to degrade, and 1-bit is trash.
Q5 and Q6 are slightly better than 4-bit and the highest you should go. 8-bit is way too slow for essentially the same quality as Q6.
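For example, going from an f16 GGUF to 6-bit is a single call to the quantize tool. A minimal sketch, assuming a built llama.cpp checkout where the binary is named `llama-quantize` (older builds call it just `quantize`); the model paths are hypothetical:

```bash
# Quantize an f16 GGUF down to 6-bit (Q6_K).
# Usage: llama-quantize <input.gguf> <output.gguf> <type>
./llama-quantize ./models/llama-7b-f16.gguf ./models/llama-7b-Q6_K.gguf Q6_K
```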
Why do I find only q8_0, f16, and f32 as argument options in llama.cpp? Thank you for your response!
-- UPDATE
I found it. Thanks a lot!
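For anyone else who lands here: q8_0, f16, and f32 are the `--outtype` choices of the HF-to-GGUF conversion script; the lower-bit K-quants (Q2_K through Q6_K) are applied afterwards with the separate quantize binary shown above. A minimal sketch of the conversion step, assuming a recent llama.cpp checkout where the script is named `convert_hf_to_gguf.py` (older versions ship it as `convert.py`); the model directory is hypothetical:

```bash
# Step 1: convert the Hugging Face checkpoint to an f16 GGUF.
# --outtype offers only f32/f16/q8_0 here (newer versions add bf16),
# which is why the lower-bit types don't show up at this stage.
python convert_hf_to_gguf.py ./models/llama-7b --outtype f16 \
    --outfile ./models/llama-7b-f16.gguf

# Step 2: apply a K-quant with the separate quantize tool (see the sketch above).
```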
ibalampanis changed discussion status to closed