Quantization to fewer than 8 bits?

#25
by ibalampanis - opened

How did you manage to quantize the model to 6 bits? I am referring to models named Q6_K and similar.

llama.cpp offers only 8, 16, and 32 bits. Am I mistaken?

Thank you!

@ibalampanis No, llama.cpp supports 1-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit, and 32-bit quantization.

Most people use 4-bit, since quality doesn't degrade noticeably and the speed is great. Lower than that, quality can actually start to degrade, and 1-bit is trash.

Q6 and Q5 are slightly better than 4-bit and the highest you should go. 8-bit is way too slow, and it's the same quality as Q6.
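
For reference, those bit widths map to GGUF quantization type names accepted by the separate quantize tool. A minimal sketch, assuming a llama.cpp build where the binary is named quantize (newer builds call it llama-quantize, and running it with no arguments prints the full list of supported types):

    ./quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M   # ~4-bit, the common default
    ./quantize model-f16.gguf model-Q5_K_M.gguf Q5_K_M   # 5-bit
    ./quantize model-f16.gguf model-Q6_K.gguf Q6_K       # 6-bit, close to q8_0 quality
    ./quantize model-f16.gguf model-Q8_0.gguf Q8_0       # 8-bit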


Personally I stick with Q6; however, I don't see a huge speed difference with 8-bit, as long as it still fits in my GPU.

Why can I find only q8_0, f16, and f32 as argument options in llama.cpp? Thank you for your response!

-- UPDATE
I found it. Thanks a lot!
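
For anyone who hits the same wall: the q8_0/f16/f32 choices are the --outtype options of the convert script, which only produces the initial GGUF file; the lower-bit variants (Q6_K, Q5_K_M, Q4_K_M, and so on) are made in a second step with the quantize binary. A minimal sketch of that two-step workflow, assuming the script is named convert.py (the name varies across llama.cpp versions, e.g. convert-hf-to-gguf.py):

    python convert.py ./my-model-dir --outtype f16        # step 1: HF weights -> f16 GGUF
    ./quantize my-model-f16.gguf my-model-Q6_K.gguf Q6_K  # step 2: f16 GGUF -> 6-bit GGUF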

ibalampanis changed discussion status to closed
