Quantization to fewer than 8 bits?

#25
by ibalampanis - opened

How did you manage to quantize the model to 6 bits? I am referring to models named Q6_K and similar.

llama.cpp offers only 8, 16, and 32 bits. Am I mistaken?

Thank you!

@ibalampanis No, llama.cpp supports 1-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit, and 32-bit quantization.

Most people use 4-bit, since quality doesn't degrade noticeably and the speed is great. Lower than that, quality can actually start to degrade, and 1-bit is trash.

Q6 and Q5 are slightly better than 4-bit and the highest you should go. 8-bit is way too slow, and it's the same quality as Q6.
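
For reference, those bit widths map to GGUF quantization type names accepted by the separate quantize tool. A minimal sketch, assuming a llama.cpp build where the binary is named quantize (newer builds call it llama-quantize, and running it with no arguments prints the full list of supported types):

    ./quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M   # ~4-bit, the common default
    ./quantize model-f16.gguf model-Q5_K_M.gguf Q5_K_M   # 5-bit
    ./quantize model-f16.gguf model-Q6_K.gguf Q6_K       # 6-bit, close to q8_0 quality
    ./quantize model-f16.gguf model-Q8_0.gguf Q8_0       # 8-bit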


Personally I stick with Q6; however, I don't see a huge speed difference with 8-bit, as long as it still fits in my GPU.

Why can I find only q8_0, f16, and f32 as argument options in llama.cpp? Thank you for your response!

-- UPDATE
I found it. Thanks a lot!
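
For anyone who hits the same wall: the q8_0/f16/f32 choices are the --outtype options of the convert script, which only produces the initial GGUF file; the lower-bit variants (Q6_K, Q5_K_M, Q4_K_M, and so on) are made in a second step with the quantize binary. A minimal sketch of that two-step workflow, assuming the script is named convert.py (the name varies across llama.cpp versions, e.g. convert-hf-to-gguf.py):

    python convert.py ./my-model-dir --outtype f16        # step 1: HF weights -> f16 GGUF
    ./quantize my-model-f16.gguf my-model-Q6_K.gguf Q6_K  # step 2: f16 GGUF -> 6-bit GGUF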

ibalampanis changed discussion status to closed
