Check out an alternate quantization...

#7
by ZeroWw - opened

https://huggingface.co/ZeroWw/NeuralDaredevil-8B-abliterated-GGUF

(you can find more like this on my profile)

My own (ZeroWw) quantizations:
output and embed tensors quantized to f16,
all other tensors quantized to q5_k or q6_k.

Result:
both f16/q6 and f16/q5 are smaller than the standard q8_0 quantization,
and they perform as well as the pure f16.
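
If you want to check the size difference yourself, something like this should work (a rough sketch; it assumes the huggingface_hub CLI is installed and that you already have a standard q8_0 file to compare against):

# download the mixed-quant files from the repo linked above
pip install -U huggingface_hub
huggingface-cli download ZeroWw/NeuralDaredevil-8B-abliterated-GGUF --local-dir .

# compare on-disk sizes against a standard q8_0 quant
ls -lh *.gguf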

Hey thanks, can you elaborate on how you managed to improve the performance of these quants?

Sure: instead of quantizing everything in the same way, I quantized the output and embed tensors to f16 and all the other tensors to q5, q6, or q8.
The f16/q6 is almost indistinguishable from the pure f16 and it's half as big :D
f16/q5 is smaller and not as degraded as a pure q5.

Obviously these quants are bigger than the "pure" ones, but the trade-off is great (imho).
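
If you want to verify which tensor got which type in a finished file, the gguf Python package (llama.cpp's GGUF tooling) ships a gguf-dump script; a minimal sketch, with the file name as a placeholder:

pip install gguf

# prints the metadata plus the per-tensor list: output/embed should read F16,
# the remaining tensors Q5_K or Q6_K
gguf-dump {output_model_name}.gguf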

If you check my profile, you will find all my models quantized in this way.
https://huggingface.co/ZeroWw

Thanks for the info @ZeroWw! I just added your link.

Here's the command you use to quantize it in this way:

llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 {input_model_name}.gguf {output_model_name}.gguf Q5_K
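
The f16/q6_k and f16/q8_0 variants should just swap the final type argument; an untested sketch with the same placeholders:

llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 {input_model_name}.gguf {output_model_name}.q6_k.gguf Q6_K
llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 {input_model_name}.gguf {output_model_name}.q8_0.gguf Q8_0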

Yep, that's what I used and posted. Also, q6_k is great; q4_k will degrade the model too much imho, but it's still usable and obviously smaller.

Usually a 7B quantized my way at f16/q6_k runs great on CPU-only devices...
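
For reference, a CPU-only run with llama.cpp's llama-cli would look roughly like this (model file name and thread count are placeholders; tune -t to your core count):

# -m model file, -t CPU threads, -p prompt, -n tokens to generate
llama-cli -m {model_name}.f16.q6_k.gguf -t 8 -p "Hello," -n 128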

You're welcome, but in your model card you wrote "GGUF (FP16)" while these are f16/q5_k, f16/q6_k, and f16/q8_0 (mixed quants... they still don't have a name) :D
