My alternative quantizations.
These are my own quantizations (updated almost daily).
The difference from normal quantizations is that I quantize the output and embed tensors to f16, and the other tensors to q5_k, q6_k, or q8_0.
This creates models with little or no degradation at a smaller size.
They run at about 3-6 tokens/sec on CPU only using llama.cpp, and obviously faster on computers with powerful GPUs.
ALL the models were quantized in this way:
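For reference, a minimal sketch of the command (assuming a recent llama.cpp build that has the --output-tensor-type and --token-embedding-type options; the file names here are placeholders):

```bash
# Quantize an f16 GGUF: keep the output and token-embedding tensors
# at f16, and quantize everything else to q6_k.
./llama-quantize \
  --output-tensor-type f16 \
  --token-embedding-type f16 \
  model-f16.gguf model-f16-q6_k.gguf q6_k
```

Swap q6_k for q5_k or q8_0 to get the other variants.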
bartowski did extensive testing on the input/output weights. The result was that Q8 performed better and faster than F16.
@Nelathan
These have output and embed at f16 and the rest at q8_0, which is better than pure Q8 (also included in the directory); test them yourself.
f16/q6_k is a very good trade-off anyway.
I absolutely don't see worse performance; the models quantized this way feel better than the normal ones (which I used to use).
I even tried again with the old models I was using; same result (Mistral 8B Instruct v0.3, for example).
Feelings are very deceiving. What you need is hard numbers and measurements.
No. That's what you need.
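If you want those hard numbers, llama.cpp's perplexity tool is the usual way to get them; a minimal sketch (the file names are placeholders, and wiki.test.raw stands for the usual WikiText-2 test split):

```bash
# Compare perplexity of two quants on the same text;
# lower perplexity means less degradation vs. the f16 reference.
./llama-perplexity -m model-f16-q6_k.gguf -f wiki.test.raw
./llama-perplexity -m model-q8_0.gguf -f wiki.test.raw
```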
This is getting out of hand. Please centralize what you have to say in one place:
https://huggingface.co/NeverSleep/Lumimaid-v0.2-12B/discussions/4