My alternative quantizations.
These are my own quantizations (updated almost daily).
The difference from normal quantizations is that I quantize the output and embed tensors to f16, and the other tensors to q5_k, q6_k, or q8_0.
This creates models with little or no degradation at a smaller size.
They run at about 3-6 tokens/sec on CPU only using llama.cpp, and obviously faster on computers with powerful GPUs.
ALL the models were quantized in this way:
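For reference, a minimal sketch of the command (assuming a recent llama.cpp build that has the --output-tensor-type and --token-embedding-type options; the file names here are placeholders):

```bash
# Quantize an f16 GGUF: keep the output and token-embedding tensors
# at f16, and quantize everything else to q6_k.
./llama-quantize \
  --output-tensor-type f16 \
  --token-embedding-type f16 \
  model-f16.gguf model-f16-q6_k.gguf q6_k
```

Swap q6_k for q5_k or q8_0 to get the other variants.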
bartowski did extensive testing on the input/output weights. The result was that Q8 performed better and faster than F16.
@Nelathan
These have output and embed at f16 and the rest at q8_0, which is better than pure Q8 (also included in the directory); test them yourself.
f16/q6_k is a very good trade-off anyway.
I absolutely don't see worse performance; the models quantized this way feel better than the normal ones (which I used to use).
I even tried again with the old models I was using; same result (Mistral 8B Instruct v0.3, for example).
Feelings are very deceiving. What you need is hard numbers and measurements.
No. That's what you need.
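If you want those hard numbers, llama.cpp's perplexity tool is the usual way to get them; a minimal sketch (the file names are placeholders, and wiki.test.raw stands for the usual WikiText-2 test split):

```bash
# Compare perplexity of two quants on the same text;
# lower perplexity means less degradation vs. the f16 reference.
./llama-perplexity -m model-f16-q6_k.gguf -f wiki.test.raw
./llama-perplexity -m model-q8_0.gguf -f wiki.test.raw
```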
This is getting out of hand. Please centralize what you have to say in one place:
https://huggingface.co/NeverSleep/Lumimaid-v0.2-12B/discussions/4