Please support this method:
Please add a flag so that the embed and output tensors can be quantized in one way and all the other tensors in another.
It would be very useful for tests.
P.S.
llama.cpp already has that option.
Hi @ZeroWw - can you please point me to this option, and give an example of what you mean?
@reach-vb
go check my model directory in my profile.
All of those models were quantized by keeping the output and embed tensors at F16 and the other tensors at q5_k, q6_k or q8_0.
The result is that none of them seem degraded to me at all (even if ~20% bigger), and, for example, f16/q6 is way better than a pure q8 while being way smaller.
@reach-vb please add that feature.
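For anyone who wants to double-check how those files are laid out, the per-tensor quantization types can be read back from the GGUF itself. A minimal sketch, assuming the gguf Python package (pip install gguf) and a placeholder file name:

```python
from gguf import GGUFReader

reader = GGUFReader("model.f16.q5_k.gguf")  # placeholder file name
for t in reader.tensors:
    # In these mixed quants, token_embd.weight and output.weight should
    # report F16, while the remaining tensors report Q5_K / Q6_K / Q8_0.
    print(f"{t.name}: {t.tensor_type.name}")
```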
Heya! It might be neat, long-term, to have an advanced settings menu with misc. stuff like that; for this one, maybe we can add another checkbox in there for using this particular flag in process model :)
if allow_requantize:
    # Keep the output and token-embedding tensors at F16; quantize the rest with the selected method.
    if use_imatrix:
        quantise_ggml = f"./llama.cpp/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 --imatrix {imatrix_path} {fp16} {quantized_gguf_path} {imatrix_q_method}"
    else:
        quantise_ggml = f"./llama.cpp/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 {fp16} {quantized_gguf_path} {q_method}"
else:
    # Current behaviour: quantize every tensor with the selected method.
    if use_imatrix:
        quantise_ggml = f"./llama.cpp/llama-quantize --imatrix {imatrix_path} {fp16} {quantized_gguf_path} {imatrix_q_method}"
    else:
        quantise_ggml = f"./llama.cpp/llama-quantize {fp16} {quantized_gguf_path} {q_method}"
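Just to spell out what the new branch would produce: with, say, q_method = "Q5_K_M" and placeholder file names, the string expands to something like the sketch below (an illustration of the command only, not the Space's actual surrounding code).

```python
# Illustration only: placeholder paths and method, mirroring the f-string above.
fp16 = "model.fp16.gguf"
quantized_gguf_path = "model.Q5_K_M.gguf"
q_method = "Q5_K_M"

quantise_ggml = (
    "./llama.cpp/llama-quantize --allow-requantize "
    "--output-tensor-type f16 --token-embedding-type f16 "
    f"{fp16} {quantized_gguf_path} {q_method}"
)
print(quantise_ggml)
# -> ./llama.cpp/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.fp16.gguf model.Q5_K_M.gguf Q5_K_M
```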
Also, a nice feature would be the option to also output the model in safetensors format (but quantized), or a tab for GGUF >> safetensors conversion.
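Roughly what I have in mind for the GGUF >> safetensors direction, as a very rough sketch assuming the gguf and safetensors Python packages and placeholder file names (safetensors has no dtype for GGML block quants, so quantized tensors would come out as raw bytes unless they are dequantized first):

```python
import numpy as np
from gguf import GGUFReader
from safetensors.numpy import save_file

reader = GGUFReader("model.q5_k.gguf")  # placeholder input
tensors = {
    # F16/F32 tensors come out as normal arrays; block-quantized tensors
    # are kept here in their raw byte layout.
    t.name: np.ascontiguousarray(t.data)
    for t in reader.tensors
}
save_file(tensors, "model.q5_k.safetensors")  # placeholder output
```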
@ZeroWw - I see, how do you put the output & embeds in F16?
cc: @SixOpen - thoughts on this? Do you think it's worth adding?
Sorry, I missed that question. SixOpen's answer is the one I would have given.
Whether it's worth it remains to be seen, but I see little or no degradation even at q5_k.
It would also be interesting to try f16/q4 or q8/q6.
In other words: ONE quantization type for output and embed and one for the others.
The output tensor is basically "how the model expresses itself", while the embed tensor is "how it understands and abstracts". Everything in the middle is, let's say, the "thinking process" from input to output.
If, for a test, you quantize embed/output at Q4 and the rest at f16, you will get a model that feels lobotomized and speaks like a brain-damaged child.
If you do the opposite (which is what I do), everything works better.
After a few empirical tests, I found that f16/q5 is smaller, faster and smarter (less degraded) than a pure Q8.
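For anyone who wants to reproduce that comparison, the two directions look roughly like this with llama-quantize (a sketch: the file names and the local llama.cpp build are assumptions on my side):

```python
import subprocess

FP16_GGUF = "model.f16.gguf"  # placeholder: the unquantized F16 export

# Direction 1 ("lobotomized"): embed/output squeezed to q4_0, everything else kept at f16.
subprocess.run([
    "./llama.cpp/llama-quantize",
    "--token-embedding-type", "q4_0", "--output-tensor-type", "q4_0",
    FP16_GGUF, "model.q4-emb.f16.gguf", "f16",
], check=True)

# Direction 2 (what I do): embed/output kept at f16, everything else at Q5_K.
subprocess.run([
    "./llama.cpp/llama-quantize",
    "--token-embedding-type", "f16", "--output-tensor-type", "f16",
    FP16_GGUF, "model.f16-emb.q5_k.gguf", "Q5_K",
], check=True)
```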
Note: this might also make it easier to experiment with quantizing different tensors in different ways, to get faster and smaller models that are perhaps slightly dumber in areas you don't care about (which is mostly what imatrix does).