Please check these quantizations.

#40
by ZeroWw - opened

I don't have enough resources to run all tests, but I came up with a slightly different way to quantize models.

As you will see, the f16.q6 and f16.q5 files are smaller than the q8_0 one, and their output is very similar to the pure f16.

https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-GGUF/tree/main

This is how I did it:

echo Quantizing f16/q5
./build/bin/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 ${model_name}.f16.gguf ${model_name}.f16.q5.gguf q5_k $(nproc) &>/dev/null
echo Quantizing f16/q6
./build/bin/llama-quantize --allow-requantize --output-tensor-type f16 --token-embedding-type f16 ${model_name}.f16.gguf ${model_name}.f16.q6.gguf q6_k $(nproc) &>/dev/null
echo Quantizing q8_0
./build/bin/llama-quantize --allow-requantize --pure ${model_name}.f16.gguf ${model_name}.q8.gguf q8_0 $(nproc) &>/dev/null
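These commands assume ${model_name}.f16.gguf already exists. For completeness, here is a minimal sketch of the conversion step that would produce it, assuming llama.cpp's convert_hf_to_gguf.py script (script name, path and flags may differ between llama.cpp versions; <path_to_hf_model> is a placeholder for the downloaded Hugging Face model directory):

echo Converting HF model to f16 GGUF
python convert_hf_to_gguf.py <path_to_hf_model> --outtype f16 --outfile ${model_name}.f16.gguf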

I kept the output and embedding tensors at f16 and quantized the other tensors to q5_k or q6_k.
It would be great if someone could test them more thoroughly.
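For example, a rough first test could compare the perplexity of each variant on the same text. This is only a sketch, assuming llama.cpp's llama-perplexity tool and some plain-text evaluation file (wiki.test.raw here is just an example name):

for q in f16 f16.q5 f16.q6 q8; do
  echo Perplexity for ${model_name}.${q}.gguf
  ./build/bin/llama-perplexity -m ${model_name}.${q}.gguf -f wiki.test.raw -c 512
done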

P.S.
Even the f16/q5 is not that different from the pure f16, and it is way better than the q8_0.

Please start posting some side-by-side comparisons. We really need to see how the model output actually differs; there is no sense asking for this everywhere without proof that there is a difference.
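Even something as simple as deterministic generations from each variant with the same prompt, seed and temperature would help. A minimal sketch, assuming llama.cpp's llama-cli and the file names from the post above (flags may differ between versions):

for q in f16 f16.q5 f16.q6 q8; do
  echo Output from ${model_name}.${q}.gguf
  ./build/bin/llama-cli -m ${model_name}.${q}.gguf -p "Explain the difference between RAM and VRAM." -n 128 --temp 0 -s 42 2>/dev/null
done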
