fp16 variant

#1
by Utochi - opened

I'd like to see the fp16 variant made available if possible. The 8B fp16 censored version is proving to be quite surprising with its quality (much better than the Q8), so I'm excited to see the uncensored version.

Where is the proof that it is proving that?

While providing the fp16 variant of small models is something that I indeed might want to do at some point (but my tooling can't provide it automatically yet, as I don't want it for every model), this is not yet on the horizon.

In the meantime, I have not seen anything to substantiate that the fp16 variant really adds anything, other than llama bugs causing issues. Actual measurements don't really show degradation. I'm not talking about this model specifically, but it could well be the quantization itself that causes issues with llama-3, in which case fp16 would be just as affected.

In some side-by-side comparisons of Q8 and F16, I noticed a considerable difference in the testing I did. Though I think we're at a very early stage of using this model, so quantization methods are probably still being ironed out. The models used in my testing were:

https://huggingface.co/PrunaAI/Meta-Llama-3-8B-Instruct-GGUF-smashed/blob/main/Meta-Llama-3-8B-Instruct.fp16.gguf
https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf

The test I used was the question:

"I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?"

Both models got the question wrong; however, I noticed the F16 version got a much closer answer each time.
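For reference, here is a minimal sketch of how the apple count works out under one common reading of the puzzle (assuming the 6 won coins are split three ways, so 2 are kept, and that all the dropped apples are collected); the "where is the river" part is deliberately ambiguous and isn't scored here:

```python
# Rough reference calculation for the apple/coin puzzle, assuming one
# common reading of the ambiguous parts (not an official answer key).

apples = 10          # start with 10 apples
coins = 3            # find 3 gold coins in the river

apples -= 4          # lose 4 apples
coins += 1           # gain a gold coin

apples += 3 * 6      # three birds drop 6 apples each

# Win 6 coins online but share them equally with 2 teammates:
# assuming a three-way split, keep 6 / 3 = 2 coins.
coins += 6 // 3

# Buy apples with all remaining coins at 0.5 coins per apple.
apples += int(coins / 0.5)
coins = 0

print(apples)  # 36 under these assumptions
```

With that reading the expected count is 36 apples, which at least gives a concrete target when eyeballing how far off each model's answer is.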

This is by no means a very professional test; it was just my observation.

Quantization methods are not being ironed out; they are absolutely stable, especially Q8_0, which has not seen any changes for a very, very long time. What is being ironed out with llama-3 is chat templates and special token handling, and possibly tensor handling for new models, all of which will affect fp16 in the same negative way as Q8_0. You'd need a test that is objective and deterministic; otherwise you're just going to imagine things.
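As one way to make such a comparison more objective, here is a minimal sketch (assuming the llama-cpp-python bindings are installed; the GGUF paths are placeholders, not files from this repo) that runs both models greedily at temperature 0 with a fixed seed on the same prompts, so repeated runs are reproducible and the outputs can be compared directly:

```python
# Minimal sketch of a deterministic side-by-side comparison, assuming the
# llama-cpp-python bindings and local GGUF files (placeholder paths below).
from llama_cpp import Llama

PROMPTS = [
    "I have 10 apples. I find 3 gold coins ...",  # e.g. the apple/coin puzzle above
]

def run(model_path: str) -> list[str]:
    # temperature=0 makes decoding greedy, and a fixed seed removes the
    # remaining randomness, so repeated runs give identical output.
    llm = Llama(model_path=model_path, seed=42, n_ctx=2048, verbose=False)
    outputs = []
    for prompt in PROMPTS:
        result = llm(prompt, max_tokens=256, temperature=0.0)
        outputs.append(result["choices"][0]["text"])
    return outputs

fp16_out = run("Meta-Llama-3-8B-Instruct.fp16.gguf")   # placeholder path
q8_out = run("Meta-Llama-3-8B-Instruct-Q8_0.gguf")     # placeholder path

for prompt, a, b in zip(PROMPTS, fp16_out, q8_out):
    print(prompt)
    print("fp16:", a.strip())
    print("Q8_0:", b.strip())
    print("identical" if a == b else "differs")
```

Even then, a single riddle is a weak signal; something like perplexity measured over a larger corpus is closer to the kind of objective, deterministic measurement meant above.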

But again, I do plan for fp16 to be available for smaller models by default, just not at this time, and not without dependable evidence.

mradermacher changed discussion status to closed
