Accuracy differences for models with quantization

#12
by kmcgrath - opened

Hello! Thank you so much for your work on this leaderboard.

I was curious: should we expect a model that has been quantized to have the same Open LLM Score as a model that has not had any optimizations applied? I would assume the accuracy/latency tradeoff is large enough to make a difference on this chart.

I was not sure if you had this planned, but would love to help contribute if any help is needed.

Thanks!

I see in the About section that you use the score of the "best" form of the LLM regardless of weight class, so that makes sense. I would be interested in your plans or thoughts on this methodology, mainly whether the percentage difference in LLM scores across these optimizations is negligible enough for the chart to still represent its goal. Otherwise please feel free to close this, thanks!

Same concern here. IMO the leaderboard score should be measured AFTER all the optimizations mentioned, so that comparing them is fair. If we only compare them by tokens/s but still reuse the eval score from the unoptimized model, the comparison is unclear. Scores without optimizations belong on the Open LLM Leaderboard.

Hugging Face Optimum org • edited Sep 4, 2023

Sorry for the late response, this project doesn't have as much compute budget as the evaluation leaderboard.
optimum-benchmark, the engine of this leaderboard, only measures hardware/backend/optimization performance, and it tries to do so using as few resources (especially time) as possible.
For example, optimum-benchmark doesn't even download the model's real weights, while still reproducing the same latency/throughput/memory as the real model.
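To make that concrete, here is a rough sketch of the no-real-weights idea (this is not optimum-benchmark's actual implementation, just the general trick illustrated with transformers and accelerate; the model id is a placeholder):

```python
# Sketch: instantiate the architecture on the meta device, then materialize
# uninitialized tensors on the target device. Timing/memory match the real model
# because the shapes, dtypes and kernels are identical -- only the values are random.
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "gpt2"  # placeholder; any causal LM id works
config = AutoConfig.from_pretrained(model_id)  # downloads only the config, not the weights

with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)  # no download, no memory allocation

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to_empty(device=device)  # allocate uninitialized weights on the target device
model.eval()

# Any forward pass now has the same latency/memory profile as the real checkpoint.
dummy_input = torch.randint(0, config.vocab_size, (1, 128), device=device)
with torch.no_grad():
    _ = model(dummy_input)
```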
For the score, my intuition was based on charts from the quantization papers, which showed minimal degradation in output quality.
Fortunately, this hypothesis is now confirmed by the evaluation of quantized models on the Open LLM Leaderboard; the general rule seems to be a <1% degradation in the average LLM score for both BnB and GPTQ 4-bit.
I will report that in the introduction text.
bnb.png
gptq.png

Personally, for me GPTQ 4-bit shows very obvious degradation in zero-shot text classification compared to GGUF/GGML Q5_K and above. It is never even able to follow my instruction to give a classification response, let alone get the classification right or wrong.
Here is the prompt:

You are an experienced review moderator. You filter and detect spam, inappropriate and non-sincere reviews, or those that are not helpful to potential buyers. You are being deployed on an e-commerce platform.
The text to be reviewed will be passed in as a JSON like {"text": string} and you will return a JSON that classifies this review as helpful or not helpful (spam or inappropriate). If it is helpful, return {"result": 1}, if it is not helpful, return {"result": 0}. Do not include anything else in your response, and follow strictly to the instructions given.
Review:
{"text": "this product is simply bad!!"}
Classification:

GGML/GGUF Q5_K and above on all 6 Llama-2-13B-based models I tested can correctly answer 0 in the format I wanted, but GPTQ 4-bit with group size 32 (the highest quality among 4-bit quants) always gives either the wrong answer (1) or something completely unrelated, like a random continuation of the review.
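For reference, this is roughly how I run the prompt against a GGUF quant with llama-cpp-python (a sketch only; the model path, context size and sampling settings are placeholders, not my exact setup):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-13b-chat.Q5_K_M.gguf",  # placeholder path to any Q5_K GGUF quant
    n_ctx=2048,
    n_gpu_layers=-1,  # offload all layers to the GPU if it fits
)

# The full moderation prompt quoted above, ending with "Classification:"
prompt = """You are an experienced review moderator. ...
Review:
{"text": "this product is simply bad!!"}
Classification:"""

out = llm(prompt, max_tokens=16, temperature=0.0, stop=["\n\n"])
print(out["choices"][0]["text"])  # expected: {"result": 0}
```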

Hugging Face Optimum org

Interesting!
Are you using the transformers / auto-gptq integration? And what dataset did you use for calibration?
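For reference, the integration I mean looks roughly like this (a sketch; the model id, group size and calibration dataset are just examples, and it assumes auto-gptq and optimum are installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-13b-hf"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# "c4" here is the calibration dataset; its choice can matter for downstream quality.
gptq_config = GPTQConfig(bits=4, group_size=32, dataset="c4", tokenizer=tokenizer)

# from_pretrained with a GPTQConfig runs the calibration and quantizes the weights.
quantized = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
quantized.save_pretrained("llama-2-13b-gptq-4bit-g32")
```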

I'm actually using TheBloke's quants with ExLlama.

Could we develop a more detailed comparison table for the quantized models? The current tables use the same evaluation metrics for all original/quantized/optimized models with the same name. This makes it challenging to discern the trade-offs between performance and accuracy, potentially rendering the leaderboard of limited use.

image.png

For example, the following figures make me believe that those five quantization methods differ only at the second decimal place, which unfortunately seems unrealistic.
image.png

Any update?
@IlyasMoutawwakil

Hugging Face Optimum org

Sorry for the late reply, I didn't see this one. The ** means that this is not the actual score, not that they "differ only at the second decimal place". I will write that down somewhere.
I would love to benchmark output quality, but unfortunately I don't have the compute budget for that (it needs a lot) 😅. Maybe I can add perplexity, since for now we only have pretrained models; from what I've heard it's not a very good metric, but it does allow comparing quantization methods (for the same model).
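For reference, the kind of perplexity check I have in mind would look roughly like this (a sketch with a sliding window; the model id, dataset slice, window length and stride are arbitrary choices for illustration, not what the leaderboard runs today):

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; the point is to run the same code on a quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Small slice of wikitext-2 to keep the example quick.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test[:10%]")["text"])
encodings = tokenizer(text, return_tensors="pt")

max_length, stride = 1024, 512  # arbitrary window and stride
seq_len = encodings.input_ids.size(1)

nll_sum, n_tokens, prev_end = 0.0, 0, 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # tokens in this window that haven't been scored yet
    input_ids = encodings.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # don't score the overlapping context tokens

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss  # mean NLL over scored tokens

    num_scored = (target_ids != -100).sum().item() - 1  # labels are shifted by one internally
    nll_sum += loss.item() * num_scored
    n_tokens += num_scored
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", math.exp(nll_sum / n_tokens))
```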
