Intuition for quality decrease after quantization
My intuition has been that larger models lose relatively less quality after quantization than smaller ones (e.g. Llama 2 70B in 4-bit would be closer to its original-precision model than Llama 2 7B in 4-bit is to its own).
Do you have any insight into whether that intuition holds for an MoE?
If only 2 of the 7B experts are active during inference, then based on the above I'd expect the relative quality loss after quantization to be higher than for, say, a quantized 45B non-MoE model.
Thank you in advance!
cc @marcsun13 who worked on the quantization!
Hi @krumeto, this is right. We've seen a quality loss comparable to that of a quantized Llama 7B.
Thank you, @marcsun13! Since I asked the question, the first Open LLM Leaderboard results for the base GPTQ version have appeared. The decrease seems to be more or less similar to what we saw with Llama 2 models:
Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
---|---|---|---|---|---|---|---|
mistralai/Mixtral-8x7B-v0.1 | 68.42 | 66.04 | 86.49 | 71.82 | 46.78 | 81.93 | 57.47 |
TheBloke/Mixtral-8x7B-v0.1-GPTQ | 65.7 | 65.19 | 84.72 | 69.43 | 45.42 | 81.14 | 48.29 |
Score ratio (GPTQ / base) | 0.960 | 0.987 | 0.980 | 0.967 | 0.971 | 0.990 | 0.840 |
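
For anyone who wants to redo the comparison for other quants: the last row of the table is simply the quantized score divided by the base score for each benchmark. A minimal sketch of that arithmetic, using only the numbers from the table (the function and variable names are mine, not from any library):

```python
# Scores copied from the table above (Open LLM Leaderboard).
base = {"Average": 68.42, "ARC": 66.04, "HellaSwag": 86.49, "MMLU": 71.82,
        "TruthfulQA": 46.78, "Winogrande": 81.93, "GSM8K": 57.47}
gptq = {"Average": 65.7, "ARC": 65.19, "HellaSwag": 84.72, "MMLU": 69.43,
        "TruthfulQA": 45.42, "Winogrande": 81.14, "GSM8K": 48.29}


def retained_fraction(base_scores, quant_scores):
    """Fraction of the base score kept after quantization, per benchmark."""
    return {name: quant_scores[name] / base_scores[name] for name in base_scores}


ratios = retained_fraction(base, gptq)
for name, ratio in ratios.items():
    # Relative drop is the complement of the retained fraction.
    print(f"{name:<11} {ratio:.3f}  ({(1 - ratio) * 100:.1f}% relative drop)")

# Benchmark that suffers most from quantization in this table.
worst = min(ratios, key=ratios.get)
```

In this table GSM8K is the outlier: every other benchmark keeps more than 96% of its base score.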
This is great news for us (we are still waiting for the GPTQ scores of the instruct model, but I hope this holds in general). We are testing the model with TGI in 8-bit (EETQ) and are waiting to test GPTQ, since there still seem to be some TGI issues with it. We are not sure which method retains the most quality (speed matters less to us). If you have any resources that compare EETQ, GPTQ, AWQ and bnb in terms of quality, for Mixtral or even for other models, they would be very helpful. This blog post was already extremely insightful: https://huggingface.co/blog/overview-quantization-transformers#overview-of-natively-supported-quantization-schemes-in-%F0%9F%A4%97-transformers
Thank you all!
Hi @krumeto, thanks for the awesome feedback. We are still working on the AWQ quant, since its quality is not good enough for now. For bnb, the quality should be the same. As for GPTQ, the model that was tested is not the best GPTQ quant. You can test the branch gptq-4bit-128g-actorder_True or gptq-4bit-32g-actorder_True, which should give better results, with 32g being the most accurate. However, VRAM consumption will increase, since these quants need to store more quantization statistics (an additional 1 GB for the 128g version and an additional 3.5 GB for the 32g version).
More details: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ
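
To pick one of these branches with transformers, the `revision` argument of `from_pretrained` selects the Hub branch. The snippet below is a rough sketch rather than a tested recipe: it assumes `transformers`, `accelerate` and a GPTQ backend (e.g. `optimum` with `auto-gptq`) are installed and that enough GPU memory is available. The helper names `load_kwargs` and `load_quantized` are made up for illustration.

```python
MODEL_ID = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"

# Branch names as given in the reply above.
BRANCHES = {
    "128g": "gptq-4bit-128g-actorder_True",
    "32g": "gptq-4bit-32g-actorder_True",  # most accurate, highest VRAM use
}


def load_kwargs(variant):
    """Build the from_pretrained arguments for the chosen branch (hypothetical helper)."""
    return {
        "pretrained_model_name_or_path": MODEL_ID,
        "revision": BRANCHES[variant],  # Hub branch to download
        "device_map": "auto",           # let accelerate place the layers
    }


def load_quantized(variant="32g"):
    """Download and load the tokenizer and the quantized model (needs GPU and network)."""
    # Imported here so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = load_kwargs(variant)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=kwargs["revision"])
    model = AutoModelForCausalLM.from_pretrained(**kwargs)
    return tokenizer, model
```

Calling `load_quantized("128g")` instead trades some accuracy for lower VRAM use, per the numbers above.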