I have a few questions about quantized model quality.

#5
by HannahKim - opened

Thank you so much for this amazing model! I'd like to ask some questions.

Is your 'Meta-Llama-3-8B-Instruct-GGUF' model simply converted into GGUF format from Meta's original 'Meta-Llama-3-8B-Instruct', or was there any additional tuning or extra processing?

I wonder:

  1. Will the model quality be exactly the same as yours if I convert 'Meta-Llama-3-8B-Instruct' into GGUF format myself?

  2. I heard that a quantized model often shows better quality than the original model. If that's true, is the 'Meta-Llama-3-8B-Instruct-GGUF' model better than the original 'Meta-Llama-3-8B-Instruct', or is it just the same?

Thank you so much for your effort.

If anyone knows the answer, help me out, plz!

The GGUF quants are just static quantizations. The IQuants, however, are not; they are much more aggressive and require more work and more complex techniques to achieve the same levels of coherence and de-braindeadedness that the static quants achieve. For this, IQuants use an imatrix. An imatrix (importance matrix) is a “map” of all of a model’s activations over a text corpus (such as wikitext-raw). During quantization, a pretrained imatrix can be used to guide the process so the model retains its coherence and abilities. However, it’s important to note and stress: the more aggressively a model is quantized (and the smaller it becomes), the more likely it is to end up less coherent and less capable overall. I hope this helps. 😁
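For anyone who wants to try this themselves, here is a minimal sketch of that imatrix workflow, driving llama.cpp's command-line tools from Python. The binary names (`llama-imatrix`, `llama-quantize`) and the file paths are assumptions; older llama.cpp builds ship the same tools as `imatrix` and `quantize`.

```python
# Sketch: build an importance matrix, then use it to make an aggressive IQ quant.
# Assumes llama.cpp is built and its binaries are on PATH; adjust names/paths as needed.
import subprocess

F16_GGUF = "Meta-Llama-3-8B-Instruct-f16.gguf"   # full-precision GGUF conversion (hypothetical filename)
CALIB_TEXT = "wikitext-raw.txt"                  # calibration corpus (hypothetical path)
IMATRIX = "imatrix.dat"

# 1. Record activations over the calibration text to produce the imatrix.
subprocess.run(["llama-imatrix", "-m", F16_GGUF, "-f", CALIB_TEXT, "-o", IMATRIX], check=True)

# 2. Quantize aggressively (IQ3_M here), letting the imatrix guide which weights
#    matter most so the small quant stays as coherent as possible.
subprocess.run(["llama-quantize", "--imatrix", IMATRIX, F16_GGUF,
                "Meta-Llama-3-8B-Instruct-IQ3_M.gguf", "IQ3_M"], check=True)
```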

Converting the model to GGUF and quantizing it yourself will yield exactly the same GGUF’ed model. There will be no difference between your files and this repo’s GGUF’ed model files. The only exception is if you trained and used a different imatrix for the quantization process, e.g. one built on wikitext-raw versus groups_merged.txt. 🤔
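If you want to sanity-check that for yourself, one rough approach is to hash your own quant against the repo's file. This is just a sketch with hypothetical filenames; keep in mind that GGUF metadata written by different converter versions can differ even when the tensor data is identical, so a hash mismatch does not automatically mean the weights differ.

```python
# Sketch: compare a locally made quant against the downloaded one by checksum.
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-GB GGUFs don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

mine = sha256sum("my-Meta-Llama-3-8B-Instruct-Q4_K_M.gguf")      # hypothetical filenames
theirs = sha256sum("repo-Meta-Llama-3-8B-Instruct-Q4_K_M.gguf")
print("identical" if mine == theirs else "different (check converter version / imatrix used)")
```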

An un-quantized model is always superior in quality and performance to a quantized model. If you want the maximum performance and the highest throughput from your model, you will always want to run it un-quantized. As a matter of fact, the only time you want to quantize a model is when you don’t have enough VRAM and RAM to run it un-quantized. 🤔
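As a rough guide to when that memory limit bites, here is a back-of-the-envelope estimate for the weights alone (the bits-per-weight figures are approximate, and real usage is higher once you add KV cache and runtime overhead):

```python
# Rough weights-only memory estimate: params * bits-per-weight / 8.
# Ignores KV cache, activations, and runtime overhead, so real usage is higher.
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

n = 8e9  # Llama-3-8B parameter count
for name, bpw in [("FP16 (unquantized)", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("IQ3_M", 3.7)]:
    print(f"{name:>20}: ~{weight_gb(n, bpw):.1f} GB")

# FP16 needs roughly 16 GB for the weights alone, which is why an 8B model
# usually gets quantized to fit on consumer GPUs.
```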

LM Studio Community org

@Joseph717171 is correct, with one tiny addition: for this model (and all other models on lmstudio-community besides the 70B, until it's reuploaded), all of the quant levels are made with imatrix, not just the i-quants.

@Joseph717171 Thank you so much! That helps a lot. Hope you have a wonderful day! 😉


> @Joseph717171 is correct, with one tiny addition: for this model (and all other models on lmstudio-community besides the 70B, until it's reuploaded), all of the quant levels are made with imatrix, not just the i-quants.

Can you clarify this for me, please? Are you saying that Q4_K (as an example) is a 4-bit K-quant that uses an imatrix, but is NOT an i-quant?

Also, thank you 😅😅😅

Yes, that is exactly right! What @bartowski is implying is that all the GGUF'ed quants are made using an imatrix, which means all the quantizations are now IQuants. (The imatrix is trained on groups_merged.txt.) 😁

LM Studio Community org

It's actually, more than anything, an unfortunate naming convention and timing issue.

i-quants != imatrix

i-quants are just a newer SOTA quantization technique that borrows ideas from QuIP#, and can be made without an imatrix

https://github.com/ggerganov/llama.cpp/pull/4773

imatrix is an importance matrix that can be used with any quant level, though it originally only targeted i-quants

Then the feature was expanded to target K-quants:

https://github.com/ggerganov/llama.cpp/pull/4930
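To make the distinction concrete, here is a hedged sketch (same assumptions about llama.cpp binary names as above): the same imatrix can guide either a classic K-quant or an i-quant type, which is why "made with an imatrix" and "is an i-quant" are independent properties.

```python
# Sketch: one imatrix, two different quant families.
# Assumes llama.cpp's llama-quantize is on PATH (older builds call it "quantize").
import subprocess

F16_GGUF = "Meta-Llama-3-8B-Instruct-f16.gguf"  # hypothetical filename
IMATRIX = "imatrix.dat"                          # e.g. trained on groups_merged.txt

# A K-quant made WITH an imatrix: still Q4_K_M, not an i-quant.
subprocess.run(["llama-quantize", "--imatrix", IMATRIX, F16_GGUF,
                "model-Q4_K_M.gguf", "Q4_K_M"], check=True)

# An i-quant type (IQ4_XS): what makes it an i-quant is the quantization
# format itself, not the fact that an imatrix was used.
subprocess.run(["llama-quantize", "--imatrix", IMATRIX, F16_GGUF,
                "model-IQ4_XS.gguf", "IQ4_XS"], check=True)
```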
