What is the meaning of these suffixes?

#3
opened by zenwangzy24

What is the meaning of these suffixes, like Q5_K?


I got this answer from ChatGPT; does it make sense?
- Q2_K, Q3_K_L, Q3_K_M, Q3_K_S: These appear to specify a version or configuration of the model. "Q" might stand for "Quarter" or another relevant metric, followed by a number that could indicate a version number or a sequence. "K" might represent a specific configuration or feature, and "L", "M", "S" might indicate different sizes or performance levels (e.g., Large, Medium, Small).
- Q4_0, Q4_1: Here, "Q4" might similarly indicate a version of the model, with the following numbers "0" and "1" potentially representing different iterations or variants of that version.
- Q5_0, Q5_1, Q5_K_M, Q5_K_S: Similarly, "Q5" represents another version, with "0" and "1" possibly being different iterations, and "K_M" and "K_S" indicating specific configurations or sizes.
- Q6_K, Q8_0: These are different version numbers again, with "Q6" and "Q8" potentially marking two different points in a sequence, and "K" and "0" possibly signifying specific configurations or iterations.

They are different levels of quantization.
Smaller Q numbers indicate heavier quantization (i.e. greater quality loss) but lower memory usage. K means the file uses llama.cpp's newer K-type quants, whereas variants without it, like Q4_0, use an older quant method.
The S, M, L (small, medium, large) suffixes just mean more or less quantization within that same level (e.g. Q3_K_S is quantized more heavily than Q3_K_L).
I'm not an expert in this field but I hope you get the idea.
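If it helps to make that concrete, here's a rough back-of-the-envelope sketch. The bits-per-weight numbers below are approximations I'm assuming for illustration, not exact figures; real GGUF files come out somewhat larger because block scales and a few higher-precision tensors (embeddings, output layer) add overhead.

```python
# Rough, illustrative sketch only: approximate GGUF file size from parameter
# count and an assumed bits-per-weight (bpw) for each quant level.
APPROX_BPW = {  # assumed rough values, for intuition only
    "Q2_K": 2.6, "Q3_K_S": 3.4, "Q3_K_M": 3.9, "Q3_K_L": 4.3,
    "Q4_0": 4.5, "Q4_K_M": 4.8,
    "Q5_0": 5.5, "Q5_K_S": 5.5, "Q5_K_M": 5.7,
    "Q6_K": 6.6, "Q8_0": 8.5,
}

def approx_size_gb(n_params: float, quant: str) -> float:
    """Approximate file size in GB for a model with n_params parameters."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9

for q in ("Q2_K", "Q4_K_M", "Q6_K", "Q8_0"):
    print(f"8B model at {q}: ~{approx_size_gb(8e9, q):.1f} GB")
```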

All the info is here: https://huggingface.co/docs/hub/gguf

When I run Meta-Llama-3-8B-Instruct.Q6_K.gguf in LM Studio, it shows up as Meta-Llama-3-7B-Instruct.Q6_K.
Why is that? Is that normal?

One thing I really miss about TheBloke's uploads is that he provided estimated VRAM usage for each quant type. Is there any way to determine that?

Quant Factory org

@x3v0 Not sure about that yet, but we'll see if we can include those estimations.

Quant Factory org

@x3v0 The VRAM requirement can be estimated as the size of the file you want to load plus some buffer for the context (1-2 GB should be fine). E.g. if you want to load Q2_K (3.18 GB) you would need approximately >= 4.18 GB of VRAM to run it.
I will try to include these in the model descriptions soon.
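
For what it's worth, here is a minimal sketch of that rule of thumb; the file name and the 1.5 GB buffer below are just assumptions for illustration:

```python
# Minimal sketch of the rule of thumb above: estimated VRAM ≈ size of the
# GGUF file on disk + a buffer for context / KV cache (assumed 1-2 GB here).
import os

def estimate_vram_gb(gguf_path: str, context_buffer_gb: float = 1.5) -> float:
    file_gb = os.path.getsize(gguf_path) / 1e9
    return file_gb + context_buffer_gb

# Hypothetical example:
# estimate_vram_gb("Meta-Llama-3-8B-Instruct.Q2_K.gguf")  # ~3.18 + 1.5 ≈ 4.7 GB
```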

Do you have any recommended version in terms of the tradeoff between quality loss and VRAM usage?

@cbML I usually go with Q6_K as the default, then in case of any trouble (like not enough VRAM or too-slow inference) I drop to Q5_K_M, Q5_K_S or Q4_K_M.

You can look at PPL drops caused by different quantization methods measured on Llama 2 70B here: https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md
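
In case PPL is unfamiliar: perplexity is just the exponential of the average negative log-likelihood per token on some evaluation text, so lower is better and heavier quantization typically raises it slightly. A tiny sketch with made-up numbers:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # token_logprobs: natural-log probabilities the model assigned to each
    # actual next token in the evaluation text (values below are made up).
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

print(round(perplexity([-1.2, -0.3, -2.0, -0.7]), 2))  # 2.86
```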
