Question about group size

#1
by nephepritou - opened

There is an opinion that a group size of 64 or 128 should give better quality. I wonder why 32 was chosen, and how I (or anyone else) could reproduce the same quantization process with a different group size. Is it possible? Would it be a waste of time? Or does it just cost too much despite being slightly better?

cyankiwi org

Hi @nephepritou , yes, a quantization group size of 64 or 128 is possible using llmcompressor. 32 was chosen because it was tested to give higher quality, at the cost of a larger quantized model size.

In addition, and specific to this model, tensor-parallel-size can be set to 2 without enable-expert-parallel at a quantization group size of 32.
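As a rough sketch of how one might re-run quantization with a different group size using llm-compressor: the modifier arguments, group-size knob, and model/dataset names below are illustrative assumptions, not this model's actual recipe, so check the llm-compressor documentation for the current API before running.

```python
# Hypothetical sketch only: names and arguments are illustrative
# assumptions, not the recipe actually used for this model.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",        # quantize all Linear layers
    scheme="W4A16",          # 4-bit weights, 16-bit activations
    ignore=["lm_head"],      # keep the output head unquantized
    # The group size is the knob under discussion; how it is set
    # (e.g. via config_groups or a scheme variant) depends on the
    # llm-compressor version -- consult its docs.
)

oneshot(
    model="path/to/base-model",   # placeholder path
    dataset="open_platypus",      # placeholder calibration set
    recipe=recipe,
    output_dir="quantized-w4a16-g64",
)
```

The main cost of experimenting here is calibration time and disk space per variant; the recipe itself changes very little between group sizes.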


AWQ / GPTQ quantization reduces model size by grouping weights and mapping them to lower-precision values using scaling factors.
A smaller group size results in higher accuracy because quantization parameters (scale and zero-point) are calculated for fewer weights at a time.
This allows the quantization grid to adapt more tightly to local value distributions and outliers.
By fitting the specific numerical range of a small cluster more precisely, the difference between the original and quantized weights (quantization error) is minimized, preserving more of the model's original accuracy.
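The effect described above can be demonstrated with a small self-contained experiment (symmetric per-group quantization over synthetic weights with injected outliers; this is a simplified illustration, not AWQ/GPTQ itself, which additionally use calibration data):

```python
import numpy as np

def quantize_groupwise(weights, group_size, n_bits=4):
    """Symmetric per-group quantization: each group of `group_size`
    weights shares one scale, so smaller groups track local ranges
    (and isolate outliers) more tightly."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 7 for 4-bit symmetric
    w = weights.reshape(-1, group_size)       # one row per group
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).reshape(weights.shape) # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
w[::97] *= 8.0                                # inject occasional outliers

for g in (32, 64, 128):
    mse = np.mean((w - quantize_groupwise(w, g)) ** 2)
    print(f"group_size={g:3d}  MSE={mse:.6f}")
```

With larger groups, a single outlier inflates the shared scale for more weights, so the mean-squared quantization error grows; smaller groups confine the damage, which is the accuracy/size trade-off discussed above.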
