brucethemoose/Yi-34B-200K-RPMerge-iMat.GGUF · Can i get a tldr of what exactly "imatrix quantization" is?

Heh, all this stuff in LLM land is horribly documented, and my meager model card is not helping. See: https://github.com/ggerganov/llama.cpp/pull/4861

In a nutshell, imatrix quantization uses example text to "calibrate" the quantization, so the limited precision gets spent on the weights that matter most for the model's output. It should be an all-around upgrade over regular (non-imatrix) GGUF quantizations.
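To make "calibrate" a bit more concrete, here is a toy sketch of the idea as I understand it: run example text through the model, record how large each input channel's activations are, and use that as a weight on the rounding error when picking quantization scales. Everything in it (the `quantize_row` helper, the random stand-in "activations", the 3-bit grid) is made up for illustration and is not llama.cpp's actual code.

```python
import numpy as np

def quantize_row(w, importance, bits=3, n_candidates=64):
    """Quantize one weight row, picking the scale that minimizes
    the importance-weighted squared rounding error."""
    qmax = 2 ** (bits - 1) - 1            # 3 for a signed 3-bit grid
    base = np.abs(w).max() / qmax         # plain max-abs scale
    best_scale, best_err = base, np.inf
    # Try a range of scales around the plain one and keep the best.
    for s in np.linspace(0.5 * base, 1.2 * base, n_candidates):
        q = np.clip(np.round(w / s), -qmax - 1, qmax)
        err = np.sum(importance * (w - q * s) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    q = np.clip(np.round(w / best_scale), -qmax - 1, qmax)
    return q * best_scale                 # dequantized weights

def weighted_error(w, w_hat, importance):
    return float(np.sum(importance * (w - w_hat) ** 2))

rng = np.random.default_rng(0)
# Stand-in for activations seen on calibration text: some channels are "louder".
x = rng.normal(size=(512, 32)) * np.linspace(0.1, 3.0, 32)
importance = (x ** 2).mean(axis=0)        # mean squared activation per channel
w = rng.normal(size=32)                   # one row of a weight matrix

w_plain = quantize_row(w, np.ones_like(w))   # no calibration: all channels equal
w_imat = quantize_row(w, importance)         # calibrated

print("weighted error, plain :", weighted_error(w, w_plain, importance))
print("weighted error, imat  :", weighted_error(w, w_imat, importance))
```

The calibrated version can never do worse than the plain one on the weighted error, since it searches the same scales with the metric it is judged on. Whether that translates into a better model depends on how representative the calibration text is.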

It is usable in the llama.cpp repo through the ./imatrix command. The process takes some time (hours, depending on how many layers you can offload to the GPU).
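For reference, a rough sketch of the two-step workflow, written as a small Python helper that only builds the command lines. The file names are placeholders, and the flags (`-m`, `-f`, `-o`, `-ngl`, `--imatrix`) are the ones I believe the llama.cpp tools accept, so check `--help` on your build before trusting it. Newer builds name the binaries `llama-imatrix` and `llama-quantize`.

```python
import shlex

def imatrix_cmd(model, calib_text, out="imatrix.dat", gpu_layers=0):
    """Command that computes the importance matrix from calibration text."""
    return ["./imatrix", "-m", model, "-f", calib_text,
            "-o", out, "-ngl", str(gpu_layers)]

def quantize_cmd(imatrix_file, model_in, model_out, qtype="IQ2_XS"):
    """Command that quantizes a model using a precomputed importance matrix."""
    return ["./quantize", "--imatrix", imatrix_file, model_in, model_out, qtype]

# Placeholder file names, purely for illustration.
step1 = imatrix_cmd("model-f16.gguf", "calibration.txt", gpu_layers=20)
step2 = quantize_cmd("imatrix.dat", "model-f16.gguf", "model-IQ2_XS.gguf")

print(shlex.join(step1))
print(shlex.join(step2))
# To actually run them: subprocess.run(step1, check=True), then step2.
```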

In addition, the IQ3_XXS and IQ2_XS files use a new, experimental quantization technique that should result in much better quality than the similarly sized Q2 quantizations, at the cost of some inference speed.

The quantization on this Hugging Face page is very experimental, because it uses nonstandard calibration data and an abnormally long context length for the imatrix run. This might help over regular imatrix quantization (particularly at long context), or it may mess the model up. *shrug*