Question about quantization settings

by AUTOMATIC - opened

Hi. There are discussions on our forum about how your quants do not perform well on long contexts. The stated reason is that you use the default context length of 2048 when quantizing. Looking at the docs at https://github.com/turboderp/exllamav2/blob/master/doc/convert.md, I'm guessing the relevant option is the --length parameter. Is that the case? Do you have any info on whether this setting has a noticeable effect during inference?
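
For context, this is my tentative reading of convert.md; the paths are placeholders and I may have the flag names wrong for your exllamav2 version, so treat it as a sketch only:

```python
# Tentative sketch of a conversion run with an explicit calibration row length,
# based on my reading of convert.md. Paths are placeholders; flag names may
# differ between exllamav2 versions.
import subprocess

subprocess.run(
    [
        "python", "convert.py",
        "-i", "/models/Mixtral-8x7B-Instruct-v0.1",           # unquantized source model (placeholder)
        "-o", "/tmp/exl2-work",                                # working/scratch directory (placeholder)
        "-cf", "/models/Mixtral-8x7B-Instruct-3.5bpw-exl2",    # finished quant output (placeholder)
        "-b", "3.5",                                           # target bits per weight
        "--length", "4096",                                    # calibration row length instead of the 2048 default
    ],
    check=True,
)
```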

@LoneStriker

In previous discussions with Turboderp on TheBloke's Discord, the calibration length did not make a difference in the quality (measured by perplexity) of the resulting models. The lower-quant models do tend to go off the rails (lose coherence) a bit more readily, but if there's a certain context length at which a model becomes incoherent, that's usually down to the model itself and not the quant. At what context length does this model start breaking down for you? I'll load it at a higher quant and test; at 6.0bpw and above, the quantization itself should have little to no impact.
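
For anyone curious what "measured by perplexity" means in practice, here is a minimal sketch of the metric itself. The tensors below are random placeholders; a real comparison would run the full-precision and quantized models over the same held-out text and compare the two numbers:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, token_ids: torch.Tensor) -> float:
    """Perplexity = exp(mean negative log-likelihood of each next token).

    logits:    [seq_len, vocab_size] model outputs over the whole sequence
    token_ids: [seq_len] the actual token ids of that sequence
    """
    # Score the logits at position i against the token that actually follows it.
    nll = F.cross_entropy(logits[:-1], token_ids[1:])
    return torch.exp(nll).item()

# Toy call with random tensors, just to show the shapes involved.
vocab_size, seq_len = 32000, 256
print(perplexity(torch.randn(seq_len, vocab_size),
                 torch.randint(0, vocab_size, (seq_len,))))
```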

I was largely fearmongered into using intervitens' rpcal quants, so I can't say with certainty whether the claim is true or not. I didn't see any visible anomalies with this checkpoint when I tested it. Generally, the concern is about low-bpw quants of Mixtral; 3.5bpw is what fits into my RTX 3090, and to run anything larger for Mixtral you'd need overpriced enterprise cards.
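
Rough back-of-envelope numbers for why 3.5bpw is about the ceiling on a single 24 GB card. The parameter count is the commonly cited ~46.7B total for Mixtral 8x7B, and real exl2 quants keep some tensors at higher precision and also need VRAM for the KV cache, so treat these as rough lower bounds:

```python
# Estimated weight sizes, assuming every weight is stored at the nominal bpw.
N_PARAMS = 46.7e9  # approximate total parameter count for Mixtral 8x7B

for bpw in (3.0, 3.5, 4.0, 6.0):
    weight_gb = N_PARAMS * bpw / 8 / 1e9
    print(f"{bpw} bpw -> ~{weight_gb:.1f} GB of weights")

# ~20 GB at 3.5 bpw just fits a single 24 GB RTX 3090 with room for the cache;
# ~23 GB at 4.0 bpw and ~35 GB at 6.0 bpw are where extra GPUs come in.
```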

There is not much science to the exl2 quants at this stage; if you largely stay with the defaults, one quant is going to be similar to the next, apart from the bpw. In general, you do not want to stray from the defaults unless you have a very good reason to do so; the gains would be minimal, and you could even degrade the quantized model. From the measurements that many people have done, 3.5bpw seems to be an inflection point where you get a nice jump in quality over the lower-bpw models.

If you want to run slightly higher-bpw quants, you can add a second GPU to your system (if you have the space, a spare PCIe slot, and the power connectors). Adding an 8 GB or 12 GB NVIDIA GPU will get you to the 4.0bpw models; 2x 3090s will get you to 6.0 and 7.0bpw Mixtral models or 4.0+ bpw 70B models.
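
If it helps, a two-GPU split can be specified when loading an exl2 model with the exllamav2 Python API. This is only a sketch with a placeholder path, and the exact load() signature may vary between versions:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-Instruct-v0.1-4.0bpw-h6-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
# gpu_split lists how many GB of VRAM to use on each device, in order.
# Leave a little headroom on each card for the KV cache and activations.
model.load(gpu_split=[10, 22])  # e.g. a 12 GB card plus a 24 GB 3090
```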

The 6.0bpw model seems to be very good. I tested it out past 10K context without issues (other than one reply around 5K context that stopped early and that I had to tell it to continue). The 3.5bpw model also seems to be coherent; I continued my conversation past 10K context and it responded as well as the 6.0bpw model did.
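
For anyone who wants to reproduce this kind of long-context test, here is a rough sketch based on exllamav2's bundled inference example; the model path and prompt are placeholders, and attribute or function names may differ slightly across versions:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-Instruct-v0.1-6.0bpw-h6-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 16384  # room for a 10K+ token chat plus the reply (sizes the KV cache)

model = ExLlamaV2(config)
model.load()  # or model.load(gpu_split=[22, 22]) to spread a 6.0bpw quant over two 24 GB cards

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9
settings.token_repetition_penalty = 1.05

long_prompt = "[INST] ...the long conversation would go here... [/INST]"  # placeholder
print(generator.generate_simple(long_prompt, settings, 512))
```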

I loaded the longest RP I had, 21K tokens, and asked the character questions about events from the beginning and the middle. I tested LoneStriker_Mixtral-8x7B-Instruct-v0.1-3.5bpw-h6-exl2 and intervitens_Mixtral-8x7B-Instruct-v0.1-3.5bpw-h6-exl2-rpcal. Both checkpoints mostly answered correctly (sometimes getting small details wrong), and both were able to continue the chat without degradation. It looks to me like the fears are unfounded, then.

Thanks for your work.
