Command used to quantize model

#1 by ndurkee

What command did you use to quantize the model? I don't fully understand how your quantizer works when the context length is longer.

I used this to measure:

python convert.py -nr -i /.../mistral-7b-instruct/ -o /.../ -c /.../wikitext_test.parquet -gr 100 -om /.../measurement.json
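Roughly, those flags break down like this (the paths are just placeholders, and the descriptions are how I read convert.py's arguments):

# -nr  start a fresh job (clears the working directory given by -o)
# -i   directory containing the original FP16 model
# -o   working directory for intermediate files
# -c   calibration dataset (a parquet file, wikitext here)
# -gr  number of calibration rows processed on the GPU (needs extra VRAM)
# -om  where to write the resulting measurement.json
python convert.py -nr -i <model_dir> -o <work_dir> -c <calibration.parquet> -gr 100 -om <measurement.json>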

Then to convert:

python convert.py -nr -i /.../mistral-7b-instruct/ -o /.../ -c /.../wikitext_test.parquet -gr 100 -m /.../measurement.json -cf /.../mistral-7b-instruct-exl2/4.65bpw/ -b 4.65
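The second pass reuses the same working directory and calibration set, loads the measurement back in, and writes out the finished EXL2 model. The flags that differ from the first pass (again, placeholder paths):

# -m   load the measurement.json from the first pass (instead of writing one with -om)
# -cf  directory to compile the finished EXL2 model into
# -b   target average bits per weight (4.65 bpw here)
python convert.py -nr -i <model_dir> -o <work_dir> -c <calibration.parquet> -gr 100 -m <measurement.json> -cf <output_dir> -b 4.65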

Note that -nr clears out the working directory (-o) in either command, so be a little careful with it. Don't do -nr -o ~ or something like that, or you'll have a bad day. The -gr 100 option requires a fair bit of VRAM, so you might want to omit it if you have less than 24 GB.
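One way to stay on the safe side with -nr is to point -o at a dedicated scratch directory that only ever holds conversion state (the path here is just an example):

mkdir -p /tmp/exl2-work    # scratch directory used only for conversion state, safe to wipe
python convert.py -nr -i <model_dir> -o /tmp/exl2-work -c <calibration.parquet> -om <measurement.json>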

As for the longer context length, it's not really critical to consider when quantizing the model. I find that using longer calibration rows doesn't have much of an effect, other than increasing the overall amount of calibration data. Other people have been experimenting and haven't measured any difference between -l 2048 -r 100 (the default) and -l 4096 -r 50, for instance. The key thing is that -l specifies the length of each sequence (in tokens) that's forwarded through the model in order to measure, and hopefully somewhat mitigate, the error introduced by quantization, while -r is the total number of such sequences.
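To put numbers on it, the total calibration data is just sequence length times row count, so those two settings feed exactly the same number of tokens through the model:

echo $((2048 * 100))   # -l 2048 -r 100 (default)  -> 204800 calibration tokens
echo $((4096 * 50))    # -l 4096 -r 50             -> 204800 calibration tokens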
