
Quantizations for Gryphe/MythoMax-L2-13b in the EXL2 format

| Quant | Mem 4k | Mem 4k8 | Δp | test wiki | test pippa |
|---|---|---|---|---|---|
| 4k_hb8_b8_pippa | 17.2GB | 15.7GB | -0.0183 | 5.7781 | 4.4221 |
| 4k_hb6_b5_pippa | 12.6GB | 11.0GB | -0.0197 | 6.0141 | 4.4252 |
| 2k_hb8_b8_pippa | 17.2GB | 15.7GB | -0.0246? | 5.7762 | 4.4238 |
| 2k_hb6_b5_pippa | 12.6GB | 11.0GB | -0.0121 | 6.0823 | 4.4363 |
| 4k_hb8_b8 | 17.2GB | 15.7GB | | 5.7459 | 4.4247 |
| 4k_hb6_b6 | 15GB | | | | |
| 4k_hb6_b5 | 12.6GB | 11.0GB | | 5.7699 | 4.4514 |
| 2k_hb8_b8 | 17.2GB | 15.7GB | | 5.7572 | 4.4242 |
| 2k_hb6_b4.125 | | | | | |

Breaking down the names (an example conversion command follows this list):

  • 4k means the quant was calibrated with a row length of 4096 rather than the default 2048
  • hb8 means the head (output layer) is kept at 8 bits
  • b8 means an average of 8.0 bits per model weight
  • pippa means the quant was calibrated on pippa-llama2-chat instead of wikitext
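
For reference, a quant named like 4k_hb8_b8_pippa would come from a conversion call roughly along these lines. This is a minimal sketch assuming exllamav2's convert.py and its -l/-b/-hb/-c options; the paths are placeholders and flag spellings may differ between exllamav2 versions.

```bash
# Sketch only: build a quant named like 4k_hb8_b8_pippa.
# Paths are placeholders; exllamav2 flag spellings may vary by version.
python convert.py \
    -i /models/MythoMax-L2-13b \
    -o /tmp/exl2-work \
    -cf /models/MythoMax-L2-13b-exl2-4k_hb8_b8_pippa \
    -c pippa-llama2-chat.parquet \
    -l 4096 \
    -b 8.0 \
    -hb 8
# -l 4096 -> "4k"  (calibration row length)
# -hb 8   -> "hb8" (head bit depth)
# -b 8.0  -> "b8"  (average bits per weight)
# -c ...  -> "pippa" (calibration dataset)
```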

Additional analysis:

  • Δp | (quant_perplexity - base_perplexity) as reported in stdout during quant creation. Unsure whether this is useful
  • test wiki | Perplexity reported by test_inference.py against wikitext with 4096 length (example invocation below)
  • test pippa | Perplexity reported by test_inference.py against pippa chat with 4096 length
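
As a rough illustration, the perplexity numbers above correspond to a test_inference.py run of this shape (a sketch assuming its -m/-ed/-el evaluation flags; the dataset filename is a placeholder):

```bash
# Sketch only: measure perplexity of a finished quant at 4096 length.
# Flag spellings and dataset paths are illustrative.
python test_inference.py \
    -m /models/MythoMax-L2-13b-exl2-4k_hb8_b8_pippa \
    -ed wikitext-test.parquet \
    -el 4096
```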

(Possible) Takeaways:

  • The difference between 2k and 4k 8-bit calibration is about 0.01% averaged across both tests for the pippa calibration and about 0.1% for the wiki calibration. Wiki 4k was also measured with 4x the row count, which likely accounts for part of the variance.
  • Relative to their 8-bit counterparts, the 2k 5-bit pippa quant degrades substantially more than the 4k 5-bit pippa quant. Larger calibration lengths (or possibly just more rows) are likely strongly preferable for aggressive quants.
  • Dropping from 8 to 5 bits results in a 4% increase in wiki perplexity for the pippa-calibrated quants, but only around 0.2% for the wiki-calibrated ones.
  • Both calibrations increase by less than 0.1% going from 8 to 5 bits when tested against pippa chat.

Caveats:

  • Real world usage may produce more visible differences than a hundredth of a percent on a small test
  • I have not tested using more calibration rows versus using a greater row length
  • I have not tested increasing the row count for the final quantization pass while reusing the same measurement
  • It's unclear if or how Δp should be used

With this rather superficial methodology, wikitext with 4k settings seems like the safest general-purpose quant in terms of the speed/quality tradeoff. However, real world usage would probably favor picking calibration datasets more closely related to the tune; all pippa-calibrated quants performed somewhat better on the pippa test than their wiki-calibrated counterparts.

All quantizations were calibrated with wikitext-2 unless otherwise specified.

You can run a model calibrated at 2k with a 4k context or vice versa. The actual difference between 2k and 4k calibrations appears to be very small.

VRAM estimates were taken with an extremely long chatlog in the oobabooga webui on a 7900 XTX, using nvtop to monitor PyTorch usage only. Systems with lots of extra background processes may use more. Additionally, NVIDIA-based systems with flash attention 2 will use less VRAM than estimated here.

The measurement files are provided in the main branch so you can make your own quants at other bit depths without going through the 2-3 hours of measuring.
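
As a sketch (assuming convert.py accepts an existing measurement via -m; filenames and paths here are placeholders), reusing the provided measurement file to build, say, a 6-bit quant could look like:

```bash
# Sketch only: skip the measurement pass by reusing a provided measurement file.
# Paths, filenames, and flag spellings are illustrative.
python convert.py \
    -i /models/MythoMax-L2-13b \
    -o /tmp/exl2-work \
    -cf /models/MythoMax-L2-13b-exl2-4k_hb6_b6 \
    -m measurement.json \
    -c wikitext-train.parquet \
    -l 4096 \
    -b 6.0 \
    -hb 6
```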
