# Quantizations for Undi95/ReMM-v2.2-L2-13B in the EXL2 format
Quant | VRAM @ 4k | VRAM @ 8k | wikitext | pippa1k | wtppnr-test | ARMS | ARMS 8k |
---|---|---|---|---|---|---|---|
fp16 | a lot | a lot | 6.19 | 7.53 | 11.77 | 3.48 | N/A |
h8_b8 | 15.8GB | 17.4GB | 6.24 | 6.94 | 11.58 | 3.44 | 3.13 |
h6_b5 | 11.1GB | 12.8GB | 6.26 | 6.93 | 11.26 | 3.45 | 3.13 |
h8_b3.5 | 8.9GB | 10.5GB | 6.54 | 6.94 | 11.31 | 3.52 | 3.26 |
h6_b3.5 | 8.9GB | 10.5GB | 6.57 | 6.94 | 11.71 | 3.52 | 3.26 |
h6_b3.5_default | 8.9GB | 10.5GB | 6.51 | 7.01 | 11.31 | 3.62 | 3.37 |
h6_b3.5_rand | 8.9GB | 10.5GB | 6.81 | 7.19 | 11.73 | 3.75 | 3.46 |
h6_b2.4 | 7.2GB | 8.9GB | 9.60 | 7.46 | 15.91 | 4.51 | 4.50 |
## Information
- All quantizations were measured and quantized using the appropriate files. These custom sets are a 1/3 1/3 1/3 mixture of wikitext2, pippa, and no robots (herein "wtppnr"), each with sizes adjusted to the default exl2 measurement and quantization lengths @ 4k.
- The quantization was done with default settings, 4096 rows, and an NTK alpha of 2.5.
- VRAM estimates were taken with an extremely long chatlog in the oobabooga webui on a 7900 XTX, using nvtop to monitor PyTorch usage only. Systems with lots of extra background processes may use more. Additionally, NVIDIA-based systems with flash attention 2 will use less VRAM than estimated here.
- The measurement files and parquets are provided in the main branch, so you can make your own quants at other bit depths without going through 20-30 minutes of measuring.
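If you want to rebuild a wtppnr-style calibration set yourself, the 1/3 1/3 1/3 recipe is mostly bookkeeping. A minimal sketch, assuming each source has already been loaded as a list of text rows (the argument names and row budget are placeholders, not the exact files used for these quants; in practice each third's size is adjusted to exl2's default measurement/quantization lengths, and the result is written out as a parquet, e.g. with `pandas.DataFrame({"text": rows}).to_parquet(...)`):

```python
def mix_thirds(wikitext_rows, pippa_rows, norobots_rows, total_rows):
    """Build an equal-thirds calibration mixture (hypothetical sketch).

    Rows are taken from the front of each source; a held-out test set in
    the style of wtppnr-test would index from the *end* of the same data
    to avoid contamination.
    """
    per_source = total_rows // 3
    sources = (wikitext_rows, pippa_rows, norobots_rows)
    if any(len(src) < per_source for src in sources):
        raise ValueError("a source is too small to contribute an equal third")
    # Concatenate one equal slice per source, in a fixed order.
    return [row for src in sources for row in src[:per_source]]
```
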
## Analysis
I was curious about the effect a calibration dataset has on quantization after observing other huggingface users' quantizations outperforming their fp16 counterparts on reported perplexity scores. This led me to create the wtppnr calibration files in an attempt to reduce overfit and benchmark contamination.
Running through the benchmarks, we have the following:
- **wikitext** (aka wikitext-v2-raw): run with a higher stride because I'm lazy. Hypothetically this is something like 0.2% contaminated, as I used a few hundred kilobytes from it in wtppnr.
- **pippa1k**: simply the first 1000 rows from the folded pippa chat parquet. This is 100% contaminated, as the quantization set uses > 1000 rows.
- **wtppnr-test**: the same size as wtppnr-measure, except indexing from the end to avoid data contamination.
- **ARMS**: a private evaluation set consisting of cherrypicked excerpts from my own logs, presented with full context as it would appear to the model during inference.
- **ARMS 8k**: ARMS again, but evaluated with full 8k context. This dataset is small enough that it only takes 5 minutes per model...
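For reference, the strided wikitext-style evaluation can be sketched like this; `score_window` stands in for whatever model call returns per-token negative log-likelihoods (the function name and setup are illustrative, not the exact eval script used for the table above). A larger stride means fewer overlapping windows, so it runs faster, but counted tokens get less left context, which nudges perplexity slightly upward:

```python
import math

def strided_ppl(tokens, ctx_len, stride, score_window):
    """Strided perplexity sketch.

    score_window(window) is assumed to return one negative log-likelihood
    (in nats) per token of the window. Each token is counted exactly once:
    only the portion of each window past the previous window's end
    contributes to the average.
    """
    nlls = []
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + ctx_len, len(tokens))
        target_len = end - prev_end          # tokens not yet counted
        window_nlls = score_window(tokens[begin:end])
        nlls.extend(window_nlls[-target_len:])
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(sum(nlls) / len(nlls))
```
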
Update: Additionally, I re-made the h6 b3.5 with two different datasets: random tokens and the exl2 default. I also made 2.4b and 5b randoms, but the 2.4b exploded and the 5b wasn't much different from the 5b wtppnr.
So, the numbers... are all over the place. wikitext shows them in order of bit depth, as expected. pippa1k clearly shows contamination, also as expected. But then it gets weird with wtppnr-test, which is seemingly in an arbitrary order. Why are the 8-bit and fp16 in the middle? Who knows. Maybe it's just variance from the good ole AMD Memory Access Faults™.
Finally, my homebrew ARMS dataset shows some interesting results, suggesting the model benefits from the calibration fitting enough to offset the quantization losses at the first few bit levels.
Update: It seems the random-tokens-for-maximum-generalization theory is defunct, as the model was just bad. The default, however, was close to my wtppnr dataset and actually surpassed it on full wikitext. I'm not even going to acknowledge it getting the best wtppnr-test score, as those numbers are just outright schizophrenic.
## What are the conclusions?
To be honest, not many. The TL;DR is that advanced quantization quickly becomes a form of mini-training and should be treated as such.
- Pick and curate the calibration. Don't benchmark against possibly contaminated sets.
- On paper, the 8 vs 6 bit header doesn't seem that important, despite what I've seen where some users do 6.5bpw with 6-bit headers while others insist on 8-bit headers all the way down to 2.5bpw.
- It's possible that overfitting is more of an issue at lower bit depths, leading to the jumbled mess that is the wtppnr-test set, as all my calibrations were done with twice the default rows.
- Very low bit depths are extremely sensitive in general. Notice that all of the quant levels improved greatly at 8k context except the 2.4bpw.

Update:
- Don't use random tokens even if it seems tempting.
- The built-in exl2 dataset looks pretty decent. It must be new, as I didn't even realize it existed until after I made wtppnr.
- Given how little difference datasets seem to make at 5bpw+, I would personally just use the exl2 default settings for 20B and smaller models.
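For anyone acting on that last point, here's a small helper for wiring up exllamav2's convert.py with its defaults while reusing a saved measurement file. The flag names follow the exllamav2 repo's convert.py as I remember them, so treat this as a sketch and double-check against `convert.py --help` in your checkout:

```python
def build_convert_cmd(model_dir, work_dir, out_dir, bits,
                      head_bits=6, measurement=None):
    """Assemble an exllamav2 convert.py invocation (hypothetical helper).

    Omitting a calibration flag leaves convert.py on its built-in default
    dataset, which per the results above is fine at 5bpw+.
    """
    cmd = ["python", "convert.py",
           "-i", model_dir,        # fp16 source model directory
           "-o", work_dir,         # scratch/working directory
           "-cf", out_dir,         # compiled output directory
           "-b", str(bits),        # target bits per weight
           "-hb", str(head_bits)]  # header (output layer) bits
    if measurement:
        # Reusing a saved measurement.json skips the 20-30 minute measuring pass.
        cmd += ["-m", measurement]
    return cmd
```

Run it from the exllamav2 repo root with `subprocess.run(build_convert_cmd(...), check=True)`.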