# Quantizations for Undi95/ReMM-v2.2-L2-13B in the EXL2 format
Quant | VRAM @ 4k | VRAM @ 8k | wikitext | pippa1k | wtppnr-test | ARMS | ARMS 8k |
---|---|---|---|---|---|---|---|
fp16 | a lot | a lot | 6.19 | 7.53 | 11.77 | 3.48 | N/A |
h8_b8 | 15.8GB | 17.4GB | 6.24 | 6.94 | 11.58 | 3.44 | 3.13 |
h6_b5 | 11.1GB | 12.8GB | 6.26 | 6.93 | 11.26 | 3.45 | 3.13 |
h8_b3.5 | 8.9GB | 10.5GB | 6.54 | 6.94 | 11.31 | 3.52 | 3.26 |
h6_b3.5 | 8.9GB | 10.5GB | 6.57 | 6.94 | 11.71 | 3.52 | 3.26 |
h6_b3.5_default | 8.9GB | 10.5GB | 6.51 | 7.01 | 11.31 | 3.62 | 3.37 |
h6_b3.5_rand | 8.9GB | 10.5GB | 6.81 | 7.19 | 11.73 | 3.75 | 3.46 |
h6_b2.4 | 7.2GB | 8.9GB | 9.60 | 7.46 | 15.91 | 4.51 | 4.50 |
## Information
- All quantizations were measured and quantized using the appropriate files. These custom sets are a 1/3 1/3 1/3 mixture of wikitext2, pippa, and no robots (herein "wtppnr"), each with sizes adjusted to the default exl2 measurement and quantization lengths @ 4k.
- The quantization was done with default settings, 4096 rows, and an NTK alpha of 2.5.
- VRAM estimates were taken with an extremely long chatlog in the oobabooga webui on a 7900 XTX, using nvtop to monitor PyTorch usage only. Systems with lots of extra background processes may use more. Additionally, NVIDIA-based systems with flash attention 2 will use less VRAM than estimated here.
- The measurement files and parquets are provided in the main branch, so you can make your own quants at other bit depths without going through 20-30 minutes of measuring.
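If you want to rebuild a wtppnr-style calibration set yourself, the 1/3 1/3 1/3 recipe is mostly bookkeeping. A minimal sketch, assuming each source has already been loaded as a list of text rows (the argument names and row budget are placeholders, not the exact files used for these quants; in practice each third's size is adjusted to exl2's default measurement/quantization lengths, and the result is written out as a parquet, e.g. with `pandas.DataFrame({"text": rows}).to_parquet(...)`):

```python
def mix_thirds(wikitext_rows, pippa_rows, norobots_rows, total_rows):
    """Build an equal-thirds calibration mixture (hypothetical sketch).

    Rows are taken from the front of each source; a held-out test set in
    the style of wtppnr-test would index from the *end* of the same data
    to avoid contamination.
    """
    per_source = total_rows // 3
    sources = (wikitext_rows, pippa_rows, norobots_rows)
    if any(len(src) < per_source for src in sources):
        raise ValueError("a source is too small to contribute an equal third")
    # Concatenate one equal slice per source, in a fixed order.
    return [row for src in sources for row in src[:per_source]]
```
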
## Analysis
I was curious about the effect a calibration dataset has on quantization after observing other huggingface users' quantizations outperforming their fp16 counterparts on reported perplexity scores. This led me to create the wtppnr calibration files in an attempt to reduce overfit and benchmark contamination.
Running through the benchmarks, we have the following:
- **wikitext** (aka wikitext-v2-raw): run with a higher stride because I'm lazy. Hypothetically this is something like 0.2% contaminated, as I used a few hundred kilobytes from it in wtppnr.
- **pippa1k**: simply the first 1000 rows from the folded pippa chat parquet. This is 100% contaminated, as the quantization set uses > 1000 rows.
- **wtppnr-test**: the same size as wtppnr-measure, except indexing from the end to avoid data contamination.
- **ARMS**: a private evaluation set consisting of cherrypicked excerpts from my own logs, presented with full context as it would appear to the model during inference.
- **ARMS 8k**: ARMS again, but evaluated with full 8k context. This dataset is small enough that it only takes 5 minutes per model...
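For reference, the strided wikitext-style evaluation can be sketched like this; `score_window` stands in for whatever model call returns per-token negative log-likelihoods (the function name and setup are illustrative, not the exact eval script used for the table above). A larger stride means fewer overlapping windows, so it runs faster, but counted tokens get less left context, which nudges perplexity slightly upward:

```python
import math

def strided_ppl(tokens, ctx_len, stride, score_window):
    """Strided perplexity sketch.

    score_window(window) is assumed to return one negative log-likelihood
    (in nats) per token of the window. Each token is counted exactly once:
    only the portion of each window past the previous window's end
    contributes to the average.
    """
    nlls = []
    prev_end = 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + ctx_len, len(tokens))
        target_len = end - prev_end          # tokens not yet counted
        window_nlls = score_window(tokens[begin:end])
        nlls.extend(window_nlls[-target_len:])
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(sum(nlls) / len(nlls))
```
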
Update: Additionally, I re-made the h6 b3.5 with two different datasets: random tokens and the exl2 default. I also made 2.4b and 5b randoms, but the 2.4b exploded and the 5b wasn't much different from the 5b wtppnr.
So, the numbers... are all over the place. wikitext shows them in order of bit depth, as expected. pippa1k clearly shows contamination, also as expected. But then it gets weird with wtppnr-test, which is seemingly in an arbitrary order. Why are the 8-bit and fp16 in the middle? Who knows. Maybe it's just variance from the good ole AMD Memory Access Faults™.
Finally, my homebrew ARMS dataset shows some interesting results, suggesting the model benefits from the calibration fitting enough to offset the quantization losses at the first few bit levels.
Update: It seems the random-tokens-for-maximum-generalization theory is defunct, as the model was just bad. The default, however, was close to my wtppnr dataset and actually surpassed it on full wikitext. I'm not even going to acknowledge it getting the best wtppnr-test score, as those numbers are just outright schizophrenic.
## What are the conclusions?
To be honest, not many. The TL;DR is that advanced quantization quickly becomes a form of mini-training and should be treated as such.
- Pick and curate the calibration. Don't benchmark against possibly contaminated sets.
- On paper, the 8 vs 6 bit header doesn't seem that important, despite what I've seen where some users do 6.5bpw with 6-bit headers while others insist on 8-bit headers all the way down to 2.5bpw.
- It's possible that overfitting is more of an issue at lower bit depths, leading to the jumbled mess that is the wtppnr-test set, as all my calibrations were done with twice the default rows.
- Very low bit depths are extremely sensitive in general. Notice that all of the quant levels improved greatly at 8k context except the 2.4bpw.

Update:
- Don't use random tokens even if it seems tempting.
- The built-in exl2 dataset looks pretty decent. It must be new, as I didn't even realize it existed until after I made wtppnr.
- Given how little difference datasets seem to make at 5bpw+, I would personally just use the exl2 default settings for 20B and smaller models.
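For anyone acting on that last point, here's a small helper for wiring up exllamav2's convert.py with its defaults while reusing a saved measurement file. The flag names follow the exllamav2 repo's convert.py as I remember them, so treat this as a sketch and double-check against `convert.py --help` in your checkout:

```python
def build_convert_cmd(model_dir, work_dir, out_dir, bits,
                      head_bits=6, measurement=None):
    """Assemble an exllamav2 convert.py invocation (hypothetical helper).

    Omitting a calibration flag leaves convert.py on its built-in default
    dataset, which per the results above is fine at 5bpw+.
    """
    cmd = ["python", "convert.py",
           "-i", model_dir,        # fp16 source model directory
           "-o", work_dir,         # scratch/working directory
           "-cf", out_dir,         # compiled output directory
           "-b", str(bits),        # target bits per weight
           "-hb", str(head_bits)]  # header (output layer) bits
    if measurement:
        # Reusing a saved measurement.json skips the 20-30 minute measuring pass.
        cmd += ["-m", measurement]
    return cmd
```

Run it from the exllamav2 repo root with `subprocess.run(build_convert_cmd(...), check=True)`.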